Is Cloud Computing doomed or did EC2 “over optimise”?
Something really interesting has been happening over the last few weeks: the latency on Amazon’s EC2 cloud has been increasing. Check out this graph from Cloudkick:

Why does it matter?
The time distance two machines are away from one another drastically affects the performance of any service-orientated architecture. Whether it is simply connecting to the database, or calling a webservice, the time it takes for a packet of data to move over the network is a delay you have no choice but to swallow. In normal circumstances, a ping between two internal nodes within Amazon is around the 0.3ms level, with the odd ping reporting a whopping 7ms every 30 or so packets. Completely within operational parameters and what you would expect within an internal network.
We have discovered though, that when our instances appear to be dying or at least shaky, then this network latency jumps up to a whopping 7241ms (yes, 7 seconds to move a packet around internally).
Alan Williamson posted. He goes on:
As you can appreciate, this has some considerable knock-on effects to the rest of our system. Everything grinds to a halt. Now I do not believe for a moment, this is the real network delay, but more likely the virtual operating system under extreme load and not able to process the network queue. This is evident from the fact that many of the pings never came back at all.
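To make that kind of comparison concrete, here is a minimal Python sketch of how you might summarise a batch of ping results, counting lost packets separately from slow ones. The function name `summarise_pings`, the 5ms spike threshold and the sample values are my own illustrative assumptions, loosely shaped like the figures quoted above; they are not real measurements from EC2.

```python
from statistics import median


def summarise_pings(rtts_ms, spike_threshold_ms=5.0):
    """Summarise round-trip times in milliseconds.

    rtts_ms holds one entry per packet sent; None marks a packet
    that never came back. spike_threshold_ms is an assumed cut-off
    for calling a reply "slow".
    """
    answered = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(answered)
    return {
        "sent": len(rtts_ms),
        "loss_pct": 100.0 * lost / len(rtts_ms) if rtts_ms else 0.0,
        "median_ms": median(answered) if answered else None,
        "max_ms": max(answered) if answered else None,
        "spikes": sum(1 for r in answered if r > spike_threshold_ms),
    }


# Illustrative samples only (hypothetical, not captured data):
# a healthy link with one 7ms blip in 30 packets, and a shaky
# instance with a multi-second reply and several lost packets.
healthy = [0.3] * 29 + [7.0]
degraded = [0.3, 7241.0, None, None, 0.4, None]

print(summarise_pings(healthy))
print(summarise_pings(degraded))
```

Tracking loss alongside the median and the maximum matters here: a guest too loaded to service its network queue shows up as missing replies as much as slow ones, which a simple average would hide.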
Then someone reminded me when Amazon launched spot pricing: around the 14th of December. Now have a look at that graph again:

A week after the launch of spot pricing, latency increases massively and doesn’t come back.
It feels like the cloud is having trouble scaling…
With spot pricing Amazon has managed to oversubscribe their network:
Here’s the kicker: over-subscription is not the same thing as over capacity. BY DESIGN, modern data/telecommunication (and Cloud) networks are built using an over-subscription model.
On the other hand, the sad truth is that we will have over capacity issues in cloud; it’s simply a sad intersection of the laws of physics and the delicate balance associated with cost control and service delivery.
Let me frame the following with an example: when you purchase an “unlimited data plan” from a telco or hosting company, you’ll notice normally that this does not have latency or throughput figures attached to it…same with Cloud. You shouldn’t be surprised by this. If you are, you might want to rethink your approach to service level expectation.
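The distinction between the two terms can be put into numbers. The Python sketch below is only an illustration under my own assumptions: the function `link_status` and the figures (a 1000 Mbps link with 4000 Mbps sold against it) are hypothetical and say nothing about how Amazon actually provisions its network.

```python
def link_status(capacity_mbps, sold_mbps, demand_mbps):
    """Compare what was sold and what is being used against real capacity.

    capacity_mbps: what the link can physically carry
    sold_mbps:     the sum of what customers were promised
    demand_mbps:   what customers are trying to push right now
    """
    ratio = sold_mbps / capacity_mbps
    utilisation = demand_mbps / capacity_mbps
    return {
        "oversubscription_ratio": ratio,
        "oversubscribed": ratio > 1.0,        # sold more than exists
        "utilisation": utilisation,
        "over_capacity": utilisation > 1.0,   # demand exceeds what exists
    }


# Hypothetical numbers: the same 4:1 oversubscribed link on a quiet
# day and on a busy day.
quiet = link_status(capacity_mbps=1000, sold_mbps=4000, demand_mbps=600)
busy = link_status(capacity_mbps=1000, sold_mbps=4000, demand_mbps=1200)

print(quiet)
print(busy)
```

In both cases the link is oversubscribed by design; only in the busy case is it over capacity, and that is the point at which customers would start to see queues and latency.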
Hoff concludes:
So, wrapping this up, I have to accept AWS’ statement that they “…do not have over-capacity issues,” because quite frankly there’s nothing to suggest otherwise. That’s not to say there aren’t performance issues related to something else (like software or hardware in the stack), but that’s not the same as being over capacity, and you’ll notice that they didn’t say they were not “over-subscribed” but rather they were not “over capacity.”
/Hoff
*Just ask AT&T about their network and the iPhone. This *is* a case where their over-subscription planning failed in the face of capacity…and continues to.
Ah, back to giving AT&T a hard time. Happy days!
Anyway, what Amazon should have done is price their spot instances in line with the performance their existing customers expect, not only in line with demand, as they have finite supply at the moment. Check out the nifty Cloud broker app which is tracking the spot prices to give you an understanding of the price differential between standard, reserved and spot prices.
The next thing I think Amazon needs to do is start becoming a marketplace for compute. They’ll make much more money on the marketplace than having to do it all by themselves and will gain access to much more supply of compute. If I were in their team I would be working out how to give their cloud operating system for free to as many providers as possible and introduce a marketplace. Then focus on innovating at the top of the stack and keeping customers happy and buying from them.
Despite the doom-laden predictions, the cloud isn’t doomed. We do, however, need to pay more attention to the detail. IaaS is not the panacea in the long term; PaaS is. In the meantime we’ll continue to struggle towards PaaS with legacy apps on IaaS and the problems they bring.
My 2 cents worth anyway
