Viodi View – The Bridge Between the Heartland and Silicon Valley

Amazon’s EC2 Outage Proves Cloud Failure Recovery is a Myth!

Amazon’s Cloud Computing failures began early Thursday morning and continued into Friday, April 22nd. Affected web sites included Quora.com, Reddit.com, GroupMe.com and Scvngr.com, which all posted messages to their visitors about the issue. Most of those web sites were inaccessible for hours, and others were only partly operational.

Companies use Amazon’s cloud-based service (known as Elastic Compute Cloud or “EC2”) to host their Web sites and applications and to store data. Amazon’s customers include start-ups like the social networking site Foursquare but also big companies like Pfizer, Netflix and Nasdaq. EC2 is designed to cope with giant traffic spikes of the type Amazon experiences during its pre-Christmas shopping rush each December. Today, Amazon said that a networking glitch made its storage volumes automatically create back-ups of themselves, filling up storage capacity and causing connectivity issues.

For several years, we’ve been pounding the table about many of the potential problems that Cloud Service Providers have “swept under the rug” or ignored. These include failure/disaster recovery, SLAs, security, lack of standards, vendor lock-in, etc. During the session “Everyone Can Now Afford a Disaster Recovery Center*” at the 2011 Cloud Connect conference, the speaker stated that disaster recovery could be solved by transferring workloads from affected cloud data centers to other data centers owned by the same Cloud Service Provider. He gave an example in which cloud outages in San Francisco, CA resulted in jobs being transferred to cloud data centers in London, England, with the data being replicated there. I strongly challenged the speaker about the complexity of doing this, in particular the resulting extra processing load on the servers in London, data replication issues and network bandwidth saturation. He danced around those problems and remained confident that disaster recovery would actually be a strength, rather than a weakness, of cloud computing. I wonder what he thinks now, after the Amazon cloud outages?

* The session, “Everyone Can Now Afford a Disaster Recovery (DR) Center,” was supposed to detail the ways in which cloud computing has disrupted the cost dynamics of disaster recovery. “The economics of cloud computing have changed the disaster recovery game, allowing everyone to afford a DR center and pay for DR services only when they are needed. Attendees of this session will learn about new strategies for data protection and disaster recovery in the Cloud.”

“Cloud computing is revolutionizing Disaster Recovery,” said Dr. Ian Howells, CMO at StorSimple. “The natural advantages of the cloud being available from anywhere with high availability, elasticity and utility billing make it ideal for next generation Disaster Recovery strategies that are now affordable for widespread usage. What is needed is a framework to optimize content, data and application movement between the cloud and on-premises infrastructure.”

After Amazon’s cloud went dark, one location-based service provider (SCVNGR) tweeted: ‘The sky is falling! Amazon’s cloud seems to be down (raining?) so we’re experiencing some issues too. Be back soon!’ Foursquare and a number of other social media sites hosted by Amazon’s cloud were also forced to post apology notices.

Here are a few comments from executives who have considered EC2 and cloud computing:

“We don’t think the cloud is enterprise-ready,” said Jimmy Tam, general manager of Peer Software, which provides data backup for businesses. “Are you really going to trust your corporate jewels to these cloud providers?”
“Clearly you’re not in control of your data, your information,” said Campbell McKellar, founder of Loosecubes, a Web site for finding temporary workspace that was not available Thursday. “It’s a major business interruption. I’m getting business interruption insurance tomorrow, believe me, and maybe we get a different cloud provider as a backup.”
As Ben Parr from Mashable pointed out, the event revealed that Amazon’s cloud redundancies failed to stop a mass outage. “Its Availability Zones are supposed to be able to fail independently without bringing the whole system down. Instead, there was a single point of failure that shouldn’t have been there,” said Parr.
Cheezburger Network CEO Ben Huh said that outages like this can be a learning opportunity for companies. “It’s not a catastrophe unless something valuable (like user data) was lost,” said Huh. “It’s an opportunity to learn about the service provider’s weakness and how to design more stable, reliable systems. Services recover very quickly from outages as long as they are relatively short. Long-term outages are another beast.”

We could only find one voice that extolled the benefits of cloud computing, in light of Amazon’s massive failure:

“The benefits of the cloud are significant,” said Jeff Janer, chief executive of Springpad, a service that people use to save items online, which went offline as a result of Amazon’s problem. “Amazon as a resource for a company like ours makes an awful lot of sense. We’re just all keeping our fingers crossed that they get back as quickly as possible.”

We invite comments from Viodi View readers regarding their experiences and/or opinions about cloud failure recovery and other vulnerabilities.

For the latest status on Amazon’s Web Services by region, please visit their Service Health Dashboard: http://status.aws.amazon.com/

Author Alan Weissberger

Posted April 22, 2011

Alan Weissberger, Cloud Computing

Tags:

amazon, cloud computing

Comments

14 responses to “Amazon’s EC2 Outage Proves Cloud Failure Recovery is a Myth!”

Alan Weissberger

April 22, 2011

Cloud Security is an even bigger risk than Cloud failure recovery. IMHO, there is no real security in the cloud. Threats and vulnerabilities are emerging faster than they can be detected and there are no viable solutions yet! For the latest on Cloud Security see article + comments at: http://community.comsoc.org/blogs/ajwdct/demythifying-cyber-security-ieee-comsocscv-april-19th-meeting-summary

William Hugh Murray

April 22, 2011

Murphy strikes again. This is not a failure mode that I would have had on my list.

I am reminded of NASDAQ Trumbull. Squirrel trips transformer. External power goes off. Battery takes the load. Engine generators kick in. Not so much as a hiccup. Hour and a half later external power comes on. Controls go into deadly embrace. System fails and will not come back up. (Application moved to alternate site in Rockville, MD.)

Every list of failure modes must end in “other.” One can only hope that the strategy for the explicit items on the list will also work for “other.”

While I agree that IT does not do SLAs well, this problem is not specific to the cloud.

anonymous

April 22, 2011

Thanks for a very insightful article. Especially enjoyed your story about the Disaster Recovery session at the 2011 Cloud Connect conference.

IEEE Discussion List Member

April 22, 2011

There are too many variables involved in any cloud environment to have certainty, with any statistical validity, that ALL threats are covered…very similar to the space shuttle survivability dilemma…so, SLAs will likely never be granted that have any teeth in them, or anywhere close to 5 nines of availability…just my thoughts.

TiE Member

April 23, 2011

Great article! Does anyone seriously believe in cloud disaster recovery after Amazon’s EC2 failure with no effective backup to quickly restore cloud services?

2011 Cloud Connect Conference session: “Everyone Can Now Afford a Disaster Recovery Center,” Ian Howells, CMO, StorSimple http://www.cloudconnectevent.com/cloud-computing-conference/data-and-storage.php

“The cloud is a once-in-a-decade disruption that completely changes the cost dynamics of disaster recovery. Cloud economics means everyone can now afford a disaster recovery center and pay for disaster recovery only when they need it. This presentation will look at high-availability and disaster recovery strategies that exploit the cloud and cloud economics. Topics covered will include:
-The evolution to cloud computing
-Why use cloud computing
-Why use cloud computing for disaster recovery
-Cloud backup, snapshot and disaster recovery strategies
-A framework for a cloud disaster recovery”

IEEE Discussion Group Member

April 24, 2011

Great article…Many questions arise:

What is the fallout/impact from Amazon’s massive Cloud Failure? Will it shake up Cloud Service Providers who’ve been extremely complacent? Have cloud users just gotten a wake-up call, and will they be more circumspect about Cloud Provider availability claims?

1. Alan Weissberger
  
  April 24, 2011
  
  Some answers from the NY Times:
  
  “This is a wake-up call for cloud computing,” said Matthew Eastwood, an analyst for the research firm IDC, using the term for accessing services and information in big data centers remotely over the Internet from anywhere, as if the services were in a cloud. “It will force a conversation in the industry.”
  
  That discussion, he said, will most likely center on what data and computer operations to send off to the cloud and what to keep inside the corporate walls.
  
  But another issue, Mr. Eastwood said, will be a re-examination of the contracts that cover cloud services — how much to pay for backup and recovery services, including paying extra for data centers in different locations. That is because the companies that were apparently hit hardest by the Amazon interruption were start-ups that, analysts said, are focused on moving fast in pursuit of growth, and less apt to pay for extensive backup and recovery services.
  
  Amazon set up a side business five years ago offering computing resources to businesses from its network of sophisticated data centers. Today, the company is the early leader in the fast-growing business of cloud computing.
  
  Big companies that have decided to put crucial operations on Amazon computers are apt to pay up for the equivalent of computing insurance, analysts say. Netflix, the movie rental site, has become a large customer of the Amazon cloud. Most of its Web technology — customer movie queues, search tools and the like — runs in Amazon data centers.
  
  Netflix said it had sailed through the last couple of days unscathed. “That’s because Netflix has taken full advantage of Amazon Web Services’ redundant cloud architecture,” which insures against technical malfunctions in any one location, said Steve Swasey, a Netflix spokesman.
  
  BigDoor, a 20-employee start-up in Seattle, was knocked down by Amazon’s travails. It had backup and recovery services with Amazon, said Keith Smith, the chief executive, but only at Amazon’s data center in Virginia. “There’s always a trade-off,” Mr. Smith said, noting the expenses and developer time that would have been required to do more.
  
  By Friday evening, most services at BigDoor, which makes game and rewards features for online publishers, were back up, but its Web site was still down.
  
  The long-term toll to cloud computing, if any, is uncertain. Corporate cloud computing is expected to grow rapidly, by more than 25 percent a year, to $55.5 billion by 2014, IDC estimates.
  
  Major technology suppliers are aggressively promoting different cloud offerings — some emphasizing a utility-style service, like Amazon, and others focusing more on selling big companies the hardware and software to more efficiently juggle computing workloads. The latter use the cloud technology, but companies own and control them — so-called private clouds.
  
  http://www.nytimes.com/2011/04/23/technology/23cloud.html?ref=technology
  
Victor Grado

April 25, 2011

I think it’s too early to put all your eggs in one basket. Cloud technology is not mature enough at this point, with half-baked standards and practices, so caveat emptor. Great article, Alan.

Greg

April 25, 2011

The cloud has issues, yes, but I fail to see how you’d survive a similar catastrophic failure to your own system, a system that would cost you much more to build. And even if you spent the $$$$$$$$$ to build a fault tolerant system, you don’t have any control over external factors… damage to a facility for example.

Ken Pyle, Managing Editor

April 25, 2011

Alan, thanks for the timely and popular article.

I was absorbed in another project over the past couple of days and I think that project was affected by the described outage. I tried to place an ad in a newspaper and their site was down all day. After talking to a customer support person (yes, I picked up the phone), they suggested the service would be up later on Thursday and it was. Unfortunately, my advertisement never made it to the paper, which may have cost us sales at our Flea Market Fundraiser.

You have been expressing concern about the vulnerabilities inherent in the cloud for a while now, and this outage is a proof point for your argument. As pointed out, this could happen with a home-grown system. I think the broader picture is the importance of back-up plans and data portability, regardless of approach. Of course, as we find out with so many complex systems, it is difficult to identify every possible point of failure.

IEEE member

April 26, 2011

The problem originated in the Northern Virginia data center for Amazon’s EC2 (Elastic Compute Cloud) Web service when a network connection failure on Thursday morning triggered an automatic recovery mechanism, which also failed. Thousands of websites affected by the crash, which lasted a couple of days, included reddit, foursquare, HootSuite, BigDoor, and Quora. Some of the larger sites (such as foursquare) came back up as engineers frantically worked to restore services.

http://www.bizmology.com/2011/04/26/amazon-cloud-computing-outage-raises-questions/

Alan hit the problem on the head: network failure recovery FAILED! Keep pounding the table, guy!

Alan Weissberger

June 21, 2011

At the June 20-21, 2011 Cloud Leadership Forum, no one seemed to be bothered by the Amazon cloud failure. The speaker from CenturyLink actually said that Cloud did not fail users and that Amazon had done a good job in explaining the outage. Come again? Am I dreaming?

Weren’t SOAP-based web services and grid computing supposed to change everything – just 5 years ago? Yet they were both commercial failures.

While everyone knows that security is an impediment to cloud adoption, what about reliability, availability, fast failure recovery and performance (especially when the public Internet is used to deliver Cloud Services)? And what about the lack of standards, which means there’s no interoperability amongst different cloud vendors? Or maybe vendor lock-in is not a problem?

Please give us a break from all this cloud hype. Instead, have the cloud service providers and vendors tackle the critical issues needed for cloud to be a successful business!
