Cloud Computing - Planning for the non-promise of High Availability
More and more businesses are experimenting with moving their applications into the cloud. A key motivation is to leverage highly available, redundant cloud infrastructure at reduced IT cost. However, sporadic failures this year in services from the cloud computing Big-3 indicate that companies have to factor downtime into their application deployment strategy, irrespective of advertised promises of high availability. A failure in Amazon's EC2 cloud service early this year took many internet sites down for the whole day, according to one report. Microsoft's Office 365 cloud service and Google's Gmail and Appservices have also had their share of downtime this year.
For architects of such systems, the promise of the 'cloud providing scalable and available infrastructure' should not distract from the need to give equal importance to application architecture and infrastructure architecture. Even when designing applications to be hosted on the cloud, it is important to understand the infrastructure requirements thoroughly and design for them. Though the cloud promises redundancy and data replication, the overall architecture must clearly outline how services are to be made redundantly available across zones (multiple physical locations offshore) and how data is to be replicated across those zones. Cloud service providers do offer such services, but it is up to the design of the application to use them. During the Amazon outage, some clients had designed for such failures, and so managed to shift their traffic to Amazon's West Coast data center when they found their servers hosted on the East Coast failing. In a project involving migration of an application from on-premises systems to Microsoft Azure, our customer had some very intelligent questions about the possibility of failure and about how the system we were designing would need to auto-correct before users noticed. The intelligence had to be built in, despite the option of paying for redundancy.
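The kind of built-in failover described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint URLs, the pick_endpoint helper and the simulated health check are all hypothetical, not any provider's API, and a real deployment would typically rely on DNS or load-balancer failover with proper health probes.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check.

    Returns None when every endpoint fails, so the caller can degrade gracefully.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    return None


# Hypothetical deployments of the same application in two regions,
# listed in order of preference.
ENDPOINTS = [
    "https://app.us-east.example.com",
    "https://app.us-west.example.com",
]


def simulated_health_check(endpoint: str) -> bool:
    # Assumption for the demo: the East Coast region is down.
    return "us-east" not in endpoint


if __name__ == "__main__":
    active = pick_endpoint(ENDPOINTS, simulated_health_check)
    print("Routing traffic to:", active)
```

The point of the sketch is that the decision to fail over lives in the application's own design. The second region only helps if something in the system checks health and redirects traffic to it.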
The recent uncertainty about downtime on the cloud, and the resulting revenue losses, have led businesses to spend more on building redundancy (by booking more compute, storage and network resources in the cloud). As a result, the cost of using the cloud increases, because businesses can no longer ignore redundancy costs. (The challenge over time is that investments have to last longer to cover the costs incurred in building in redundancy despite being on the cloud.) Companies craving 100% availability must bear premium costs for redundant computing and storage environments. (Cloud computing providers offer various levels of service - Amazon AWS gold plan, platinum plan etc.) With recent scientific evidence that natural catastrophes such as earthquakes could hit wider regions, it is important to plan for deployment across multiple zones (across landmasses if required). Remember, of course, that there are costs and risks involved. (For example, communication among compute roles in the same data center incurs no bandwidth charge, while communication across data centers is charged. Additionally, keeping data in data centers in other countries places it under the jurisdiction of those countries.)
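To put rough numbers on this trade-off, here is a back-of-the-envelope sketch. It assumes that redundant deployments fail independently of one another, which is optimistic because outages within one provider are often correlated. The 99.9% figure and the function names are illustrative assumptions, not any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, assuming independent failures.

    The system is down only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected downtime per year, in hours, for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # illustrative figure only
    for copies in (1, 2, 3):
        availability = combined_availability(single, copies)
        hours = expected_downtime_hours(availability)
        print(f"{copies} deployment(s): {availability:.9f} available, "
              f"about {hours:.4f} hours of downtime per year")
```

Under these assumptions a single 99.9% deployment is down for roughly 8.76 hours a year, and a second independent copy cuts that to well under a minute. The resource bill roughly doubles, though, and 100% is never actually reached.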
On the other side, cloud providers should be open to questions from customer architects about their infrastructure design and geographic spread. They should also be open to compliance audits, whether by their customers or by external third parties, including certifications (SAS 70 security reviews and ISO 27001). Certifications that result from such audits help increase customer confidence that nothing is drastically wrong, that the chance of failure is minimal and that recovery is covered.
Cloud providers are indeed realizing the need for greater service transparency. In fact, Microsoft Azure has invested in service dashboards, monitoring services and notification services that give users visibility into service availability. Using these services, customers can see in real time the state of the infrastructure and services on which their systems run.
Companies consuming cloud services should realize that those services are not a Holy Black Box where everything will just work forever. They have to plan strategies to counter infrastructure failures despite using the cloud, build in resilience and understand the trade-offs. Cloud providers should realize that customers are extremely concerned about availability promises on paper and in SLAs, and should build customer confidence by opening up to audits and certifications. Interestingly, the Asia Cloud Computing Association (ACCA) plans to release guidelines for ranking cloud providers on availability and seven other categories, to allow uniform comparison.