The commoditization of technology has reached its pinnacle with the advent of the recent paradigm of Cloud Computing. Infosys Cloud Computing blog is a platform to exchange thoughts, ideas and opinions with Infosys experts on Cloud Computing


Achieving Business Resiliency on Cloud

Enterprises have adopted cloud-driven transformation in their digital journey to realize the typical benefits of an agile and scalable infrastructure platform, cost reduction and flexible commercial consumption models. The next generation of cloud adoption brings an increased focus on business resiliency: delivering business outcomes by improving service availability and thereby enhancing customer experience.

Has your Enterprise adopted a Resilient and Robust system for your business critical applications? If so, has your investment protected you from business failures and ensured quick recovery during outage situations?

According to an IDC survey [1], 87% of businesses wait up to four hours for support when an outage occurs. An infrastructure failure costs $100,000 per hour, while a critical application failure for a large enterprise costs a staggering $500,000 to $1 million per hour.

 

Unplanned downtime adversely impacts business goals and customer experience. It takes great effort and expertise to reduce annual downtime from 43.8 hours (99.5% availability) to 0.876 hours (99.99% availability).
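The relationship between an availability percentage and the downtime it permits is simple arithmetic. The sketch below assumes a 365-day year (8,760 hours); the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Return the hours of downtime per year that an availability target permits."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.5, 99.9, 99.99):
        print(f"{pct}% availability -> {annual_downtime_hours(pct):.3f} hours/year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the last step to 99.99% is so demanding.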

Recently, a leading stock exchange suffered a four-hour outage when a failure of its telecom links made the online risk management system unavailable. Because of an interoperability failure, trading had to be suspended, which led to huge losses for investors.

Achieving business resiliency, and with it the ability to provide service continuity during an outage or crisis, is of paramount importance to organizations. Meeting the required uptime is possible only when all the components and layers of application service delivery, including the infrastructure layers, are mapped, designed and comprehensively tested for resiliency and reliability.

Resiliency is the ability of a service to recover fully and quickly, within the desired time, from the failure of any system component, and to provide service continuity in the meantime, albeit with degraded performance for a short time until recovery.

More than two-thirds of organizations are yet to implement effective resiliency measures. Let's look at the common reasons why outages still occur and how they can be avoided.

Design not aligned with the Service Level Objectives, or not revisited often enough to keep delivering on them

What you have is not what you need. Often, system architecture is designed around component-level availability, such as that of compute, without considering the overall business SLAs. For example, two components in the application's chain of dependencies, each with 99.9% availability, provide a net availability of about 99.8%, not 99.9% service uptime. A high-availability design with multiple cloud zones doesn't necessarily mean a resilient infrastructure. Typical resiliency parameters such as Recovery Point Objective (RPO) and Recovery Time Objective (RTO), even when defined to meet the business uptime requirements, are not tested thoroughly or consistently enough.
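The arithmetic behind that example can be sketched as follows. The function names are illustrative, and both formulas assume that components fail independently, which is an idealization rather than something a real deployment guarantees.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)


def parallel_availability(replicas: Iterable[float]) -> float:
    """Availability of redundant replicas where any single one is sufficient.

    Assumes independent failures; correlated failures (shared network,
    shared control plane, a bad release) make the real figure lower.
    """
    return 1 - prod(1 - a for a in replicas)


if __name__ == "__main__":
    chain = serial_availability([0.999, 0.999])
    print(f"Two 99.9% components in series: {chain:.4%}")
    pair = parallel_availability([0.99, 0.99])
    print(f"Two 99% replicas in parallel:   {pair:.4%}")
```

The parallel figure is only as good as the independence assumption, which is one reason a multi-zone design on its own does not guarantee resiliency.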

What you had may not suffice now. The design parameters and architecture are often not revisited after the initial setup. For example, a database with provisioned throughput started throwing exceptions two to three months after its initial commissioning. Troubleshooting revealed that additional batch jobs had been submitted on the same instance, and the minimum Read Capacity Units (RCU) had to be increased to handle the higher workload.

Network connectivity is one of the most common causes of disruption. To optimize costs, many customers plan for a single primary dedicated cloud connectivity link (Direct Connect/Express Route/Cloud Connect) with a VPN as the backup link, even when the required bandwidth is more than 1 Gbps. VPN has limitations, such as a 1.25 Gbps bandwidth cap and constraints on Equal Cost Multi Path (ECMP) support for the data egress path. As a result, the backup link is unable to sustain the transaction workload if the primary connection becomes unavailable.
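A quick capacity check makes the gap visible before an outage does. The sketch below is a simplified model under stated assumptions: each VPN tunnel is capped at a fixed rate, and tunnels can be aggregated only when ECMP is available on the path in question. The function names and the sample figures are hypothetical.

```python
from typing import Sequence


def usable_backup_gbps(tunnel_caps_gbps: Sequence[float], ecmp_enabled: bool) -> float:
    """Estimate usable backup bandwidth.

    Simplifying assumption: with ECMP, traffic spreads across all tunnels;
    without it, only one tunnel carries traffic at a time.
    """
    if not tunnel_caps_gbps:
        return 0.0
    return sum(tunnel_caps_gbps) if ecmp_enabled else max(tunnel_caps_gbps)


def failover_shortfall_gbps(
    required_gbps: float, tunnel_caps_gbps: Sequence[float], ecmp_enabled: bool
) -> float:
    """Bandwidth the backup path cannot carry if the primary link fails."""
    return max(0.0, required_gbps - usable_backup_gbps(tunnel_caps_gbps, ecmp_enabled))


if __name__ == "__main__":
    # Hypothetical workload needing 2 Gbps with a single 1.25 Gbps VPN tunnel.
    print(failover_shortfall_gbps(2.0, [1.25], ecmp_enabled=False))
```

Any non-zero shortfall means the backup path will throttle or drop part of the transaction workload during a failover.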

Interoperability and dependency testing not carried out with other business applications

Change in one application can impact or break another application! While unit and integration tests are done as part of production releases for an application, the test cases are mostly manual and do not adequately simulate the failure scenarios of the dependent components from another application. Often, the inter-application dependencies are overlooked, and testing is limited to test cases specific to the individual application.

Application focused testing instead of Business Centric Testing

Not every application release goes through the full, comprehensive testing cycle needed to verify the resiliency of the business process. Integration testing often doesn't cover the various layers of the technology stack that support the business process. I have seen test plans that omit infrastructure testing to assess the impact of application changes, which sometimes leads to performance degradation after release. For example, changes to the design and size of embedded images on a page can affect the encryption/decryption time during page rendering and transactions, so SSL device sizing should be verified.

Continued use of the legacy monitoring framework

Monitoring agents are certainly set up at various layers such as servers, network, databases and applications. But often the essentials are not defined: the metrics to monitor, the tools strategy and the right tool to measure each metric, the measurement frequency, and the soft limits that trigger automated alarms and self-healing actions. Cloud offers great potential for integrated monitoring and unified dashboards, which is under-utilized in most cloud operations.

How can we achieve business resiliency on Cloud? Let's look at the A2F Framework, six practices labelled A to F, which should be adopted as a high priority to minimize downtime risks and to recover quickly to the normal operational state in outage situations.

Architect and Design for Resiliency in the initial stage of cloud adoption

We can implement a more robust and resilient infrastructure by using blueprints and reference architectures. Hyperscalers such as AWS have set up the AWS Architecture Center and Well-Architected Labs, which provide vetted architectures and solutions along with best practices from the AWS Well-Architected Framework. Service providers such as Infosys offer Infosys Cobalt, a set of services, solutions and platforms that accelerates the setup of a highly available and resilient infrastructure platform. Infosys is also a certified Well-Architected Review Partner that evaluates production workloads and recommends improvements for better customer outcomes. Organizations are increasingly adopting cloud-native services to run their workloads, whether by re-designing the application architecture with Container as a Service (CaaS), Platform as a Service (PaaS) or serverless, or by establishing operational processes for backup, monitoring, patch management and so on that replace the disparate third-party tools and products of legacy environments.

Public cloud connectivity should be designed and set up with separate dedicated links on separate devices in different locations, using a dual-vendor approach, to eliminate single points of failure (SPOF) and maximize resiliency. For non-critical and/or non-production workloads with less stringent uptime requirements, the level of redundancy can be traded off against cost.

Business Centric Testing

The test plan should be derived to validate the service level objectives and the business user experience, going beyond the application functionality validation. In addition to simulating production-like performance testing on a staging area, there are end user experience monitoring tools available natively from Hyperscalers as well as third-party tools that should be leveraged to check for latency and response times.

Comprehensive Testing Approach

Business availability test cases need to be recalibrated, not only by identifying inter-application dependencies but also by validating application continuity through simulated failures at various layers. Hyperscalers such as AWS have released services such as AWS Fault Injection Simulator (chaos engineering as a service) with pre-built experiment templates. We can use these, alongside the existing manual test cases, to run game days that improve the application's resiliency and observability. The test results should be analyzed and measures taken so that the system can self-heal and recover autonomously.
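The idea of a fault injection experiment can be illustrated without any cloud service. The sketch below is a toy, in-process example and not the AWS Fault Injection Simulator API: `FlakyDependency`, `QuoteService` and `run_experiment` are hypothetical names, and the fallback to a cached value is just one possible degraded mode.

```python
class FlakyDependency:
    """Stand-in for a downstream system whose failure can be switched on."""

    def __init__(self) -> None:
        self.fail = False

    def get_price(self, symbol: str) -> float:
        if self.fail:
            raise ConnectionError("injected fault")
        return 101.5  # fixed value keeps the example deterministic


class QuoteService:
    """Service under test: serves the last known price when the dependency is down."""

    def __init__(self, dependency: FlakyDependency) -> None:
        self.dependency = dependency
        self.cache: dict[str, float] = {}

    def quote(self, symbol: str) -> dict:
        try:
            price = self.dependency.get_price(symbol)
            self.cache[symbol] = price
            return {"price": price, "degraded": False}
        except ConnectionError:
            if symbol in self.cache:
                return {"price": self.cache[symbol], "degraded": True}
            raise  # no fallback available, surface the failure


def run_experiment() -> tuple[dict, dict, dict]:
    """Steady state, then inject a fault, then remove it and check recovery."""
    dependency = FlakyDependency()
    service = QuoteService(dependency)
    baseline = service.quote("ABC")
    dependency.fail = True
    during_fault = service.quote("ABC")
    dependency.fail = False
    after_recovery = service.quote("ABC")
    return baseline, during_fault, after_recovery


if __name__ == "__main__":
    for phase in run_experiment():
        print(phase)
```

The experiment passes when the service keeps answering in degraded mode during the fault and returns to normal once the fault is removed, which mirrors the definition of resiliency used earlier in this post.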

DevOps based Release Management

Teams should implement DevOps for continuous build, testing and release. We should plan for small changes and test them continuously. We can use the Blue-Green and Canary deployment features in Cloud to test changes with a small set of users, monitor the performance and gradually roll out to a larger user base. Any performance degradation or functional issue caused by faulty code can then be detected early and rolled back.
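The control loop behind a canary rollout can be sketched in a few lines. This is a simplified model, not any cloud provider's deployment API: the traffic steps, the 1% error threshold and the `error_rate_at` callback (which stands in for real monitoring data) are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def canary_rollout(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (1, 5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> dict:
    """Shift traffic to the new version step by step, rolling back on bad metrics.

    error_rate_at(pct) returns the error rate observed while pct percent of
    users are served by the new version.
    """
    completed: list[int] = []
    for pct in steps:
        observed = error_rate_at(pct)
        if observed > max_error_rate:
            # Stop at the first unhealthy step so only pct% of users are affected.
            return {"status": "rolled_back", "failed_at": pct, "completed": completed}
        completed.append(pct)
    return {"status": "promoted", "failed_at": None, "completed": completed}


if __name__ == "__main__":
    # Hypothetical release that only misbehaves under heavier load.
    print(canary_rollout(lambda pct: 0.05 if pct >= 25 else 0.001))
```

Because the check runs at every step, a fault that only appears under load is caught while most users are still on the previous version.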

Engineer Observability Platform for preventive monitoring with self-healing capability

It is essential to map the business SLOs for a set of applications onto the availability of every dependent component. We need to define the business tolerance and, accordingly, identify the metrics to measure, choose the right tools to monitor them proactively, and set soft limits that trigger self-healing actions where possible, before the service degrades to unacceptable response times. A 'one tool to monitor everything' strategy will fail. Specialized tools for deeper monitoring and insights should be set up based on the metrics that need to be collected.
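A minimal sketch of the soft-limit idea is shown below. The metric name, the limit values and the remediation callback are hypothetical; in practice these would be wired to whichever monitoring and automation tools are in use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Threshold:
    metric: str
    soft_limit: float  # crossing this triggers a self-healing action
    hard_limit: float  # crossing this means the SLO is at risk: raise an alarm


def evaluate(threshold: Threshold, value: float) -> str:
    """Classify a metric reading as 'ok', 'self_heal' or 'alarm'."""
    if value >= threshold.hard_limit:
        return "alarm"
    if value >= threshold.soft_limit:
        return "self_heal"
    return "ok"


def handle(
    threshold: Threshold, value: float, remediate: Optional[Callable[[], None]] = None
) -> str:
    """Run the remediation callback when the soft limit is breached."""
    state = evaluate(threshold, value)
    if state == "self_heal" and remediate is not None:
        remediate()
    return state


if __name__ == "__main__":
    # Hypothetical limits for 95th percentile response time in milliseconds.
    latency = Threshold(metric="p95_latency_ms", soft_limit=400, hard_limit=800)
    for reading in (250, 450, 900):
        print(reading, handle(latency, reading, remediate=lambda: print("scaling out")))
```

Keeping the soft limit well below the hard limit gives automated remediation time to act before users notice a breach.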

The cloud platform should play the role of a monitor of monitors and provide a single reporting dashboard. AWS services such as the AWS Management Console, AWS Service Health Dashboard, CloudWatch Dashboards and custom reports with QuickSight, together with the ability to integrate third-party tools, should be leveraged for effective monitoring. Service providers add a lot of value by packaging these services with custom solutions into an observability platform.

Fail Fast and Learn Fast

Do not become an architecture astronaut. Instead, we should design an evolutionary, modular architecture that can adapt to required changes based on learnings from simulation tests and that helps maintain a dynamic and responsive infrastructure. Organizations should institutionalize the change management process so that changes and feedback from continuous testing cycles are implemented.

 

References: [1] "DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified," by IDC Vice President Stephen Elliot.
