The commoditization of technology has reached its pinnacle with the advent of the recent paradigm of Cloud Computing. Infosys Cloud Computing blog is a platform to exchange thoughts, ideas and opinions with Infosys experts on Cloud Computing


April 28, 2021

Achieving Business Resiliency on Cloud

Enterprises have adopted cloud-driven transformation in their digital journeys to realize the typical benefits of an agile and scalable infrastructure platform, cost takeout and flexible commercial consumption models. The next stage of cloud adoption brings an increased focus on business resiliency, delivering on business outcomes by improving service availability and thereby enhancing customer experience.

Has your enterprise adopted a resilient and robust design for its business-critical applications? If so, has that investment protected you from business failures and ensured quick recovery during outages?

According to an IDC survey1, 87% of businesses wait up to four hours for support when an outage occurs. An infrastructure failure costs $100,000 per hour, while a critical application failure in a large enterprise costs a staggering $500,000 to $1 million per hour.

 

Unplanned downtime adversely impacts business goals and customer experience. It takes considerable effort and expertise to reduce annual downtime from 43.8 hours (99.5% availability) to 0.876 hours (99.99%).

Recently, a leading stock exchange suffered a four-hour outage when telecom link failures rendered its online risk management system unavailable; because of an interoperability failure, trading had to be suspended, leading to huge losses for investors.

Achieving business resiliency and the ability to provide service continuity during an outage or crisis is of paramount importance to organizations. Meeting the required uptime is possible only when all the components and layers of the application service delivery, including the infrastructure layers, are mapped, designed, and comprehensively tested for resiliency and reliability.

Resiliency is the ability of a service to recover fully and quickly, within the desired time, from the failure of any system component and to provide service continuity, albeit with degraded performance, for a short time until recovery.

More than two-thirds of organizations are yet to implement effective resiliency measures. Let's look at the common reasons why outages still occur and how they can be avoided.

Design not aligned with, or revisited often enough to deliver on, the Service Level Objectives

What you have is not what you need. Often, system architecture is designed around component-level availability, such as compute, without considering the overall business SLAs. For example, 99.9% availability for each of two components in an application's dependency chain yields a net availability of 99.8%, not 99.9%. A high-availability design across multiple cloud zones does not necessarily mean a resilient infrastructure. Typical resiliency parameters such as Recovery Point Objective (RPO) and Recovery Time Objective (RTO), even when defined to meet business uptime requirements, are not tested consistently or often enough.
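As a quick illustration of how component-level availability compounds across a dependency chain, the sketch below multiplies the availability of serially dependent components (the component names and figures are illustrative):

```python
# Illustrative calculation: the availability of serially dependent components
# multiplies, so the composite service availability is lower than that of any
# individual component.

def composite_availability(availabilities):
    """Return the net availability of components in a serial dependency chain."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Two components at 99.9% each -> ~99.8% for the end-to-end service.
components = {"web tier": 0.999, "database": 0.999}
net = composite_availability(components.values())
print(f"Composite availability: {net:.4%}")              # 99.8001%
print(f"Annual downtime: {(1 - net) * 8760:.1f} hours")  # ~17.5 hours
```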

What you had may not suffice now. The design parameters and architecture are often not revisited after the initial setup. For example, a database with provisioned throughput started throwing exceptions two to three months after commissioning. Troubleshooting revealed that additional batch jobs had been submitted on the same instance, and the minimum Read Capacity Units (RCU) had to be increased to handle the higher workload.
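If the database in question were Amazon DynamoDB (the RCU terminology suggests it), raising the provisioned read capacity could look like the minimal sketch below; the table name and capacity figures are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table that started throttling after extra batch jobs were added;
# raise the provisioned read capacity to absorb the higher workload.
dynamodb.update_table(
    TableName="orders-batch",          # hypothetical table name
    ProvisionedThroughput={
        "ReadCapacityUnits": 500,      # raised from a lower value
        "WriteCapacityUnits": 100,     # unchanged, but required by the API call
    },
)
```

In practice, DynamoDB auto scaling or on-demand capacity mode would avoid this class of throttling altogether.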

Network connectivity is one of the most common causes of disruption. To optimize costs, many customers plan for a single primary dedicated cloud connectivity link (Direct Connect / ExpressRoute / Cloud Connect) with VPN as the backup link, even when the required bandwidth is more than 1 Gbps. VPN has limitations, such as a bandwidth cap of about 1.25 Gbps per tunnel and Equal Cost Multi Path (ECMP) support only on the data egress path, so the backup link cannot sustain the transaction workload when the primary connection is unavailable.

Interoperability and dependency testing not carried out with other business applications

A change in one application can impact or break another application! While unit and integration tests are performed as part of production releases for an application, the test cases are mostly manual and do not adequately simulate failure scenarios in dependent components of other applications. Inter-application dependencies are often overlooked, and testing is limited to test cases specific to the individual application.

Application focused testing instead of Business Centric Testing

Not every application release goes through a full, comprehensive testing cycle to verify business process resiliency. Integration testing often does not cover the various layers of the technology stack involved in the business process. I have seen infrastructure testing, meant to assess the impact of application changes, missing from the plan, which sometimes leads to performance degradation after release. For example, changes to the design and size of images embedded in a page can affect encryption/decryption time during page rendering and transactions, so SSL device sizing should be verified.

Continuing to use the legacy monitoring framework

Monitoring agents are certainly set up at various layers such as servers, network, databases and applications. But often the metrics to monitor, the tooling strategy, the right tool selection to measure those metrics, the measurement frequency, and the soft limits that trigger automated alarms and self-healing actions are not defined. Cloud offers great potential for integrated monitoring and unified dashboards, yet this remains under-utilized in most cloud operations.

How can we achieve business resiliency on cloud? Let's look at the A2F framework, which should be adopted as a high priority to minimize downtime risks and to recover quickly to the normal operational state during outages.

Architect and Design for Resiliency in the initial stage of cloud adoption

We can implement a more robust and resilient infrastructure by using blueprints and reference architectures. Hyperscalers such as AWS provide the AWS Architecture Center and Well-Architected Labs, giving access to vetted architectures, solutions and best practices through the AWS Well-Architected Framework. Service providers such as Infosys offer Infosys Cobalt, a set of services, solutions and platforms that accelerates the setup of a highly available and resilient infrastructure platform; Infosys is also a certified Well-Architected Review Partner, evaluating production workloads and making recommendations for better customer outcomes. Organizations are also adopting more cloud-native services for running workloads, whether re-designing the application architecture with Containers as a Service (CaaS), Platform as a Service (PaaS) or serverless, or establishing operational processes for backup, monitoring, patch management and so on, replacing the disparate third-party tools and products of legacy environments.

Public cloud connectivity should be designed and set up with separate dedicated links on separate devices in different locations, using a dual-vendor approach, to eliminate single points of failure (SPOF) and maximize resiliency. For non-critical and/or non-production workloads with less stringent uptime requirements, a redundancy-versus-cost trade-off can be made.

Business Centric Testing

The test plan should be designed to validate the service level objectives and the business user experience, going beyond application functionality validation. In addition to production-like performance testing in a staging environment, end-user experience monitoring tools, available natively from the hyperscalers as well as from third parties, should be leveraged to check latency and response times.

Comprehensive Testing Approach

There is a need to recalibrate business availability test cases, not only identifying inter-application dependencies but also validating application continuity by simulating failures at various layers. Hyperscalers such as AWS have released services such as AWS Fault Injection Simulator (chaos engineering as a service) with pre-built experiment templates that can be used to run game days and improve application resiliency and observability, in addition to the existing manual test cases. The test results should be analyzed and measures taken so that systems self-heal and recover autonomously.
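A minimal sketch of kicking off a game-day run against an AWS Fault Injection Simulator experiment template via boto3; the template ID is hypothetical and the template itself (for example, one that stops a random EC2 instance in an Auto Scaling group) is assumed to have been created beforehand:

```python
import uuid
import boto3

fis = boto3.client("fis")

# Start an experiment from a pre-created template.
response = fis.start_experiment(
    clientToken=str(uuid.uuid4()),            # idempotency token
    experimentTemplateId="EXT123456789012",   # hypothetical template ID
    tags={"purpose": "game-day"},
)
experiment_id = response["experiment"]["id"]

# Check the experiment state while observing dashboards and alarms.
state = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
print(f"Experiment {experiment_id} is {state}")
```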

DevOps based Release Management

Teams should implement DevOps for continuous build, test and release. Plan for small changes and test them continuously. Blue-green and canary deployment capabilities in the cloud can be leveraged to expose changes to a small set of users, monitor performance and gradually roll out to a larger user base. Any performance degradation or functionality issue caused by faulty code can then be detected early and rolled back.
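One way to realize a canary rollout with cloud-native primitives is weighted alias routing on AWS Lambda; the function name and version numbers below are hypothetical, and the same idea applies to ALB weighted target groups or CodeDeploy traffic shifting:

```python
import boto3

lam = boto3.client("lambda")

# Send 10% of invocations of the "live" alias to new version 5 while
# version 4 continues to serve the remaining 90%.
lam.update_alias(
    FunctionName="checkout-service",   # hypothetical function name
    Name="live",
    FunctionVersion="4",               # current stable version
    RoutingConfig={"AdditionalVersionWeights": {"5": 0.10}},
)

# After monitoring error rates and latency, promote fully or roll back, e.g.:
# lam.update_alias(FunctionName="checkout-service", Name="live",
#                  FunctionVersion="5", RoutingConfig={})
```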

Engineer Observability Platform for preventive monitoring with self-healing capability

It is essential to map the business SLOs of a set of applications to the availability of every dependent component. We need to define the business tolerance and accordingly identify the metrics to measure, select the right tools for proactive monitoring, and set soft limits that trigger self-healing actions where possible, before the service degrades to unacceptable response times. A 'one tool to monitor everything' strategy will fail: there are specialized tools for deeper monitoring and insights that should be set up based on the metrics to be collected.
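A hedged sketch of such a 'soft limit' on response time that triggers an automated action before users notice degradation; the load balancer dimension, SNS topic ARN and thresholds are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average target response time stays above 1.5s for 3 consecutive
# minutes, notifying an SNS topic that fronts a self-healing automation
# (for example a scale-out action).
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-latency-soft-limit",   # hypothetical alarm name
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/orders-alb/0123456789abcdef"}],  # hypothetical
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1.5,                               # seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:self-healing-actions"],
)
```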

The cloud platform should play the role of a monitor of monitors and provide a single reporting dashboard. AWS services such as the AWS Management Console, the AWS Service Health Dashboard, CloudWatch dashboards and custom reports with QuickSight, along with the ability to integrate third-party tools, should be leveraged for effective monitoring. Service providers add significant value by packaging these services with custom solutions into an observability platform.

Fail Fast and Learn Fast

Do not become an architecture astronaut. Instead, design an evolutionary, modular architecture that can adapt to required changes based on learnings from simulation tests and helps maintain a dynamic and responsive infrastructure. Organizations should institutionalize the change management process to implement the changes and feedback from continuous testing cycles.

 

References: 1. "DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified," by IDC Vice President Stephen Elliot.

April 5, 2021

The Intelligent Guard Who Detects Threats in the Cloud - AWS GuardDuty

With the surge in globally connected systems and cloud computing, a lot of sensitive data is being stored and processed, which makes it more important than ever for organizations to focus on protecting it from increasingly sophisticated cyber-attacks.

To detect threats and protect infrastructure as well as workloads, one traditionally has to deploy additional software and infrastructure with appliances, sensors and agents, set them up across all accounts, and then continuously monitor and protect those accounts. This means collecting and analyzing a tremendous amount of data, accurately detecting threats based on that analysis, prioritizing them and responding to alerts. And when all this is required at scale, we need to ensure that business functions and environments are not disrupted and that flexibility in the cloud is not impeded.

This requires a lot of expertise, time and upfront cost. There are third-party managed tools available, such as Check Point CloudGuard Dome9 and Palo Alto Prisma Cloud, but they can be costly for small to medium environments and require specific skills to deploy and manage.
AWS GuardDuty is a cloud-scale, easier, smarter and cost-effective managed intelligent threat detection and notification service that protects AWS environments and workloads.

It is a managed service that constantly monitors the AWS environment to find unusual or malicious behavior, filter out noise and prioritize critical findings. In other words, it helps find the needle in the haystack, so that the security team can focus on hardening the AWS environment and quickly respond to suspicious or malicious activity.

In the context of the NIST framework for cloud security, it fits under "Detect", as it is AWS's primary threat detection tool.



Most threat detection services focus on network traffic to identify malicious activity; GuardDuty, however, also analyses unusual API calls and potentially unauthorized deployments that indicate compromised accounts or instances within AWS.

AWS GuardDuty analyses output from three primary data sources to detect threats: VPC Flow Logs, DNS logs and AWS CloudTrail. It then applies machine learning, anomaly detection and integrated threat intelligence across these data sources to identify, prioritize and notify on potential threats.


Enabling VPC Flow Logs for a large environment can be very expensive. The good news is that GuardDuty does not require any of these services to be enabled; the data and logs it needs are gathered through an independent backend channel directly from these services. As soon as GuardDuty is enabled, a parallel stream of data feeds into the GuardDuty backend.
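Enabling GuardDuty in a region and pulling its highest-severity findings takes only a few boto3 calls; a minimal sketch, with no prior log configuration assumed:

```python
import boto3

guardduty = boto3.client("guardduty")

# Enabling the detector is all that is needed; GuardDuty consumes VPC Flow
# Logs, DNS logs and CloudTrail events through its own backend channel.
detector_id = guardduty.create_detector(
    Enable=True,
    FindingPublishingFrequency="FIFTEEN_MINUTES",
)["DetectorId"]

# List high-severity findings (severity 7.0 and above).
finding_ids = guardduty.list_findings(
    DetectorId=detector_id,
    FindingCriteria={"Criterion": {"severity": {"GreaterThanOrEqual": 7}}},
)["FindingIds"]

if finding_ids:
    findings = guardduty.get_findings(DetectorId=detector_id,
                                      FindingIds=finding_ids)["Findings"]
    for finding in findings:
        print(finding["Type"], finding["Severity"])
```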

Following are a few characteristics of GuardDuty:
 
  • Simplicity - enabling GuardDuty has no architectural or performance impact on the existing environment.
  • Continuous monitoring of AWS accounts and resources - since there is no agent to install, any resource created in a region protected by GuardDuty is automatically covered.
  • GuardDuty detects known threats, such as API calls coming from known malicious IP addresses, based on threat intelligence from up-to-date sources such as AWS security intelligence, CrowdStrike and Proofpoint. It also detects unknown threats, such as unusual data access or crypto-currency mining, based on machine learning and the behaviour of users and instances.


GuardDuty findings are classified as either stateless, which are independent of server or service state, such as a match against a known malicious IP address, or stateful, which are behavioural detections that require the state of an EC2 instance, IAM user or role to be tracked in order to analyse deviations from usual behaviour.


These findings are segregated into high, medium and low severity levels based on the threat severity value associated with them. This severity value, defined by AWS, reflects the potential risk of each finding. The value falls between 0.1 and 8.9; the higher the value, the greater the risk. AWS has reserved the values 0 and 9.0 to 10.0 for future use.

Severity Level | Associated Severity Value | Implication
High | 7.0 - 8.9 | Resource compromised. Immediate action needed.
Medium | 4.0 - 6.9 | Deviation from normal behaviour observed. Further investigation needed to ascertain resource compromise.
Low | 1.0 - 3.9 | Suspicious activity attempted or a failed attack. No immediate response recommended.


GuardDuty supports a master and member account structure. Many member accounts can be associated with a master account, enabling enterprise-wide consolidation and management; in a large environment, hundreds of member accounts can be associated with a single master account. While individual teams or account owners can view the findings within their own accounts, the centralized security team only needs to look at the master account to get a holistic view. The team can also create policies that apply across accounts, such as IP whitelisting and suppression filtering of certain findings, and prevent individual accounts from applying independent policies. This way, control stays in the hands of the centralized security team.
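From the master (administrator) account, member accounts can be associated and invited with a couple of boto3 calls; the account ID and email below are hypothetical:

```python
import boto3

guardduty = boto3.client("guardduty")
detector_id = guardduty.list_detectors()["DetectorIds"][0]  # master account detector

# Register a member account and send it an invitation from the master account.
guardduty.create_members(
    DetectorId=detector_id,
    AccountDetails=[{"AccountId": "222233334444",            # hypothetical member
                     "Email": "cloud-team@example.com"}],    # hypothetical email
)
guardduty.invite_members(
    DetectorId=detector_id,
    AccountIds=["222233334444"],
    Message="Please accept to centralize GuardDuty findings.",
)
```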



Filtering noise: While GuardDuty provides important insights, it is equally important to separate genuine alerts from insignificant ones and prioritize accordingly. Some alerts require an immediate response, while many others are worth ignoring because they only create unnecessary panic and overhead. One example is a false-positive alarm, although these are rare. Another is an alert generated for an expected activity: a vulnerability scanning tool deployed for port scanning will produce a "port scanning" finding whenever it performs a scan. The alert is genuine, but the activity is expected. The same applies to port scanning from a non-malicious IP against an intentionally opened port on a web server, or SSH on a bastion host. In such scenarios either not much can be done or, if the risk is accepted, the user may simply want to avoid the notifications.

The solution is to create automatic filters in the form of "suppression rules". When a suppression rule is created, matching findings are still listed in the GuardDuty console, but they are not sent to CloudWatch Events, which prevents any downstream action from being triggered.

While only the master GuardDuty account can create suppression filters, they are automatically applied to all member accounts. This allows the centralized security team to control suppression across the enterprise and also reduces the effort of applying the filters in individual accounts.
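In API terms, a suppression rule is a GuardDuty filter whose action is to auto-archive matching findings; a hedged sketch that suppresses the expected port-scan findings from an approved internal scanner (the finding type is a standard GuardDuty type, while the filter name and scanner IP are illustrative):

```python
import boto3

guardduty = boto3.client("guardduty")
detector_id = guardduty.list_detectors()["DetectorIds"][0]

# Auto-archive (suppress) port-scan findings generated by an approved internal
# vulnerability scanner; the findings stay visible in the console but are not
# forwarded to CloudWatch Events.
guardduty.create_filter(
    DetectorId=detector_id,
    Name="suppress-approved-scanner-portscan",   # hypothetical filter name
    Action="ARCHIVE",
    Rank=1,
    FindingCriteria={
        "Criterion": {
            "type": {"Equals": ["Recon:EC2/Portscan"]},
            "service.action.networkConnectionAction.remoteIpDetails.ipAddressV4": {
                "Equals": ["10.0.12.34"]          # hypothetical scanner IP
            },
        }
    },
)
```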

Conclusion - The GuardDuty service is very quick and easy to deploy. Although it takes only AWS service logs into account, it still generates a lot of valuable information to help prevent possible attacks and to zero in on compromised servers during cyber-forensic activities.






Enable Enterprise Self-Service through the AWS Service Management (ASM) Connector for ServiceNow

The last decade of cloud computing was all about rushing to the cloud to reap the benefits of economy, agility, flexibility, scalability, reliability, elasticity and speed of deployment. But now clients want more: they are looking for cloud 2.0, where they expect a consistent and repeatable experience as they scale, by establishing the right level of controls without slowing down innovation.

This translates into cloud operations requirements such as:

  • Enable the enterprise to achieve self-service through automated resource provisioning in a standardized manner, integrated with enterprise ITSM systems.
  • Manage proactive central governance and security.

But there are some challenges:

  • There are too many operational tools in organizations.
  • Different ownership of ITSM tools and the AWS platform.
  • Lengthy procedures and long lead times from placing a resource request to getting it provisioned, with several hand-offs in between.


The solution is an integration between ITSM tools such as Jira or ServiceNow and the AWS platform. For ServiceNow, this can be achieved using the AWS Service Management Connector. The ASM connector enables integration with AWS Service Catalog, AWS Config and AWS Systems Manager from within ServiceNow, allowing end users to provision, manage and operate AWS resources natively through ServiceNow. This helps enterprises achieve an integrated and streamlined resource provisioning and management process, where cloud resources are ordered, provisioned and removed just like other IT assets.

Below are some of the features of this solution:

  • Support for AWS Service Catalog portfolios and products, enabling ServiceNow users to request, update and terminate AWS products via the ServiceNow Service Catalog.
  • Support for AWS Config configuration items of provisioned products, enabling end users to see resource details and relationships via the ServiceNow CMDB.
  • Support for AWS Systems Manager automation documents, enabling end users to run permitted automation playbooks on AWS resources via ServiceNow.
  • Freely available in the ServiceNow Store for the Orlando, New York and Madrid platform releases. Only the ITSM module of ServiceNow is required to leverage it.



Below is the experience of a ServiceNow end user:

  • The end user browses AWS products and portfolios in the ServiceNow interface, which are synced into ServiceNow from AWS Service Catalog via APIs.
  • The user then places a service request, just like for any other IT resource.
  • The request follows a pre-defined approval workflow and moves to an approver. Once approved, it triggers AWS Service Catalog to provision the requested AWS resources on the AWS platform (see the sketch after this list).
  • There is a clear segregation of responsibilities: the ServiceNow team manages the ServiceNow-side integration and interface, while the AWS platform team is responsible for the CloudFormation templates, publishing catalog products and governing AWS resources.
  • Since CloudFormation templates are used in the backend, almost all AWS services can be ordered and provisioned via this integration.
  • The catalog products are immutable, which means end users cannot alter them; they can only be requested as-is.
  • AWS administrators can standardize best practices and enforce compliance by applying constraints, such as parameter validation, IAM assignment, tag enforcement, EBS encryption and specific security groups, to name a few.
  • Once the request is fulfilled and the AWS resources are provisioned, a notification is triggered and the resources become available for the end user to access.
  • The workflow completes, the end user gets the required AWS resources, and they can also perform certain self-service actions such as update, start, stop, terminate and reboot on those resources.
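Under the hood, the connector drives the AWS Service Catalog APIs; a minimal sketch of the equivalent direct call is shown below, with hypothetical product, artifact and parameter values:

```python
import boto3

servicecatalog = boto3.client("servicecatalog")

# Provision a pre-approved catalog product (backed by a CloudFormation
# template), which is what the ServiceNow connector triggers after approval.
response = servicecatalog.provision_product(
    ProductId="prod-abcd1234efgh",                 # hypothetical product ID
    ProvisioningArtifactId="pa-ijkl5678mnop",      # hypothetical version ID
    ProvisionedProductName="dev-linux-workstation-42",
    ProvisioningParameters=[
        {"Key": "InstanceType", "Value": "t3.medium"},
        {"Key": "Environment", "Value": "dev"},
    ],
)
print(response["RecordDetail"]["Status"])          # e.g. CREATED / IN_PROGRESS
```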

The advantage of this workflow is that users do not need to be cloud experts. They need not be concerned about the governance and compliance requirements that must be followed while consuming cloud services. They do not even need direct access, in the AWS platform, to the specific AWS service being ordered. They simply have a pre-defined and pre-approved set of AWS services to leverage, meeting all enterprise requirements.

If a client is looking to accelerate cloud adoption and agility, achieve faster time to market, and boost experimentation and innovation while adhering to compliance and security requirements, this solution is worth looking into.

Reference: Amazon Web Services