Infosys Microsoft Alliance and Solutions blog

Considerations for Big Data processing using Cloud Computing

The data (structured and unstructured) that influences enterprise decisions is increasing exponentially every year. It includes not only data generated internally within the enterprise but also external influences such as social media, government regulations, and public data sets.

Storage and compute capacity within the enterprise is usually limited and cannot grow at the same rate as the data, for reasons such as the lead time needed to procure infrastructure and the rising hardware and software costs of processing data. This imbalance often costs the enterprise opportunities, because it cannot process the data within the necessary window.

Cloud Computing helps resolve this because it provides a turnkey solution for on-demand network, compute, and storage, which are the critical building blocks of any Big Data processing solution.

Dealing with large volumes of data and reducing the latency of data processing are important dimensions of Big Data, and architecting a solution on the Cloud helps provide both out of the box.
However, there are certain considerations that enterprises need to weigh before adopting the Cloud (Public Cloud) as part of their Big Data solution. Some of these are:

Regulations - Certain government regulations restrict storing national data outside the country's geography. The European Union restricts the transfer of personal data outside its geography [1]. HIPAA places strict safeguards on how patient records are stored and handled, which can constrain where they reside [2].
However, such restrictions can be addressed by choosing a Cloud data center in the specific geography for storing such region-sensitive information. Almost all the leading cloud vendors have data centers across the world and offer the option of choosing the data center.

Data Security and Privacy - Not all Cloud providers are transparent about how they manage data privacy in their data centers. Although there are established international certification standards for data centers, not all vendor data centers are certified. The data center's compliance with the relevant certifications should be verified based on business needs. Microsoft has published several papers discussing its security framework and the certifications its data centers comply with; these are listed in the end notes [3].

Cost Considerations - The cost of storing data on-premise can differ significantly from the cost of storing it off-premise, depending on the vendor and technical solution selected. Enterprises should carefully evaluate their storage strategies while architecting a Big Data solution on the cloud. E.g. Microsoft's SQL Azure (relational storage) is on the order of 100 times costlier than Microsoft's Windows Azure Table storage (NoSQL-type table storage), so a straightforward migration of an on-premise SQL Server relational database to an off-premise SQL Azure database can incur very high running costs over time.
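
As a rough illustration of how such a price gap compounds, the sketch below compares monthly storage bills for the same data set. The unit prices and the data size are hypothetical values chosen only to reflect the roughly 100x ratio mentioned above; they are not actual vendor prices.

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly storage bill for a data size and a unit price."""
    return size_gb * price_per_gb_month


# Hypothetical unit prices (per GB per month), picked only to show a ~100x gap.
RELATIONAL_PRICE = 10.00
TABLE_PRICE = 0.10

size_gb = 500  # hypothetical data set size
relational = monthly_storage_cost(size_gb, RELATIONAL_PRICE)
table = monthly_storage_cost(size_gb, TABLE_PRICE)

print(f"relational: {relational:.2f} per month")
print(f"table:      {table:.2f} per month")
print(f"ratio:      {relational / table:.0f}x")
```

Plugging in the vendor's current price sheet and the expected data growth gives a first estimate of the running cost before committing to a storage strategy.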

Data Migration Considerations - Most enterprise data sources and LoB apps are built on relational data storage. Migrating them to a non-relational data store on the Cloud can pose technical challenges in mapping the relational data elements to a flat table structure. E.g. migrating a SQL Server relational database to Windows Azure Table storage can be a challenging and technically complex task, especially because data in Windows Azure Table storage is stored as key-value entities rather than as relations, as in SQL Server. This type of data migration can break data integrity and consistency and introduce redundancy.
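
The mapping problem can be sketched as follows. This is a minimal illustration, not a migration tool: the `Customer` and `Order` types and the key naming scheme are hypothetical, and no storage SDK is called. The only names taken from Windows Azure Table storage are `PartitionKey` and `RowKey`, the two properties that identify an entity there.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Customer:
    customer_id: int
    name: str


@dataclass
class Order:
    order_id: int
    customer_id: int  # foreign key into the customers table
    total: float


def order_to_entity(order: Order, customers: Dict[int, Customer]) -> dict:
    """Flatten one order row, plus the customer it references, into one entity."""
    customer = customers[order.customer_id]
    return {
        "PartitionKey": f"customer-{customer.customer_id}",
        "RowKey": f"order-{order.order_id}",
        # Denormalized copy: the join is resolved once, at migration time.
        "CustomerName": customer.name,
        "Total": order.total,
    }


customers = {1: Customer(1, "Acme")}
orders = [Order(100, 1, 25.0), Order(101, 1, 40.0)]
entities = [order_to_entity(o, customers) for o in orders]
for entity in entities:
    print(entity)
```

The customer name now appears in every order entity. That is the redundancy referred to above: renaming the customer means updating every copy, and nothing in the store enforces that the copies stay consistent.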

Data Movement to Cloud - Moving data to the cloud is one of the major obstacles to processing data there. Enterprise data warehouses holding gigabytes to terabytes of data are often difficult to ship to the cloud unless very high-bandwidth network links are used to transport the data.
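
A back-of-the-envelope estimate shows why bandwidth matters. The sketch below assumes decimal units (1 GB = 8,000 megabits) and a hypothetical `efficiency` factor for the share of the link that is actually usable; the sizes and link speeds are illustrative, not measurements.

```python
def transfer_hours(size_gb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to upload size_gb over a bandwidth_mbps link."""
    size_megabits = size_gb * 8 * 1000  # decimal GB to megabits
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


# Hypothetical scenarios: 1 TB (1000 GB) over links of different speeds.
for mbps in (10, 100, 1000):
    print(f"1 TB over {mbps:>4} Mbps: {transfer_hours(1000, mbps):8.1f} hours")
```

Under these assumptions, 1 TB over a fully used 100 Mbps link takes about 22 hours, and ten times longer at 10 Mbps, which is why the upload window has to be planned alongside the processing window.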

Technology readiness/maturity - Cloud Computing is still evolving, and many technology vendors have not yet made their products Cloud-ready, even though they may be leaders in on-premise technology solutions. As a result, an existing on-premise technology does not always have a corresponding Cloud offering, and at times this can become a major limitation when choosing a Big Data processing solution. E.g. Microsoft's SQL Server Analysis Services and Integration Services are on-premise BI technologies that aren't supported on Windows Azure (Cloud).

To address some of the above-mentioned challenges, some intermediate solutions can be considered, such as:

Leveraging public cloud for non-regulation-sensitive data - In this strategy, only business data that is not subject to regulation is moved to the public cloud for processing. The processed results are then combined on-premise with any regulation-sensitive data to deliver combined results.
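
A minimal sketch of this split-and-combine flow is shown below. The record layout, the `regulated` flag, and the per-region sum are hypothetical stand-ins for whatever classification and processing an enterprise actually uses; both "sides" run in the same process here purely for illustration.

```python
def split_by_sensitivity(records):
    """Separate records that must stay on-premise from those that may leave."""
    regulated = [r for r in records if r["regulated"]]
    unregulated = [r for r in records if not r["regulated"]]
    return regulated, unregulated


def total_by_region(records):
    """Stand-in for the processing job: sum amounts per region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals


def combine(on_premise_totals, cloud_totals):
    """Merge the two partial results on-premise."""
    combined = dict(on_premise_totals)
    for region, amount in cloud_totals.items():
        combined[region] = combined.get(region, 0.0) + amount
    return combined


records = [
    {"region": "EU", "amount": 100.0, "regulated": True},
    {"region": "EU", "amount": 50.0, "regulated": False},
    {"region": "US", "amount": 70.0, "regulated": False},
]
regulated, unregulated = split_by_sensitivity(records)
cloud_result = total_by_region(unregulated)      # would run in the public cloud
on_premise_result = total_by_region(regulated)   # never leaves the enterprise
result = combine(on_premise_result, cloud_result)
print(result)
```

Only the aggregated cloud result travels back to the enterprise; the regulated records themselves never cross the boundary.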

Participating through Community Cloud - Enterprises with similar concerns or business objectives can form an association to pool resources among participants. This is a recommended approach when a high degree of interoperability and data sharing is required among the participating organizations, while at the same time each organization's data needs to be secured or sandboxed for its own private use.
E.g. NYSE created a community cloud for financial traders [4] to deal with use cases such as rapid provisioning of hedge fund compute-on-demand for agency brokers, processing large volumes of market data for regulatory reports for investment banks, temporary provisioning of a large compute farm to test and validate strategies for low-latency hedge funds, and testing custom-developed applications on a large farm for financial services firms of all sizes.

Creating Private Cloud - Create an enterprise private cloud by leveraging existing infrastructure, or procure private cloud infrastructure from a third party, for example by hosting Microsoft's Windows Azure Appliance. This is suggested when the enterprise cannot afford to put any of its data beyond the enterprise boundary and wants strict control over its data.

Creating Hybrid Cloud - Extend the private cloud to leverage additional compute and storage from the Public Cloud to deal with excess demand during peak periods. This helps achieve the best of both worlds but adds complexity in management, adherence to SLAs, etc.
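
The peak-demand overflow can be sketched as a simple placement rule. This is an illustrative first-fit policy under assumed inputs (job names, core counts, and a fixed private capacity are all hypothetical); a real scheduler would also weigh data locality, cost, and SLAs.

```python
def place_jobs(job_cores, private_capacity):
    """Fill the private cloud first; send jobs that do not fit to the public cloud."""
    placement = {"private": [], "public": []}
    free = private_capacity
    for job, cores in job_cores.items():
        if cores <= free:
            placement["private"].append(job)
            free -= cores
        else:
            placement["public"].append(job)
    return placement


# Hypothetical peak load: 14 cores requested against 8 private cores.
jobs = {"etl": 4, "report": 8, "index": 2}
placement = place_jobs(jobs, private_capacity=8)
print(placement)
```

With these numbers, `etl` and `index` stay on the private cloud and `report` bursts to the public cloud.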

[1] Basic Principles of European Union Consent and Data Protection, Posted on July 25, 2011 by Christina Hultsch
[2] Definition, HIPAA (Health Insurance Portability and Accountability Act)
[3] Resources discussing compliance with security and privacy for Microsoft data centers and Microsoft's Windows Azure
[4] NYSE Technologies Introduces the World's First Capital Markets Community Platform, June 1, 2011


Nice read Sudhnashu!
The most compelling part comes for the datawarehouse community. A whole new platform for data analysis and a plethora of offerings in terms of technology to support various flavors of Hadoop and Big data analysis.

It's a pleasure to read. Hoping for more on Cloud Computing.
