Big Data, Cloud and Analytics
Big Data is one of today's biggest buzzwords because it poses enormous challenges and complex problems that technologists around the world are busy trying to solve. Big Data refers to huge sets of structured and unstructured data. Structured data can be classified and stored in a pre-defined table schema. For example, when we submit an online payment form, the required fields are known beforehand and can be stored in a well-defined schema. Semi-structured or unstructured data, on the other hand, is free-form data that does not adhere to any particular schema and is hard to parse and process. Examples include hashtags appearing in tweets, Facebook updates, comments and mentions, logs, and the topics and sub-topics that go into a wiki page.
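To make the distinction concrete, here is a minimal Python sketch. The `Payment` fields, the sample tweet and the `extract_tags` helper are all hypothetical and chosen only for illustration: the payment record has a schema fixed in advance, while the tweet has to be parsed before any structure can be recovered from it.

```python
import re
from dataclasses import dataclass


@dataclass
class Payment:
    """Structured record: every field is known before the data arrives."""
    payer: str
    amount: float
    currency: str


# Patterns for pulling structure out of free-form text (illustrative only;
# real social media text needs far more careful handling).
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract_tags(text: str) -> dict:
    """Return the hashtags and mentions found in a piece of free-form text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
    }


payment = Payment(payer="alice", amount=25.0, currency="USD")
tweet = "Loving the keynote by @bob! #BigData #cloud"

print(payment)
print(extract_tags(tweet))
```

The structured record can go straight into a table column by column, whereas the tweet yields useful fields only after parsing, and a different kind of text would need different rules.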
Organizations capture data from various sources relevant to their business. Captured data volume has grown rapidly in the last several years, and managing data has always been a difficult problem to solve. Big Data adds complexity to this problem mainly in the following ways:
- The growth rate of captured data volume is high and increasing by the day. It is difficult to comprehend the amount of data gathered each day. Data comes from various sources including but not limited to:
· Millions of financial transactions
· Social media updates
· Employee swipe-ins & swipe-outs
· Guest records and preferences at hotels
· Hundreds of thousands of flight records per day
· Information relayed by satellites
· Daily news
· Individuals' medical history and healthcare records, etc.
As per Wikipedia, an average of 340 million tweets were posted per day in 2012. Imagine the number of social media updates and photo uploads each day. The data size is huge and is generated at lightning speed.
This is creating storage problems, and there is a need for innovative storage solutions. Beyond effective storage, we also need faster and more effective data transfer technologies. Storage capacity that was more than sufficient three years ago is no longer adequate for the data volumes generated today. Scaling out is not an optimal solution in the long term, because it addresses size but not speed. Saving data across multiple clusters so that it is highly available to distributed systems in real time has to be improved with new technologies, which may in turn need new protocols.
- Processing huge data sets requires enormous power in terms of computing cycles and time. We need powerful and smart parallel processing systems which can process large data sets fairly quickly. Google took the lead here and came up with the famous MapReduce framework to solve this problem. It has gained popularity and is considered the de facto standard in its space. Apache Hadoop, which is an open-source implementation of MapReduce, is widely used. Hadoop, with its ecosystem of technologies, is a framework for distributed processing.
- A large part of the data being captured is unstructured or semi-structured. Tweets, social updates, blogs, pictures and the like all contribute to unstructured data. Deriving useful information out of it is a challenge. Conventional relational databases are effective for structured data. To store and query unstructured data effectively, NoSQL database solutions have emerged. Following the NoSQL philosophy, Google, Facebook and others have developed databases specifically to store and process unstructured data. This is where a good analytics solution can help mine and extract actionable information.
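The MapReduce model mentioned above can be illustrated with the classic word-count example. What follows is a single-process Python sketch of the idea only, not Hadoop's or Google's actual API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this illustration. In a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data big cloud", "cloud analytics"]
print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread across machines, which is what makes the model suitable for very large data sets.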
Cloud does not provide any direct solution to Big Data problems, but it helps. Traditionally, organizations built their own datacenters to store and process data. Building and maintaining a datacenter is an expensive affair, and here cloud helps in reducing costs. SMBs which can't afford their own infrastructure can leverage cloud for data storage and processing. Together, IaaS and PaaS provide cost-effective solutions. A big advantage of using third-party IaaS and PaaS solutions is that one can scale up or down dynamically as needed.
Cloud-based data processing solutions have also started emerging, and they can be a good value-add. Google has started offering data analysis (OLAP) services through the cloud. Known as Google BigQuery, the service offers SQL-based query and analysis on large and very large data sets. It is exposed as a REST-based web service and also comes with its own browser tool.
With cloud, of course, there is another important aspect to consider: data security. Data stored in the public cloud can raise multiple security concerns, which consumers should understand before taking that route.
Data stored just for the purpose of record-keeping does not have much value. The real value lies in bringing out useful information hidden in the data. Organizations mine and analyze data to derive useful information which helps them make important business decisions. Traditionally this is known as Business Intelligence. It is complex but still easier with structured relational data. Mining information from very large volumes of semi-structured or unstructured data changes the game completely. Traditional data warehouses and mining techniques may not work, and a more sophisticated approach may be required. Perhaps we can take a cue from SIEM systems, which work on large volumes of semi-structured data in the security domain. SIEM (Security Information and Event Management) systems collect security-related data and event logs from network resources such as routers and switches and from applications such as anti-virus software, and mine that information in real time to detect potential security breaches and generate alerts. They use complex correlation rules to link data from different resources and analyze it.
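As a rough illustration of what a correlation rule does, the Python sketch below links events from two different sources by IP address. Everything in it is hypothetical: the event layout, the device names, the `correlate` function and the 60-second window are assumptions made for the example, and real SIEM products use far richer rule languages.

```python
def correlate(events, window=60):
    """Alert when a failed login follows a port scan from the same IP.

    `events` is an iterable of (timestamp_seconds, device, ip, kind) tuples.
    The rule fires if a "login_failed" event arrives within `window` seconds
    of the most recent "port_scan" event from the same IP address.
    """
    last_scan = {}  # ip -> timestamp of the most recent port scan
    alerts = []
    for timestamp, device, ip, kind in sorted(events):
        if kind == "port_scan":
            last_scan[ip] = timestamp
        elif kind == "login_failed":
            scan_time = last_scan.get(ip)
            if scan_time is not None and timestamp - scan_time <= window:
                alerts.append((timestamp, ip))
    return alerts


# Events reported by two different sources: a router and an auth server.
events = [
    (0, "router", "10.0.0.5", "port_scan"),
    (30, "auth-server", "10.0.0.5", "login_failed"),
    (40, "auth-server", "10.0.0.9", "login_failed"),
    (200, "auth-server", "10.0.0.5", "login_failed"),
]

print(correlate(events))
```

Neither event is alarming on its own; the rule only raises an alert when the two sources agree on the same IP address within a short time span, which is the kind of linking the paragraph above describes.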
Companies like Teradata and IBM have developed their own Big Data analytics solutions. This is a relatively niche area with enormous business potential, and these companies are some of the early players.
Big Data problems have been there for a while. They have become prominent only now because of ever-increasing social networking, data sharing and transaction processing.
Companies like Google and Facebook have developed some remarkable technologies along the way and have taken a big step by sharing them with the world. This has given big data solution providers a head start and paved the way for new technologies that solve these problems more effectively. Strong industry forces are joining hands to develop the technologies of tomorrow, which should help bring down the anxiety around Big Data. What we need now is a strong ecosystem of applications capable of handling big data.