Hadoop – the Open Source BI + Data Warehouse Solution
Increasing the capacity of a single computer will not help any more now that we live in the Data Age; instead we will use multiple computers and treat them as one.
It’s not easy to measure the volume of data stored electronically. Here are a few statistics published in an IDC study (http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf):
- The digital universe in 2007 — at 2.25 × 10^21 bits (281 exabytes, or 281 billion gigabytes) — was 10% bigger than we thought.
- By 2011, the digital universe will be 10 times the size it was
Interesting! So why am I talking about data when I’m supposed to talk about BI? Let’s find out.
As these statistics show, data is growing and growing, even faster than we can imagine. So analysis at this scale also requires innovative solutions that make sense from the cost and data-type points of view.
As we know, this market is far from a “dead market”, and there is ever more emphasis on knowledge-driven applications for enterprises. Some of these applications are developed in-house; other customers are betting on vendors to fulfill their dreams. Whatever the case, the heart of these applications is BI.
By some reports, the BI market alone is around 5 billion US dollars, and it keeps growing.
BI systems feast on data. Traditionally, BI systems work on structured data, and quite successfully. Traditional BI solutions rely on RDBMS data stores; a BI system may use an RDBMS store directly, or in combination with a data warehouse (DW). A data warehouse is an intermediate store between application data stores and BI systems, and it does the intensive job of extracting, transforming and loading (ETL) data.
There are many big players in this field providing database-related products, and they latched onto the BI + data warehousing opportunity. Because it was a niche domain with few players, vendors served up solutions with a full bowl of salt rather than a pinch: their services are very expensive and the maintenance cost is also very high. Sooner or later this started hitting customers, and in this recession saving came first for everyone.
The second problem with current or traditional BI+DW solutions is that they have very limited features for processing unstructured or semi-structured data.
Nowadays customers are looking for knowledge-driven applications, which require crunching unstructured and semi-structured data along with structured data.
Customers are interested in analyzing gigabytes or petabytes of data. Sometimes this analysis is not the foremost, business-critical requirement of the BI solution, but as an analysis ecosystem for knowledge-based systems this type of BI analysis is still very important.
With the proliferation of Web and Web 2.0 technologies, customers have a large amount of unstructured data too, compared to structured data, and it keeps growing beyond what we can imagine. Client systems that provide more and more interaction with customers will have more and more unstructured and semi-structured data, and interestingly this data is just as important as the traditional BI+DW data sets.
So customers want to process this data as well, to solve and explore business-critical problems. At this point a customer has to think in two directions: (a) structured data and (b) unstructured and semi-structured data. For the former, the traditional approach with pure RDBMS-based repositories works; for the latter, the traditional approach fails because of the structure of the data and its scale.
The cost of traditional solutions keeps growing as your data grows past a threshold, on top of the hefty amount already paid for the traditional BI+DW solution. Most of these solutions need high-end hardware to run, which further increases the cost of the solution and reduces the customer’s ROI.
Now, if we talk about the second direction, in which unstructured or semi-structured data needs to be crunched, the biggest hurdle is the scale of the data (which depends on the source and event of information), and it will keep increasing day by day. Even if a traditional BI+DW solution stack could process this type of data, at this scale the cost of the solution would be very high; and it’s not a one-time problem, so it needs to be addressed properly to gain an edge in the longer run.
Often the sources of information in existing systems are not RDBMS-based, and we mostly need ETL steps to make the data available in an RDBMS-based data warehouse. In this situation, BI+DW can be replaced by other solutions built for unstructured and semi-structured data. These solutions provide two easy-to-understand advantages: cost and scale of data. An example in this category comes from the telecom domain, where the main BI activity revolves around Call Detail Records (CDRs), mostly in ASCII format. In telecom, thanks to the huge number of consumers, the resulting traces of service usage are CDRs, so the amount of data to be handled is huge.
So scalability is one of the prominent problems in the telecom BI+DW domain, and economical solutions are what telecom service providers need. The brains working on the problems reasoned out above became somewhat successful when the open source community started working in this domain: what was needed was a system cheap compared to traditional BI+DW solutions, with an advantage in processing unstructured and semi-structured data.
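As a taste of the kind of CDR crunching meant here, below is a minimal single-machine sketch in Python. The CDR layout (caller, callee, duration in seconds, comma-separated) is made up for illustration; a Hadoop job would perform the same aggregation in parallel over terabytes of such ASCII records.

```python
from collections import defaultdict

# Hypothetical ASCII CDR format: "caller,callee,duration_seconds"
cdr_lines = [
    "9198001,9198002,120",
    "9198001,9198003,60",
    "9198004,9198001,300",
]

def talk_time_per_caller(lines):
    """Sum call duration per caller -- the kind of aggregation a
    Hadoop MapReduce job performs in parallel over huge CDR sets."""
    totals = defaultdict(int)
    for line in lines:
        caller, _callee, duration = line.split(",")
        totals[caller] += int(duration)
    return dict(totals)

print(talk_time_per_caller(cdr_lines))
# {'9198001': 180, '9198004': 300}
```

On a single machine this is trivial; the point of Hadoop is that the same per-record logic keeps working unchanged when the input no longer fits on one box.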
First we will look at some real examples of data processing across various industries; this gives a fair idea of where Hadoop can be useful.
Samples of data that are useful to companies range over:
a. Server logs
   i. Fault detection
   ii. Performance-related analysis
b. Network logs
   i. Network optimization
   ii. Network fault detection
c. Transaction logs
   i. Financial analysis
d. Email traces/logs
   i. Consumer email analysis
   ii. Decision systems based on email analysis
e. Call Detail Records (telecom domain)
   i. Precision marketing analysis
      - User behavior analysis
      - Customer churn prediction
      - Service association analysis
f. Distributed search
   i. At the scale of Web data (petabytes)
The examples described here are only a sample, and they are increasing day by day.
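To make the server-log use case concrete, here is a minimal sketch, in Python, of the two halves of a fault-detection MapReduce job. The log format ("LEVEL service message") and the service names are assumptions for illustration; on Hadoop, the framework would run the map over many machines and shuffle the emitted pairs to reducers.

```python
def map_errors(lines):
    """Map phase: emit (service, 1) for every ERROR line.
    Assumed log format: 'LEVEL service message'."""
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) >= 2 and parts[0] == "ERROR":
            yield parts[1], 1

def reduce_counts(pairs):
    """Reduce phase: sum the counts per key."""
    totals = {}
    for key, n in pairs:
        totals[key] = totals.get(key, 0) + n
    return totals

log = [
    "INFO billing request served",
    "ERROR billing timeout on upstream",
    "ERROR auth invalid token",
    "ERROR billing timeout on upstream",
]
print(reduce_counts(map_errors(log)))
# {'billing': 2, 'auth': 1}
```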
The interesting points are
a. Data can be structured, semi-structured or unstructured
b. The scale of the data keeps growing, and that needs to be tackled using innovative solutions beyond the traditional BI+DW ones.
c. For a few of the use cases you might still be able to afford a sophisticated BI solution based on a data warehouse and ETL; but for others, we normally don’t even keep backups for more than a year because of storage limitations, and analyzing that data with traditional tools would be very costly. Yet even this data is important for various analyses.
One of the prominent options to target these problems is to use Open Source BI solutions.
Hadoop is one Open Source solution that has received a lot of attention from industry. It is an Apache project and Doug Cutting’s baby (he is also the parent of the Lucene API). I’ll discuss Hadoop further here.
Hadoop (http://hadoop.apache.org) is under the Apache group umbrella. It is designed on the Shared Nothing (SN) architecture principle. SN is a distributed computing approach in which each node is independent and autonomous, and there is no single point of bottleneck. Google has demonstrated how SN can scale almost infinitely; Google calls it sharding.
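As a tiny illustration of the shared-nothing idea, here is a sketch (names and the 4-node cluster are made up) of routing each record to exactly one owning node by hashing its key; no node needs to coordinate with any other to serve its own shard.

```python
import hashlib

def shard_for(key, num_nodes):
    """A stable hash of the key picks exactly one owning node --
    the basis of shared-nothing scaling ('sharding')."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Distribute some (hypothetical) customer IDs across 4 independent nodes.
placement = {cid: shard_for(cid, 4) for cid in ["c1", "c2", "c3"]}
print(placement)
```

Because placement is a pure function of the key, adding capacity means adding nodes and re-hashing, not scaling up any single machine.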
Hadoop has key components that make it suitable for solving these problems:
a. Hadoop Core: provides features like fault tolerance, job monitoring, etc.
b. Distributed file system: HDFS
c. MapReduce implementation: for parallel processing
d. SQL interface over MapReduce, for data-warehouse-style solutions: Hive
It has more components and many features, but I will restrict myself to introducing Hadoop as an option when choosing an Open Source BI tool set / framework for commercial solutions or research work.
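To illustrate the MapReduce model behind the last two components, here is a minimal, single-machine sketch in Python of the classic word-count job. On Hadoop the map calls run in parallel across HDFS blocks, and the framework itself does the grouping-by-key between map and reduce; the Hive query in the comment (table and column names are hypothetical) expresses the same computation declaratively.

```python
from itertools import groupby
from operator import itemgetter

# Equivalent Hive query (hypothetical table/column names):
#   SELECT word, COUNT(*) FROM words GROUP BY word;

def mapper(line):
    # Map phase: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all counts for one word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle/sort: group the mapped pairs by key, as Hadoop does
    # between the map and reduce phases.
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(w, (c for _, c in grp))
                for w, grp in groupby(pairs, key=itemgetter(0)))

print(word_count(["big data big", "data warehouse"]))
# {'big': 2, 'data': 2, 'warehouse': 1}
```

The value of the model is that `mapper` and `reducer` contain no distribution logic at all; Hadoop supplies the parallelism, fault tolerance, and the shuffle.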
Companies have already put Hadoop to work for data warehouse / BI implementations. For example, Facebook has created a BI / data warehouse solution based on Hadoop.
Statistics about the Facebook cluster look something like:
· 4 TB of compressed new data added per day
· 135TB of compressed data scanned per day
· 7500+ Hive jobs on production cluster per day
· 80K compute hours per day
One more innovative company, in the cloud computing category, has recently published how it used Hadoop to solve a business problem: Rackspace did innovative work analyzing email logs to support its customers better, since customer service is Rackspace’s USP (http://blog.racklabs.com/?p=66). It is also interesting to find Solr and Lucene used in this implementation.
There are many other companies who are using Hadoop for BI / Data warehouse related problems.
There are some limitations of Hadoop that I would also like to highlight briefly:
1. Hadoop is built for batch-style jobs, not interactive work
2. Hadoop has high latency (a consequence of how it distributes jobs across nodes), even though its aggregate throughput is high
So it is wise to explore and invest in Open Source BI solutions, and one flavor in that space is Hadoop.
Right now the IPL (Indian Premier League) of cricket is not as famous as the EPL (English Premier League) of football, as it is only 3 years old, but the IPL has still come a long way :)
In the same way, Hadoop is new but is here to stay, and it will move into the mainstream as more research and applications materialize around it to showcase its capability and value.
Continuing on this blog, I would next like to discuss the architecture of a BI+DW solution, which will give us more points of discussion and improvement toward creating a Hadoop-based BI+DW framework.
Your views are more than welcome.