The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


BI Open Source Story - Are we there yet

The recession reminded us of Darwin's theory of 'Survival of the Fittest', and that is what we are seeing: the companies that focused on optimizing costs, efficiency and productivity are flying their flags high today. These are the companies that balanced managing costs with retaining their best talent. Open Source is one horizon these companies are starting to explore.

Just a year or two ago, large IT budgets were allocated to building enterprise-wide solutions, with BI and Data Integration at the core. The prime reasons for such big budgets were not only the high licensing costs of the available products but also the availability of SMEs to implement those solutions. Pick up any Gartner, Forrester or similar survey of organizational priorities from the past 4-5 years, and the top of the list will always include BI, Analytics, Data Integration and Reporting among the most in demand. Today the focus is shifting towards open source, SaaS/PaaS and Cloud, and mind you, none of the key requirements is being compromised (in fact, they are even more demanding):

1. Low cost of ownership

2. Higher performance

3. Reliability

4. Flexibility and adaptability (from an integration point of view)

5. Larger support base - compared to the vendor-provided technical support model


A few common myths about Open Source:

1. Open Source has scalability problems and can't support larger enterprise-wide applications - with concerns around poor quality of testing, performance and security capabilities. On the contrary, Open Source products go through equally, if not more, robust and rigorous testing, as the larger community (early adopters) provides far more input than traditional vendor products receive. Problems typically get fixed early in the development cycle.

2. Open Source vendors don't provide full support and ownership - Another myth, as copyright laws apply equally to Open Source vendors and traditional product vendors. The only difference is that open source vendors choose to share their IP with a larger audience. Open source does come with various support models, and with some vendors you can opt for enterprise-level support as well. In fact, as a customer you get to choose which support and services to buy.

3. It's yet to mature in the IT world - 5 years back this might have held true; today, however, with small, medium and large organizations adopting open source and benefiting from it, it can only be considered a misconception.

4. It's only fit for the developer community, and its target audience is developers - With end-to-end enterprise suites on the market providing BI capabilities, most of which are consumed by business users leveraging the pervasive nature of BI to the fullest, this statement no longer holds true. Yes, it's still quite popular in the developer community; however, organizations are now seriously considering it in their business decisions.

5. Security concerns - The claim is that, with Open Source code available to all, there is a security risk because anyone can break it. If one understands that Open Source software is built using the same standards, principles and methodologies as any other software, this myth doesn't stand a chance.


So what are the areas where Open Source can help you out in the BI World?

1. Data Integration - No matter how many DI solutions on the market claim complete automation, there is still a major chunk of manual coding in the form of PL/SQL, scripting and a variety of tools to do your ETL. All this adds to the huge integration costs in IT spending, and to the running costs of maintenance and support. Open source comes in handy in DI, providing cost advantages as well as better productivity through shorter automation cycles for integration.

2. Reporting - With a wide variety of reporting options, including Dashboards, Scorecards, Static/Dynamic reporting and Real-time analysis/Analytics, and given the pervasive nature of reporting today, deciding on the right tools within budget is a challenge. Open source tools are bringing in open standards that allow users to pick and plug in what suits their needs without being limited to one specific vendor. What they get for free is the specialized capabilities of each tool in its space, which allows businesses to make better-informed decisions.

3. Data Visualizations - Profiling data and using advanced visualizations in the early stages of integration saves a huge amount of effort, and thereby cost, in later stages: it provides a cleaner source of data to base decisions on and helps uncover hidden data inconsistencies. Open source tools with data visualization not only provide traditional graph and chart capabilities, but go well beyond them, offering advanced visualizations such as statistical measures, probabilistic measures, patterns/clusters in the data that reveal problems, data duplication, and plots on various 3-dimensional graphs.


Key Reasons to use open source:

- Lower the costs in IT Investments

- Flexibility (Extensibility & Customization), clubbed with incorporating the latest research trends

- Minimal Vendor dependency, due to open standards for integration & collaboration models

- Larger pool of technical community, helping in quicker resolutions to the technical glitches

- Data Integration based on open standards

- Use what you need (Pay for what you need basis), and pick the best features suiting your needs

- Faster turnaround time on enhancing features and capabilities

- Use before you buy

- Get the best capabilities in the Data Integration, BI and Reporting space

- Gives you a strong arm to aid your research and put it to the best use in practice


A word of caution, and an approach, when choosing to get on the Open Source Highway - The concept of Open Source in the BI world is not very old, and there is definitely room for it to mature further. This will become evident from the scale and level of implementations where Open Source does well in future, from both a performance and a scalability point of view. One needs to assess the tools' capabilities against one's needs, and get references from service providers or consulting firms on their experiences.


I feel that for any organization choosing a BI tool or set of tools, or for that matter any tool as an organizational standard, the following approach will help in the long run:


Step 1: Any organization should first list its expectations and needs from Data Integration, Reporting and BI, irrespective of what industry tools provide today. This should cover all functional, non-functional, technical, architectural and business needs.

Step 2: With that list of expectations, do a thorough vendor/tool assessment using a methodical process of elimination and shortlisting against the required features. Rank the vendors/tools using a mathematical model (e.g. weighted average) on each feature or capability you need, based on the responses from the vendors on your list.

Step 3: Finally, analysing this comparative study, provided the assessment is independent and unbiased, will help you figure out the tool that best suits your needs.
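
The weighted-average ranking mentioned in Step 2 can be sketched in a few lines of Python. This is only an illustration under assumptions: the feature names, the weights, the 1-5 rating scale and the two tools ("Tool A", "Tool B") are all hypothetical placeholders, not results of any real assessment.

```python
def weighted_score(weights, ratings):
    """Weighted average of feature ratings; a missing feature counts as 0."""
    total_weight = sum(weights.values())
    return sum(w * ratings.get(feature, 0) for feature, w in weights.items()) / total_weight


def rank_tools(weights, tool_ratings):
    """Return (tool, score) pairs sorted from best to worst."""
    scores = {tool: weighted_score(weights, r) for tool, r in tool_ratings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: how much each requirement matters to the organization.
weights = {
    "cost_of_ownership": 5,
    "performance": 4,
    "reliability": 4,
    "integration": 3,
    "support": 2,
}

# Hypothetical ratings on a 1-5 scale, e.g. taken from vendor responses.
tool_ratings = {
    "Tool A": {"cost_of_ownership": 2, "performance": 5, "reliability": 5,
               "integration": 3, "support": 4},
    "Tool B": {"cost_of_ownership": 5, "performance": 4, "reliability": 4,
               "integration": 4, "support": 3},
}

ranking = rank_tools(weights, tool_ratings)
for tool, score in ranking:
    print(f"{tool}: {score:.2f}")
```

The weights come from the expectations listed in Step 1, so changing what the organization values changes the ranking without touching the ratings themselves.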


It will be worthwhile to compare licensed commercial vendors with open source, and see which one comes out better on features versus price tag, or in a cost-benefit analysis.

In Summary - BI is not only important for decision-making capabilities; in today's economy it is vital to surviving the competition. In a challenging environment of reduced IT spending, cost controls, a reduced workforce and pressure to remain competitive, organizations' IT groups have already started their journey with Open Source technologies. Open Source in the Information Management space is becoming quite mature in the areas of Data Integration, Data Visualization, Business Intelligence and Databases, and now helps many enterprises (small, medium and large) focus on delivering business-critical information to their decision makers at lower cost and with greater flexibility. With that level of features and capabilities, combined with the cost advantage, one can't afford to ignore the Open Source story altogether. Don't forget to share your experiences with Open Source.


I must say, this blog has a grand opening with a philosophical punch line ;) It's a myth that Darwin coined the phrase "Survival of the fittest". In fact, the phrase was originally coined by the British philosopher Herbert Spencer. Pardon me, Yogesh, for crossing the boundary ;)) Well, getting back to the subject, I concur with Yogesh's overall view on Open Source BI. I would just like to add one more myth to Yogesh's list. People generally tend to think open source is free of cost. In fact, open source vendors usually offer two editions: 1) Community Edition - which is free of cost. 2) Commercial Edition - which is charged based on licensing terms. There is no reason why companies should not go for commercial editions of open source tools; after all, they compete directly with other product vendors. The standards set by commercial editions of open source tools match those of the product vendors point for point. Open source is poised to play a crucial role not only in the BI space but in the IT industry overall.

Many of the general OSS myths listed above keep getting busted and rebutted every year :-)
At least from a BI perspective, you might find that for ETL, normal Unix tools can be much faster than many of the off-the-shelf (OTS) software packages available today for this purpose. I built a data warehouse in 1996 when working for Infosys. Massaging the loadable text files using sed and regular expressions, then loading them with SQL*Loader in direct mode, was blazing fast. Add a little bit of diff magic and you get a really fast reduction in the records to process. Also, with distributed tools like MongoDB and GridGain/Hadoop, and the general advancement in open source map-reduce tools, you now have a lot more options for creating summary data at much less expense.
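
As a rough illustration of the "massage, then diff" idea above, here is a minimal Python sketch. It is not the original sed/diff scripts; the pipe-delimited record layout and the function names are assumptions made up for the example.

```python
import re


def normalize(line):
    """Clean one raw record, much as a sed regular expression would:
    trim the line and remove whitespace around the '|' delimiters."""
    return re.sub(r"\s*\|\s*", "|", line.strip())


def changed_records(previous_lines, current_lines):
    """Keep only records that are new or modified since the previous extract,
    so the loader has far fewer rows to process."""
    previous = {normalize(line) for line in previous_lines}
    return [rec for rec in map(normalize, current_lines) if rec not in previous]


# Hypothetical extracts from two consecutive nightly runs.
yesterday = ["1|alice|10", "2|bob|20"]
today = ["1 | alice | 10", "2|bob|25", "3|carol|30"]

delta = changed_records(yesterday, today)
print(delta)
```

Only the changed and new rows survive, which is the record reduction the comment describes; a real pipeline would stream files rather than hold both extracts in memory.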

Yogesh has put forward a very good view of the BI market from an Open Source angle. I completely agree with Yogesh's words, and want to add a few points about Open Source BI solutions and give some details on one option that Satheesh has mentioned, namely Hadoop.
As we know, this market can never be called a "dead market", and there is now more and more emphasis on knowledge-driven applications for enterprises, internally or externally. So whether it is sophisticated high-end analysis or ad hoc reporting jobs, BI is there.
BI is work over data. Since data is vital and decisions are based on analysis, most companies either have a vision for data analysis systems or are already implementing them.
Samples of data that are useful to companies include:

a. Server logs: 1. Fault detection, 2. Performance-related analysis

b. Network logs: 1. Network optimization, 2. Network fault detection

c. Transaction logs: 1. Financial analysis

d. Email traces/logs: 1. Consumer email analysis, 2. Decision systems based on email analysis

e. Call Data Records (Telecom domain): 1. User behavior analysis, 2. Prediction of customer churn, 3. Service association analysis

f. Distributed Search: 1. At the scale of Web data (petabytes)

The examples I have described here are limited, and the list is growing day by day.

The interesting points are:
a. Data can be structured, semi-structured or unstructured
b. For a few of the use cases you can still go for, and might afford, a sophisticated BI solution based on a Data Warehouse and ETL. For others, we normally don't even keep backups for more than a year, because that requires hard disks, tapes, etc. Yet even this data is important for various analyses.

So these problems require an economical solution that also scales, and one option for this requirement is the open source Apache Hadoop. It is designed on the Shared Nothing (SN) architecture principle. SN is an approach to distributed computing in which each node is independent and autonomous, and there is no single point of bottleneck. Google has demonstrated how SN can scale almost without limit; Google calls it sharding.
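
To make the shared-nothing idea concrete, here is a minimal Python sketch of hash-based sharding, where each record is routed to exactly one node by its key and no node needs to consult another. It is a generic illustration, not how Hadoop or Google actually place data; the function names and the sample call records are assumptions.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash, so the same key
    always lands on the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(records, num_shards):
    """Split (key, value) records into per-shard lists; each shard can then
    be stored and processed independently of the others."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards


# Hypothetical call data records keyed by subscriber id.
records = [("sub-1001", 42), ("sub-1002", 7), ("sub-1001", 13), ("sub-1003", 99)]
shards = partition(records, 3)
for index, shard in enumerate(shards):
    print(index, shard)
```

Because all records for one key end up on one shard, per-key analysis (for example, per-subscriber behavior) needs no cross-node coordination.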

Hadoop has key components that make it suitable for solving these problems
a. Distributed File System: HDFS
b. Map Reduce Implementation : for parallel processing
c. SQL interface for Map Reduce : for Data warehouse kind of solutions
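
The Map Reduce component can be illustrated with a small in-memory Python sketch that counts errors per server from log lines. This only mimics the programming model (map, group by key, reduce) on a single machine; it does not use Hadoop's API, and the log format and function names are assumptions for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (server, 1) for every ERROR line; other lines emit nothing."""
    parts = line.split(maxsplit=2)
    if len(parts) >= 2 and parts[1] == "ERROR":
        yield parts[0], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical server log lines: "<server> <level> <message>".
logs = [
    "web01 ERROR disk full",
    "web02 INFO started",
    "web01 ERROR timeout",
    "web02 ERROR out of memory",
]
error_counts = run_job(logs)
print(error_counts)
```

In a real cluster the map calls run in parallel on the nodes holding the data and the grouped keys are distributed across reducers, which is what lets the same logic scale to very large logs.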

It has more components and many more features, but I am restricting myself to an introduction to Hadoop as an option when choosing an Open Source BI tool set / framework for a commercial solution or research work.
Companies have already put Hadoop to work for Data Warehouse / BI implementations. For example, Facebook has created a BI / Data Warehouse solution based on Hadoop.

Statistics for the Facebook cluster are roughly as follows:
> 4 TB of compressed new data added per day
> 135TB of compressed data scanned per day
> 7500+ Hive jobs on production cluster per day
> 80K compute hours per day
Many other companies are using Hadoop for BI / Data Warehouse related problems.

Before signing off, let's also look at some of the limitations of Hadoop:
1. Hadoop is built for batch jobs and not for real-time jobs
2. Hadoop has high latency (because of the overhead of distributing jobs across nodes); it is tuned for throughput on large batches rather than for fast individual responses

So it is wise to explore and invest in Open Source BI solutions, and one flavor in that space is Hadoop, keeping its limitations in mind.

