Infosys’ blog on industry solutions, trends, business process transformation and global implementation in Oracle.


Big Data & ETL's Evolution


The need for Extract, Transform, Load (ETL) tools will remain as long as data is being consumed. ETL tools have traditionally been used for batch processing, transforming data into the format required by the data warehouse. These transformations have grown more complex with the enormous growth in the volume of unstructured data.

At a high level, the Big Data Hadoop ecosystem consists of:

  • Structured data : Highly organized data, typically stored in tables.
  • Unstructured data : Data not stored in any organized form, e.g. data from social media, smartphones, sensors, images, emails, etc.
  • Hadoop : Framework for storing and processing extremely large data sets. Its storage layer, the Hadoop Distributed File System (HDFS), breaks the data into blocks and stores them across the participating node servers.
  • MapReduce : Software framework for processing large data sets in parallel across the nodes of a cluster. Map tasks process the input blocks, and Reduce tasks aggregate the map output.
  • Spark : Data analytics engine that operates on distributed data sources such as Hadoop.
  • Pig & Hive : Both reduce the complexity of writing MapReduce programs. Pig offers a scripting language and Hive a SQL-like query language.
  • Sqoop : Moves data between Hadoop and relational databases.

(Note: some of the above components are optional)
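
To make the block storage and MapReduce items above more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code: the function names and the tiny block size are assumptions chosen for illustration, and a real cluster would run the map tasks in parallel on the nodes that hold each block.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks, loosely mimicking how HDFS stores a file."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map task: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce task: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    """Run the map, shuffle and reduce steps over the blocks, one after another."""
    blocks = split_into_blocks(lines, block_size)
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data needs ETL", "ETL feeds the warehouse", "big data grows"]
    print(word_count(sample))
```

Pig and Hive exist largely so that this kind of map and reduce logic does not have to be written by hand for every job.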


Fig 1. Hadoop Eco System 

Given the growth and significance of unstructured data, there is an increasing need for the major ETL players to offer options for transforming unstructured data for use in analytics. Most ETL tools in the market are moving in that direction. Here are some of the ETL tool offerings for Big Data:

Oracle - ODI:

Oracle's Big Data approach is to let a client's existing data architecture incorporate Big Data, so that the business gets more value from it, can use it for analytical reporting, and can support other Big Data needs. Oracle Data Integrator (ODI) is a key tool for Oracle in this pursuit. The Big Data wizard in ODI supports many of the newer Hadoop technologies.


Fig 2. Oracle Data Integrator

ODI uses an ELT approach and does not require a middle-tier engine to support Big Data components, whereas typical ETL tools need intermediate servers that convert the mapping into a programming language such as C++ for execution. ODI relies on its main strength, using the processing power of the underlying database, to support Big Data as well. Because ODI produces native code for the target platform, processing can be very efficient.
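
The difference between the two styles can be sketched with Python's built-in sqlite3 module standing in for the target database. This is not ODI code; the table names and sample rows are assumptions made up for the example. The first function pulls rows into the integration tier and aggregates them there, while the second sends a single set-based statement and lets the database engine do the work.

```python
import sqlite3

# Hypothetical sample data: (region, amount)
ROWS = [("north", 120.0), ("south", 80.0), ("north", 30.5), ("east", 55.0)]


def load_staging(conn, rows):
    """Load the raw rows into a staging table inside the target database."""
    conn.execute("CREATE TABLE staging_sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", rows)


def etl_middle_tier(conn):
    """ETL style: fetch rows, aggregate in the integration tier, write the result back."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM staging_sales"):
        totals[region] = totals.get(region, 0.0) + amount
    conn.execute("CREATE TABLE sales_by_region_etl (region TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO sales_by_region_etl VALUES (?, ?)", sorted(totals.items())
    )


def elt_pushdown(conn):
    """ELT style: one set-based statement, executed entirely by the database engine."""
    conn.execute(
        "CREATE TABLE sales_by_region_elt AS "
        "SELECT region, SUM(amount) AS total FROM staging_sales GROUP BY region"
    )


def run():
    """Run both styles on the same staging data and return both results."""
    conn = sqlite3.connect(":memory:")
    load_staging(conn, ROWS)
    etl_middle_tier(conn)
    elt_pushdown(conn)
    query = "SELECT region, total FROM {} ORDER BY region"
    etl = conn.execute(query.format("sales_by_region_etl")).fetchall()
    elt = conn.execute(query.format("sales_by_region_elt")).fetchall()
    conn.close()
    return etl, elt


if __name__ == "__main__":
    etl_result, elt_result = run()
    print(etl_result == elt_result)
```

Both paths produce the same totals. What changes is where the computation happens, which is the point of the ELT argument above.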

IBM - Cognos:

IBM has introduced a new suite, 'BigInsights', for Big Data and analytical reporting. BigSQL enables Cognos to use Hadoop as a data source. BigSQL can access Hive, HBase and Spark at the same time through a single database connection to Hadoop.

Business analysts and executives can view visually rich Big Data reports through the Cognos presentation service, which helps in understanding Big Data. With BigInsights and BigSQL, IBM provides tools for running Hadoop operations, including the ability to exchange components with the existing Cognos infrastructure and functionality.

IBM - DataStage:

IBM's DataStage platform provides easy integration of heterogeneous data, including big data at rest (data that is stored and then analyzed, e.g. conventional data warehousing) and big data in motion (dynamic data handled by a real-time or operational intelligence architecture, e.g. trading, fraud detection, etc.).
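
The distinction between data at rest and data in motion can be illustrated with a small Python sketch. It has nothing to do with DataStage internals: the function names, the window size and the threshold rule are assumptions, and the rule is only a toy stand-in for real fraud detection.

```python
from collections import deque


def batch_average(stored_amounts):
    """Data at rest: the full data set is already stored, so analyze it in one pass."""
    return sum(stored_amounts) / len(stored_amounts)


def flag_suspicious(stream, window=3, factor=3.0):
    """Data in motion: inspect each event as it arrives, keeping only a small window.

    Yields every amount that exceeds `factor` times the average of the
    previous `window` amounts.
    """
    recent = deque(maxlen=window)
    for amount in stream:
        if len(recent) == window and amount > factor * (sum(recent) / window):
            yield amount
        recent.append(amount)


if __name__ == "__main__":
    transactions = [20.0, 25.0, 22.0, 300.0, 21.0, 24.0]
    print(batch_average(transactions))
    print(list(flag_suspicious(iter(transactions))))
```

The batch function needs the whole list before it can answer, while the generator reacts to each transaction as soon as it arrives and never holds more than the last few values.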

Newer versions of DataStage include components such as Big Data File stages to read and write files on HDFS, Hive stages, and stages that automatically generate MapReduce programs.

Talend Studio for Data Integration:

The Talend Data Fabric solution delivers high-scale, in-memory fast data processing. It generates native Spark and MapReduce code that takes advantage of Hadoop's parallel processing environment.

Talend Open Studio is an open source solution and can be downloaded at no cost, but support is provided only for the subscription products. The subscription products have additional functionality such as a shared repository, versioning and dashboards.

PowerCenter Informatica:

Informatica Corp launched Informatica Big Data Edition (BDE), which can be used for ETL in a Hadoop environment as well as with an RDBMS. Informatica BDE is available in versions 9.6 and later.

BDE runs in two modes: Native mode for regular PowerCenter ETL, and Hive mode, which additionally supports Big Data. Mappings pushed to Hive are executed on the Hadoop cluster using Hadoop's parallelism (through its MapReduce capability).
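
As a rough illustration of what pushing a mapping to Hive means, the sketch below renders a small declarative mapping as a HiveQL statement that a cluster could then execute. It does not reflect how Informatica BDE translates mappings internally; the mapping structure, the function name and the table and column names are all hypothetical.

```python
def mapping_to_hiveql(mapping):
    """Render a simple declarative mapping as a HiveQL INSERT ... SELECT statement."""
    select_list = ", ".join(
        f"{expr} AS {alias}" for alias, expr in mapping["columns"].items()
    )
    sql = (
        f"INSERT OVERWRITE TABLE {mapping['target']} "
        f"SELECT {select_list} FROM {mapping['source']}"
    )
    if mapping.get("filter"):
        sql += f" WHERE {mapping['filter']}"
    if mapping.get("group_by"):
        sql += " GROUP BY " + ", ".join(mapping["group_by"])
    return sql


# Hypothetical mapping: every name here is illustrative only.
SALES_MAPPING = {
    "source": "raw_sales",
    "target": "sales_by_region",
    "columns": {"region": "region", "total_amount": "SUM(amount)"},
    "filter": "amount > 0",
    "group_by": ["region"],
}

if __name__ == "__main__":
    print(mapping_to_hiveql(SALES_MAPPING))
```

The ETL tool's job ends once the statement is generated. Hive then plans and runs it on the cluster, which is where the parallelism comes from.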

SQL Server Integration Services (SSIS):

Microsoft's new tools for Visual Studio 2015 contain new SQL Server Integration Services (SSIS) tasks. These provide ETL options on Apache Hadoop: Sqoop for data import/export, Hive for SQL queries, the MapReduce distributed programming infrastructure, and ODBC drivers to connect to your data in HDFS from tools like Excel and SQL Server.

Jaspersoft ETL:

Jaspersoft amended its OEM agreement with Talend to use native connectors to Apache Hadoop Big Data environments in Jaspersoft ETL. The integration of Talend into the Jaspersoft BI Suite also supports all Big Data use cases.

Talend supports major Big Data platforms including Amazon EMR, Apache Hadoop (HBase, HDFS, and Hive), Cassandra, Cloudera, etc. For robust performance and reliability, the Big Data Edition has high availability and load balancing features for critical reporting and analysis requirements.

List of ETL Big Data Solutions Vendor-wise:


Tool | Big Data | Big Data in Cloud
Oracle ODI | ODI for Big Data | Oracle Data Integrator Cloud Service
IBM Cognos | BigInsights Suite | IBM BigInsights on Cloud
IBM DataStage | Native Big Data file stages | IBM Bluemix - IBM InfoSphere DataStage on Cloud
Informatica | Informatica Big Data Edition (BDE) | Informatica Big Data Edition (BDE)
Microsoft SSIS | SQL Server Data Tools for Visual Studio 2015 | Azure Data Factory
Talend | Talend Big Data Integration platform | Talend Integration Cloud
Jaspersoft ETL | Talend native connectors | Amazon Redshift

-  Xavier Philip
