Infosys’ blog on industry solutions, trends, business process transformation and global implementation in Oracle.

« Gamify your Sales Automation (Baseball theme example included!) | Main | FCCS Integration with Oracle Fusion Financials - End to end process and pain points »

Oracle Exadata, MPP Databases or Hadoop for Analytics


There is a plethora of databases today for example SQL databases, No SQL databases, Open source databases, Columnar databases, MPP Databases etc. Oracle which is a leader in the relational databases space is often compared to these. So let us look at some basic differences between Oracle and some other players like MPP databases (Teradata, Vertica, Greenplum, Redshift, Netezza etc). and Hadoop for various Analytics workflows


Oracle, Teradata, Vertica, Greenplum, PostgresSQL, Redshift and Netezza are all Relational Databases. However, Teradata, Vertica, Greenplum, PostgresSQL, Redshift and Netezza are massively parallel processing databases which have parallelism built into each component of its architecture. They have a shared nothing architecture and no single point of failure. On the other hand, Oracle database has a shared everything architecture. Even Exadata (which is Oracle's engineered system or Oracle's database appliance specifically for Analytics or OLAP) is based on existing Oracle engine which means any machine can access any data which is fundamentally different from Teradata as shown in diagram below. Thus MPP databases are able to break a query into a number of DB operations that are then performed in parallel thus increasing the performance of a query

This brings us to the next logical question of how are these MPP databases different from Hadoop? Hadoop is also an MPP platform. The more obvious answer would be MPP databases are used for structured data while Hadoop can be used for structured or for unstructured data with HDFS - a distributed file system. Also, while MPP databases introduce parallelism mainly in storage and access of data.  Hadoop, with Map Reduce framework is used for batch processing of large amounts of structured and unstructured data more like an ETL tool. So it is a data platform



Oracle and most MPP databases use SQL interface while Hadoop uses Map reduce programs or Spark which are java based interfaces. Apache HIVE project however is aimed towards introducing a SQL interface over Map Reduce programs.


The other difference between these systems is that most MPP databases like Teradata and Oracle Exadata run on propriety hardware or appliances while Hadoop runs on commodity hardware.


Oracle Exadata and most MPP databases scale vertically on propriety hardware while Hadoop scales horizontally which results in a very cost effective model especially for large data storage


The MPP databases use columnar data storage techniques while Oracle uses row wise storage which is less efficient in disk space usage and also in performance to columnar storage. However, Oracle Exadata uses Hybrid Columnar Compression (HCC) which is an aggregate data block created above the rows of data. The compression is achieved by storing the repeating the values only once in the HCC. Thus performance of Oracle Exadata is considerably better than row wise storage Oracle database. Hadoop on the other hand supports HDFS which is distributed file storage


Oracle is often the choice of database for Analytics where Oracle ERP systems are deployed. Oracle Exadata can meet OLAP workflow/ DSS requirements and has many Advanced Analytics options. More details can be seen at Oracle's Machine Learning and Advanced Analytics 12.2c and Oracle Data Miner 4.2 New Features.


Teradata is the choice of DB in case of pure OLAP workflows with its massively parallel processing capabilities especially when data volumes are high. Teradata is also the preferred choice in case of low latency analytics requirement where an RDBMS is still required However, it is losing market share as Teradata migration is a priority for most cost conscious CEOs due to its prohibitive year on year expense. Another reason for migration off Teradata is the adoption of new generation data analytics architecture with support for unstructured data.


The above sets the stage for Hadoop with its support for big data which can be structured or unstructured. It provides a platform for data streaming and analytics over large amounts of data coming from IOT sensors, social data from various platforms, weather data or spatial data. It is based off open source technologies and uses commodity hardware which is another attraction for many companies moving from Data warehouse to data lake ecosystem.



Thus, it is important to consider the Use case a database under consideration is designed to serve before deciding the best fit for your Big Data ecosystem. Making a decision solely on amount of data (Petabytes or terabytes) that need to be stored might not be accurate. The other factors that can influence one's decision might be your overall IT landscape/ preferred Infrastructure platform, developer skills, cost, future requirements which is specific to each individual organization. So though new age databases are opening new opportunities for data storage and usage, the traditional RDMS will most likely not go away in the near future.





Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

Please key in the two words you see in the box to validate your identity as an authentic user and reduce spam.

Subscribe to this blog's feed

Follow us on

Blogger Profiles