Infosys’ blog on industry solutions, trends, business process transformation and global implementation in Oracle.

Alternatives to Structured Analytics--Big Data and Hadoop:

What it is:

Traditionally, tools such as SQL databases and flat files were used to store and manage data. With the rise of social media, unstructured content has driven an exponential increase in the volume of data. At this scale, traditional tools struggle to store, retrieve and manipulate data, and so Big Data evolved. 'Big Data' comprises the techniques and technologies for storing, retrieving, distributing, managing and analyzing extremely large datasets that arrive at high velocity and in different structures. It can manage data in any form, viz. structured, semi-structured or unstructured, which traditional data management methods cannot handle. Data today is generated from a wide variety of sources and must be integrated with different systems at varying rates. To process large volumes of data efficiently and at lower cost, parallelism is an integral component of Big Data. Precisely speaking, the term Big Data refers to a data set whose scale, diversity, complexity and integrity are managed by a new and robust architecture, whose techniques, algorithms and analytics are used to manage the data and extract hidden knowledge from it.

Hadoop serves as a core platform for structuring Big Data, and it also solves the problem of making Big Data useful for analytical purposes. Hadoop is open source software that enables reliable and scalable distributed computing for huge data sets across clusters of server nodes. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Hadoop is designed to support applications written in high-level languages such as Java, C++ and Python.

How it Works:

In Hadoop, data is distributed across different nodes, and the initial processing of the data happens locally on the node that holds it. This requires minimal communication between nodes, because the computation is moved to the data rather than the data to the computation. The data can also be replicated multiple times across the cluster, which makes Hadoop a better solution to the problems of Big Data. Hadoop is a software programming framework that supports the processing of large data sets in a highly distributed computing environment. In a data warehouse setting, the Hadoop framework consolidates data from different sources in one place as the Big Data platform, which can then be accessed by OLAP servers. The data warehouse serves the analytics needs of the users, and all consolidated reporting happens on it.
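The idea of processing data where it lives can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop code: the node names, the sample readings and the helper functions `local_summary` and `combine` are all hypothetical. Each "node" reduces its own partition to a small (sum, count) pair, and only those pairs travel to the coordinator.

```python
from typing import Dict, List, Tuple

# Hypothetical node names and readings; plain lists stand in for data
# stored on separate cluster nodes.
node_data: Dict[str, List[float]] = {
    "node-1": [4.0, 8.0, 6.0],
    "node-2": [10.0, 2.0],
    "node-3": [5.0, 7.0, 9.0, 3.0],
}


def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs 'on the node': shrink the local partition to a (sum, count) pair."""
    return sum(values), len(values)


def combine(summaries: List[Tuple[float, int]]) -> float:
    """Runs on a coordinator: merge the small summaries into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count


# Only one small tuple per node crosses the network, not the raw data.
summaries = [local_summary(values) for values in node_data.values()]
global_mean = combine(summaries)
print(global_mean)
```

However large each partition grows, the coordinator still receives one pair per node, which is why local processing keeps inter-node traffic low.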

What runs behind the scene:

Hadoop is designed and developed around the MapReduce programming model, which is based on the principle of parallel processing across different nodes that together are referred to as a cluster. An individual machine is called a node, and hundreds of machines together form a Hadoop cluster. In a Hadoop cluster, data is distributed across the nodes in small parts known as blocks. The block size defaulted to 64 MB in early Apache Hadoop releases (the default is 128 MB from Hadoop 2.x onward). A single file is split into blocks; each block is stored on one server, and copies of it are saved on other servers in the cluster. This allows map and reduce functions to work on small subsets of data that are in turn part of a larger dataset. MapReduce is the key component of Hadoop: it provides the scalability across multiple servers in a Hadoop cluster that Big Data processing needs.
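To make the map and reduce steps concrete, here is a minimal single-process sketch of the MapReduce pattern, counting words. It does not use the Hadoop API; the functions `map_phase`, `shuffle` and `reduce_phase` and the sample blocks are assumptions made for illustration. In a real cluster the map calls would run in parallel on the nodes holding each block.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(block: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for one block of a larger file (hypothetical data).
blocks = ["big data needs hadoop", "hadoop stores big data", "data is big"]

# Every block is mapped independently, which is what makes the step parallel.
mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because `map_phase` looks at only one block at a time, adding more blocks (and more nodes to hold them) does not change the logic, only the degree of parallelism.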

Hadoop also includes a fault-tolerant storage mechanism named the Hadoop Distributed File System, or HDFS. HDFS is designed to store high volumes of data, to scale up incrementally, and to survive infrastructure failures without losing data. Put simply, Hadoop clusters can be built from inexpensive computer systems and still achieve fault tolerance: if one node fails, the overall system keeps operating in cluster mode without losing data or interrupting work, by shifting work to the other working nodes in the cluster.
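The effect of keeping block copies on several nodes can be illustrated with a toy model. The sketch below is not how HDFS is implemented: the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware), the node names and the 200 MB file are hypothetical, and the replication factor of 3 is assumed here because it is the HDFS default.

```python
import math
from typing import Dict, List, Set

BLOCK_SIZE_MB = 64  # block size cited in this post
REPLICATION = 3     # assumption: the HDFS default replication factor


def block_count(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(num_blocks: int, nodes: List[str],
                 replication: int = REPLICATION) -> Dict[int, Set[str]]:
    """Toy round-robin placement: each block goes to `replication` nodes."""
    return {
        block_id: {nodes[(block_id + r) % len(nodes)] for r in range(replication)}
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, Set[str]],
                           failed: Set[str]) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(replicas - failed for replicas in placement.values())


nodes = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical cluster
num_blocks = block_count(200)                     # a 200 MB file
placement = place_blocks(num_blocks, nodes)

print(num_blocks)
print(readable_after_failure(placement, {"node-2"}))
print(readable_after_failure(placement, {"node-1", "node-2", "node-3"}))
```

With three copies of each block, losing a single node leaves every block readable; data becomes unavailable only when all nodes holding a given block fail together.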
