Infosys’ blog on industry solutions, trends, business process transformation and global implementation in Oracle.


Hadoop Overview

 

This post covers the architecture of Hadoop, along with its advantages and limitations.

------------------------------------------------------------------------------------------------------------------------------------------

 

Let's first understand what Hadoop is.

Hadoop is an open source framework for storing and processing large data sets across clusters of computers.

Now let's understand why Hadoop is needed.

The problem with traditional database management systems is that they are designed for structured data and handle only relatively small amounts of data (gigabytes) well. Hadoop can handle structured, semi-structured and unstructured data, and it processes large amounts of data quickly by working in parallel.

The architecture of Hadoop has two main components:

1.       Hadoop Distributed File System - For Storing Data

2.       MapReduce - For Processing Data

[Figure: Hadoop architecture diagram (MRArchitecture.png)]

 

The Name Node is the master node. It manages the file system namespace and metadata, such as which Data Nodes hold each block of a file. In Hadoop 1.x it is the single point of failure in a Hadoop cluster. The Secondary Name Node keeps a checkpoint of the Name Node's namespace by periodically merging the edits file into the FSimage file; it is not a hot standby for the Name Node. Data Nodes are the slave nodes, which store the data blocks and perform the computations.

When a client writes a file to HDFS, the file is divided into blocks, and the blocks are distributed across the Data Nodes. By default, each block is replicated three times and stored on three different Data Nodes. When a job runs, its tasks are scheduled on the Data Nodes that hold the blocks. If one node goes down, the Name Node identifies the Data Nodes that hold the replicas, and processing continues on them. This makes Hadoop a fault-tolerant system.
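The block splitting and replication described above can be sketched in a few lines of Python. This is a toy model, not Hadoop's actual placement policy: the function names, the round-robin placement, the node names and the tiny block size are all assumptions made for illustration.

```python
BLOCK_SIZE = 4    # bytes per block; deliberately tiny (real HDFS blocks are far larger)
REPLICATION = 3   # copies kept of each block, as described above


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """For each block, list the nodes that still hold a copy after one node fails."""
    return {
        block_id: [node for node in holders if node != failed_node]
        for block_id, holders in placement.items()
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"hello hadoop")
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    print(placement)
    # Every block is still readable after losing one Data Node.
    print(surviving_replicas(placement, "dn3"))
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block, which is the property the fault tolerance relies on.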

Now let's discuss the Limitations of Hadoop

1.       Handling small files:

If you want to process a large number of small files, the Name Node needs to store the HDFS location of each file. This becomes an overhead for the Name Node, which is why Hadoop is not recommended for handling large numbers of small files.
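A rough back-of-the-envelope calculation shows why this overhead matters. The sketch below assumes about 150 bytes of Name Node memory per file or block entry; that figure is a commonly quoted rule of thumb and not a number from this post, and the function name and file counts are hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb cost per file or block entry


def namenode_metadata_bytes(num_files, blocks_per_file=1):
    """Estimate Name Node memory: one entry per file plus one per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


if __name__ == "__main__":
    # Ten million small files, each occupying a single block.
    small = namenode_metadata_bytes(10_000_000)
    # The same data packed into ten thousand larger single-block files.
    packed = namenode_metadata_bytes(10_000)
    print(small, packed, small // packed)
```

Under these assumptions, ten million small files need about 3 GB of Name Node memory, a thousand times more than the same data packed into ten thousand files.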

 

2.       Processing Speed:

To process large datasets, MapReduce runs a Map phase followed by a Reduce phase. The intermediate output of the mappers is written to local disk and the output of the reducers is written to HDFS, so a chain of jobs reads from and writes to disk at every step. This increases the number of I/O operations and reduces the processing speed.
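To make the Map and Reduce mechanism concrete, here is a minimal in-memory word count in plain Python. It only imitates the three stages (map, shuffle, reduce) and does not use the Hadoop API; the function names are hypothetical. The comments mark where a real job would materialize results to storage, which is where the extra I/O comes from.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group all values by key, as happens between the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    # In a real job, the mapper output is written out before the reducers read it.
    intermediate = list(map_phase(lines))
    # In a real job, the reducer output is written to HDFS.
    return reduce_phase(shuffle(intermediate))


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Here the intermediate list stays in memory, so chaining several such steps costs nothing extra; in MapReduce each step in a chain goes through storage instead.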

 

3.       Not Able to Handle Real-Time Stream Data:

Hadoop can process large batches of files very efficiently. When it comes to stream processing, however, Hadoop cannot handle data in real time.

 

4.       Not Easy to Code:

Developers need to write code for each operation they want to perform on the data, which makes Hadoop difficult to work with.

 

5.       Security:

By default, Hadoop does not enforce proper authentication for accessing the cluster, and it provides little information about who has accessed the cluster and what data each user has viewed. Security is one of the biggest drawbacks of Hadoop.

 

6.       Easy to Hack:

Hadoop is written in Java, a widely used language whose vulnerabilities have been heavily exploited by cyber criminals, which makes the system a more likely target for attacks.

 

7.       Caching:

Hadoop has no mechanism for caching intermediate results in memory for later use. As a result, performance suffers.

 

8.       Lines of Code:

Hadoop has about 120,000 lines of code, which makes it difficult to debug and execute.

 

9.       Unpredictability:

In Hadoop, we cannot guarantee the completion time of a job.

