

MapReduce

MapReduce is a programming model invented by Google for processing web-scale data (see Google’s original MapReduce paper here). Web applications like Facebook, the Google search engine, and Flickr produce and consume huge amounts of data. For example, Facebook generates around 2 petabytes of data every day (as per the article here); processing that much data requires the computing power of hundreds of CPUs and can take several hours. MapReduce aims to simplify this by automatically parallelizing the processing and hiding the details of parallelization, fault tolerance, and load balancing from the user. MapReduce is mainly suited to processing data in batch mode and is not suited for interactive applications: for example, offline processing of web logs to generate information about users, advertisers, and publishers, or processing archived data stored on tapes to generate reports and charts.

MapReduce drew its inspiration from the list-processing primitives (Map and Reduce) found in functional programming languages like Lisp (read about functional programming here). Users of MapReduce write two functions: Map and Reduce. The Map function operates on a set of key/value pairs (usually the key is a file name and the value is the contents of that file) and generates an intermediate set of key/value pairs. The Reduce function operates on these intermediate key/value pairs and merges the values for the same key into a final output.
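
As a minimal illustrative sketch of those functional primitives, here is Python’s built-in map and functools.reduce — the language-level cousins of the MapReduce operations, not the MapReduce library itself:

    from functools import reduce

    # map applies a function to every element of a list;
    # reduce folds a list of values into a single result.
    numbers = [1, 2, 3, 4, 5]
    squares = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16, 25]
    total = reduce(lambda acc, x: acc + x, squares, 0)  # 55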

For example, suppose you have to count the occurrences of different URLs across a collection of thousands of web logs. The Map function, applied to each web log, can be written as below:

map(string key, string value):
    // key   : file name
    // value : contents of the file
    for each URL found in value:
        Emit(URL, 1)        // emit an intermediate <key, value> pair

Here, for each URL found in the input file, the Map function emits a <URL, 1> pair. For example, if “www.yahoo.com” is found 5 times in web log 1, then <www.yahoo.com, 1> is emitted 5 times.
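
A sketch of this Map function in Python (the URL regular expression is a crude stand-in; real web logs would need a parser matched to their format):

    import re

    URL_RE = re.compile(r'https?://\S+')   # crude URL matcher, for illustration only

    def map_fn(key, value):
        # key   : file name
        # value : contents of the file
        # Emits an intermediate (URL, 1) pair for every URL occurrence.
        for url in URL_RE.findall(value):
            yield (url, 1)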

The MapReduce library takes care of accumulating all the values for the same key and sending them to one or more Reducers. The Reduce function for this problem can be written as below:

reduce(string key, Iterator values):
    // key    : intermediate key, e.g. www.yahoo.com
    // values : all values emitted for that key, e.g. URL counts
    int result = 0;
    for each value present in values:
        result += (int) value;    // accumulate the URL count
    Emit(result)                  // final count for the URL represented by key

For example, if <”www.yahoo.com”, 1> is emitted 100 times by different Mappers, the Reduce function adds up all the intermediate values and arrives at a final URL count of 100 for “www.yahoo.com”.
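
Putting the two halves together in Python, a toy driver can simulate what the MapReduce library does, including the grouping (“shuffle”) of intermediate values by key between the two phases (map_fn is the sketch shown earlier; all names here are illustrative):

    from collections import defaultdict

    def reduce_fn(key, values):
        # key    : intermediate key, e.g. a URL
        # values : all counts emitted for that key
        yield (key, sum(values))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        groups = defaultdict(list)
        for key, value in inputs:                 # Map phase
            for k, v in map_fn(key, value):
                groups[k].append(v)               # shuffle: group values by key
        results = {}
        for k, vs in groups.items():              # Reduce phase
            for out_key, out_val in reduce_fn(k, vs):
                results[out_key] = out_val
        return results

    logs = [("log1", "http://www.yahoo.com and http://www.google.com"),
            ("log2", "http://www.yahoo.com again")]
    print(run_mapreduce(logs, map_fn, reduce_fn))
    # {'http://www.yahoo.com': 2, 'http://www.google.com': 1}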

The following diagram illustrates the operation of the Map and Reduce functions (for simplicity only one Reducer is shown; in practice there will be multiple Reducers operating in parallel).
 

[Diagram: MapReduce data flow from Mappers to Reducer (MapReduce.JPG)]

To process a massive data set, a set of Mappers (each executing the same Map function), each operating on a subset of the data and running on a different machine, produces an intermediate data set. A set of Reducers then operates on the data produced by the different Mappers to produce the final desired result. The key to achieving parallelization is to divide the input data into ‘n’ fragments, each of which is processed in parallel by one of ‘n’ different Mappers (as opposed to a single process operating on the entire data set, which could take a huge amount of time). The intermediate output of the ‘n’ Mappers is fed to one or more Reducers, which, operating in parallel, produce the end result.
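
A toy, single-machine sketch of this decomposition using Python’s multiprocessing module (a stand-in for the cluster scheduling a real MapReduce library performs; the fragment count and input here are made up for illustration):

    from multiprocessing import Pool
    from collections import defaultdict

    def count_fragment(urls):
        # One Mapper: counts URLs within its own fragment
        # (the local counting also acts as a combine step).
        counts = defaultdict(int)
        for url in urls:
            counts[url] += 1
        return dict(counts)

    def parallel_url_count(urls, n=4):
        fragments = [urls[i::n] for i in range(n)]   # split input into n fragments
        with Pool(n) as pool:
            partials = pool.map(count_fragment, fragments)
        totals = defaultdict(int)                    # a single Reducer merges
        for partial in partials:                     # the intermediate outputs
            for url, c in partial.items():
                totals[url] += c
        return dict(totals)

    if __name__ == "__main__":
        urls = ["a.com", "b.com", "a.com", "c.com"] * 1000
        print(parallel_url_count(urls))   # {'a.com': 2000, 'b.com': 1000, 'c.com': 1000}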

The MapReduce library lets users specify the number of Mappers and Reducers. Usually the number of Mappers equals the number of data fragments, and the number of Reducers is typically smaller than the number of Mappers, since the Mappers will already have filtered out unwanted data. Also, as opposed to other distributed data processing libraries like MPI, where data is moved nearer to the computation (i.e. data is shipped between machines for processing), MapReduce moves the computation nearer to the data (i.e. the executables are copied to the machines where the data resides). This makes processing large data sets efficient, because copying large data between machines is expensive. The distributed file system used by the MapReduce library automatically splits huge data sets into several smaller pieces that are spread evenly across a cluster of machines. Google’s implementation of MapReduce uses the Google File System (GFS) (refer to the earlier blog on GFS here). Hadoop, an open-source implementation of MapReduce, uses the Hadoop Distributed File System (HDFS).

To explore MapReduce, you can download and experiment with Apache’s Hadoop MapReduce implementation, available here.
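
For a hands-on start, Hadoop’s Streaming interface lets you supply the Map and Reduce steps as plain scripts that read standard input and write tab-separated key/value lines. Below is a Python sketch of the URL-count job (the script names are illustrative, and the exact path of the streaming jar varies by Hadoop version):

    # mapper.py -- reads raw log lines, emits one "URL<TAB>1" line per occurrence
    import re
    import sys

    URL_RE = re.compile(r'https?://\S+')
    for line in sys.stdin:
        for url in URL_RE.findall(line):
            print("%s\t1" % url)

    # reducer.py -- Hadoop sorts the Map output by key, so all counts
    # for one URL arrive on adjacent lines
    import sys

    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

A typical invocation would look roughly like: hadoop jar <path-to>/hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py.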
