MapReduce drew its inspiration from the list processing primitives (Map and Reduce) present in functional programming languages like Lisp (read about functional programming here). Users of MapReduce need to write two functions: Map and Reduce. The Map function operates on a set of key/value pairs (usually the key is a file name and the value is the contents of that file) and generates an intermediate set of key/value pairs. The Reduce function operates on these intermediate key/value pairs and merges the values for the same key into a final output.

For example, suppose you have to count the occurrences of different URLs in a collection of thousands of web logs. The Map function, applied to each web log, can be written as below:
map(string key, string value):
    // key: file name
    // value: contents of the file
    for each URL found in value:
        Emit(URL, 1);  // emit intermediate <key, value> pairs
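The pseudocode above can be sketched concretely in Python. This is a minimal illustration, not Hadoop's API: the function name `map_urls` and the URL regex are assumptions made for the example, and a real log parser would match the actual log format.

```python
import re

def map_urls(key, value):
    """Map function sketch: key is a file name, value is the file's contents.

    Yields an intermediate (URL, 1) pair for every URL found in the log.
    The regex below is a simplification for illustration.
    """
    for url in re.findall(r"https?://\S+", value):
        yield (url, 1)
```

For instance, `list(map_urls("access.log", "GET http://www.yahoo.com/ HTTP/1.1"))` yields a single intermediate pair for `http://www.yahoo.com/`.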
The MapReduce library takes care of accumulating all the values for the same key and sends that information to one or more Reducers. The Reduce function for this problem can be written as below:

reduce(string key, Iterator values):
    // key: intermediate key, e.g. www.yahoo.com
    // values: values emitted for the same key, e.g. URL counts
    int result = 0;
    for each value present in values:
        result += (int) value;  // accumulate the URL count
    Emit(result);  // final URL count for the URL represented by key

For example, if <"www.yahoo.com", 1> is emitted 100 times by different Mappers, the Reduce function adds up all the intermediate values and arrives at a final URL count of 100 for "www.yahoo.com".
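The Reduce step, too, can be sketched in Python. Again this is an illustrative stand-in for the pseudocode, not a real framework API; the name `reduce_counts` is an assumption for the example.

```python
def reduce_counts(key, values):
    """Reduce function sketch: key is a URL, values is an iterator over
    all the intermediate counts emitted for that URL by the Mappers."""
    result = 0
    for value in values:
        result += int(value)  # accumulate the URL count
    return result             # final URL count for this key
```

Replaying the example from the text, `reduce_counts("www.yahoo.com", iter([1] * 100))` merges the 100 intermediate values into the final count of 100.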
The following diagram illustrates the operation of the Map and Reduce functions (for simplicity only one Reducer is shown; in practice there will be multiple Reducers operating in parallel).
To process massive data sets, a set of Mappers (each executing the same Map function), each operating on a subset of the data and running on a different machine, produces an intermediate data set. A set of Reducers then operates on the data produced by the different Mappers to produce the final desired result. The key to achieving parallelization is dividing the input data into 'n' fragments, each of which is processed in parallel by one of 'n' different Mappers (as opposed to a single process operating on the entire data set, which could take a huge amount of time). The intermediate output of the 'n' Mappers is fed to one or more Reducers, which, operating in parallel, produce the end result.
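The fragment/Mapper/Reducer flow above can be sketched end to end in Python, using a thread pool to stand in for the cluster of machines. Everything here (the `count_urls` mapper, the `map_reduce` driver, the grouping step) is an illustrative assumption; a real MapReduce library runs the phases on separate machines and handles the shuffle for you.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def count_urls(fragment):
    """Mapper sketch: a fragment is a list of log lines; emit (URL, 1) pairs."""
    pairs = []
    for line in fragment:
        for token in line.split():
            if token.startswith("http://") or token.startswith("https://"):
                pairs.append((token, 1))
    return pairs

def map_reduce(fragments, n_workers=4):
    """Run one mapper per fragment in parallel, then shuffle and reduce."""
    # Map phase: one worker per fragment, standing in for machines in a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        mapped = pool.map(count_urls, fragments)
    # Shuffle phase: group all intermediate values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for url, one in pairs:
            grouped[url].append(one)
    # Reduce phase: merge the values for each key into a final count.
    return {url: sum(values) for url, values in grouped.items()}
```

For example, two fragments of log lines mentioning `http://a/x` twice and `http://b/y` once reduce to `{"http://a/x": 2, "http://b/y": 1}`.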
The MapReduce library lets users specify the number of Mappers and Reducers. Usually the number of Mappers is equal to the number of data fragments. The number of Reducers is typically smaller than the number of Mappers, as the Mappers will have already filtered out unwanted data. Also, as opposed to other distributed data processing libraries like MPI, where data is moved nearer to the computation (i.e. data is moved between machines for processing), MapReduce moves computation nearer to the data (i.e. the executables are copied to the machines where the data is present). This makes processing of large data sets efficient, since copying large amounts of data between machines is expensive. The distributed file system used by the MapReduce library automatically splits huge data sets into several smaller ones, spread evenly across a cluster of machines. Google's implementation of MapReduce uses the Google File System (GFS) (refer to the earlier blog on GFS here). Hadoop, the open source implementation of MapReduce, uses the Hadoop Distributed File System (HDFS).
To explore MapReduce, you can download and experiment with Apache's Hadoop MapReduce implementation, available here.