Welcome to the world of Infosys Engineering! It is a half-a-billion-plus organization that takes pride in shaping our engineering aspirations and dreams and bringing them to fruition. We provide engineering services and solutions across the lifecycle of our clients' offerings, from product ideation to realization and sustenance, catering to a cross-section of industries: aerospace, automotive, medical devices, retail, telecommunications, hi-tech, financial services, and energy and utilities, to name a few major ones.


March 30, 2010

MapReduce

MapReduce is a programming model invented by Google for processing web-scale data (see Google's original paper on MapReduce here). Web applications like Facebook, the Google search engine, and Flickr produce and consume huge amounts of data. For example, Facebook generates around 2 petabytes of data every day (as per the article here), and processing such a volume of data requires the computing power of hundreds of CPUs and can take several hours. MapReduce aims to simplify this by automatically parallelizing the processing of the data and hiding the details of parallelization, fault tolerance and load balancing from the user. MapReduce is mainly suited to processing data in batch mode, not to interactive applications: for example, offline processing of web logs to generate information about users, advertisers and publishers, or processing archived data stored on tape to generate reports and charts.

MapReduce drew its inspiration from the list-processing primitives (Map and Reduce) present in functional programming languages like Lisp (read about functional programming here). Users of MapReduce need to write two functions, Map and Reduce. The Map function operates on a set of key/value pairs (typically the key is a file name and the value is the contents of that file) and generates an intermediate set of key/value pairs. The Reduce function operates on these intermediate key/value pairs and merges the values for the same key into a final output.
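In code, the two functions have roughly the following shapes. This is a schematic sketch in Python type hints, not any particular library's API; the concrete key and value types shown are assumptions tailored to the URL-count example below:

    from typing import Callable, Iterable, Iterator, Tuple

    # map:    (k1, v1)         -> stream of intermediate (k2, v2) pairs
    # reduce: (k2, [v2, ...])  -> merged final value for that key
    MapFn = Callable[[str, str], Iterator[Tuple[str, int]]]
    ReduceFn = Callable[[str, Iterable[int]], int]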

For example, let's assume that you have to count the occurrences of each URL in a collection of thousands of web logs. The Map function, which is applied to each web log, can be written as below:

map(String key, String value):
    // key   : name of the web log file
    // value : contents of the file
    for each URL found in value:
        Emit(<URL, 1>)    // emit an intermediate <key, value> pair

Here, for each URL found in the input file, the Map function emits a <URL, 1> pair. For example, if "www.yahoo.com" is found 5 times in web log 1, then <www.yahoo.com, 1> is emitted 5 times.

The MapReduce library takes care of accumulating all the values for the same key and sending that information to one or more Reducers. The Reduce function for this problem can be written as below:

reduce(String key, Iterator values):
    // key    : an intermediate key, e.g. www.yahoo.com
    // values : the values emitted for that key, e.g. URL counts
    int result = 0
    for each value in values:
        result += (int) value    // accumulate the URL count
    Emit(result)                 // final count for the URL represented by key

For example, if <www.yahoo.com, 1> is emitted 100 times by different Mappers, the Reduce function arrives at a final URL count of 100 for www.yahoo.com after adding up all the intermediate values.
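To make the whole data flow concrete, below is a minimal single-machine sketch of the same URL-count job in Python. This is an illustration of the model only, not the Hadoop API: map_fn and reduce_fn are hypothetical names mirroring the pseudocode above, and the grouping step stands in for the shuffle that the MapReduce library performs between the two phases.

    from collections import defaultdict

    def map_fn(filename, contents):
        # Map: emit an intermediate <URL, 1> pair for each URL in the file.
        # (The key, the file name, is unused here, as in the pseudocode.)
        for url in contents.split():
            yield url, 1

    def reduce_fn(url, values):
        # Reduce: merge all intermediate counts for the same URL.
        return url, sum(values)

    # Toy input: two "web logs" (file name -> contents).
    logs = {
        "weblog1": "www.yahoo.com www.google.com www.yahoo.com",
        "weblog2": "www.google.com www.yahoo.com",
    }

    # Map phase, followed by the shuffle: group intermediate values by key.
    intermediate = defaultdict(list)
    for filename, contents in logs.items():
        for url, count in map_fn(filename, contents):
            intermediate[url].append(count)

    # Reduce phase: one call per distinct intermediate key.
    for url, values in sorted(intermediate.items()):
        print(reduce_fn(url, values))   # ('www.google.com', 2), ('www.yahoo.com', 3)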

The following diagram illustrates the operation of the Map and Reduce functions (for simplicity, only one Reducer is shown; in practice there will be multiple Reducers operating in parallel).
 

[Figure: MapReduce.JPG]

To process a massive data set, a set of Mappers (each executing the same Map function), each operating on a subset of the data and running on a different machine, produces an intermediate data set. A set of Reducers then operates on the data produced by the different Mappers to produce the final desired result. The key to achieving parallelization is to divide the input data into 'n' fragments, each of which is processed in parallel by one of 'n' different Mappers (as opposed to a single process operating on the entire data set, which could take a huge amount of time). The intermediate output of the 'n' Mappers is fed to one or more Reducers which, operating in parallel, produce the end result.
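As a rough single-machine illustration of this fragment-per-Mapper idea, the sketch below (hypothetical names, standard-library Python only) divides the input into 'n' fragments and runs one Mapper per fragment in a separate process. A real MapReduce library would distribute the fragments across machines rather than local processes.

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def map_fragment(fragment):
        # One Mapper: count the URLs in its own fragment of the input.
        counts = Counter()
        for url in fragment:
            counts[url] += 1
        return counts

    def run_job(urls, n):
        # Divide the input into n fragments, one per Mapper.
        fragments = [urls[i::n] for i in range(n)]
        with ProcessPoolExecutor(max_workers=n) as pool:
            partials = list(pool.map(map_fragment, fragments))  # Mappers run in parallel
        # A single Reducer merges the intermediate counts into the final result.
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        urls = ["www.yahoo.com", "www.google.com", "www.yahoo.com"] * 1000
        print(run_job(urls, n=4).most_common(2))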

The MapReduce library lets users specify the number of Mappers and Reducers. Usually the number of Mappers is equal to the number of data fragments, and typically the number of Reducers is smaller than the number of Mappers, as the Mappers will already have filtered out unwanted data. Also, as opposed to other distributed data-processing libraries like MPI, where data is moved nearer to the computation (i.e. data is moved between different machines for processing), MapReduce moves computation nearer to the data (i.e. the executables are copied to the machines where the data is present). This makes processing of large data sets efficient, since copying large data between machines is expensive. The distributed file system used by the MapReduce library automatically splits huge data sets into several smaller data sets which are spread evenly across a cluster of machines. Google's implementation of MapReduce uses the Google File System (GFS) (refer to the earlier blog on GFS here). Hadoop, the open-source implementation of MapReduce, uses the Hadoop Distributed File System (HDFS).
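A common mechanism for routing intermediate keys to a user-specified number of Reducers is a partition function, typically hash(key) mod R. Below is a minimal sketch (a hypothetical helper, not Hadoop's actual API); using a hash that is stable across runs and machines guarantees that every occurrence of a key, whichever Mapper emitted it, lands on the same Reducer:

    import hashlib

    def partition(key, num_reducers):
        # Map an intermediate key to a Reducer index in [0, num_reducers).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_reducers

    print(partition("www.yahoo.com", 4))   # same Reducer index every time for this key

Hadoop exposes the same idea through its Partitioner abstraction, whose default implementation hashes the key modulo the number of reduce tasks.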

To explore MapReduce, you can download and experiment with Apache's Hadoop implementation of MapReduce, available here.

March 18, 2010

Global Product Engineering & the “Core to Business” debate…

In a recent discussion with a panel of clients and industry experts on global engineering trends and growth, the question of product development being at the core of the business, and hence the challenges in globalizing it, once again drew some interesting views. This topic has popped up in many conversations and debates that I have had in various forums on the adoption of global engineering strategies.

The answer to this debate is not so obvious, given that products have been at the heart of revenue generation for most of the last century. But as the last two decades have shown us, global engineering has found more and more acceptance across industries in product development strategies.

The reasons for this change are multi-dimensional: the rise of services, the role of ICT, economic leverage, technological leverage, new markets… but the most important is the fact that the source of competitive advantage continues to evolve for all companies. Hence a blanket assumption that all product engineering activities are core does not hold good. Products are now bundled with services and require multi-disciplinary engineering for their development, while markets continue to expand beyond home shores. This means there is a reconfiguration of the value chain, as newer skills are required to address these needs, which spawns partnering across the globe to develop and introduce new products. In this fast-changing environment it is important to build new competencies and maintain flexibility, because what was relevant five years ago may not necessarily carry us for the next five…

Interestingly, India is beginning to emerge as a global engineering hub as engineering services gain significance in product engineering strategy. This run-up has seen various factors that add dimensions to the "core to business" debate. These include not only the vast engineering talent available in the country but also the facts that:

- Companies are looking for products suited to this fast-growing market
- The application of new technologies, especially the growing role of electronics and embedded software, has increased many times over, requiring access to new skills
- The Indian engineering services industry has the benefit of learnings from other services industries, notably the recent success of IT and BPO
- An understanding of working in a multi-cultural environment has become part of the Indian work culture
- There is a move towards open innovation, and these new players bring in a new perspective

Thus, product engineering being core to the business holds true only when considered along the dimensions of business success in a given scenario. The need to leverage competencies that give products their edge necessitates working with partners in a more open environment. This means we have to overcome the barrier of traditional thinking that all product engineering activities are core in the reconfigured value chains. With good governance, effective change management and IP compliance, a global engineering strategy pays dividends…
