Sunday, June 3, 2012

A combiner works in a reducer as well as a mapper

In Hadoop, a combiner is usually introduced as a way to collapse the output records at the end of a map task, thereby reducing network traffic and alleviating the load on reducers. A less known feature that I recently discovered is that a combiner also runs in the reduce phase (observed on CDH3u3, based on 0.20.2). A reduce task, or more specifically its shuffle stage, usually kicks off before all maps finish. Once it has collected enough records from map tasks over the network (controlled by the "mapred.inmem.merge.threshold" property), the records are merged, and the combiner is called on them if one is specified in the job conf.

This is very useful when there are too many map tasks for the system to start simultaneously. Map outputs are collapsed by the combiner inside the reduce task in the meantime, so the final reduce run never has to work on more than "mapred.inmem.merge.threshold" unmerged records at once. Many commercial or third-party Hadoop distributions specifically advertise a similar feature under names like "proactive execution". It is great to know that plain vanilla Hadoop already has something like this.
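To make the mechanism concrete, here is a minimal toy model (plain Java, not actual Hadoop internals) of the reduce-side behavior described above: map outputs are buffered as they arrive, and whenever the buffer reaches a threshold (a stand-in for "mapred.inmem.merge.threshold", which defaults to 1000 in 0.20.x), a word-count-style combiner collapses them early, so the final reduce sees far fewer records. The class and method names here are illustrative, not Hadoop APIs.

```java
import java.util.*;

// Toy model of the reduce-side in-memory merge: buffered map outputs are
// combined whenever the buffer reaches a threshold (a stand-in for the
// mapred.inmem.merge.threshold property), so the final reduce works on
// far fewer records than actually arrived over the network.
public class CombinerInShuffle {
    static final int THRESHOLD = 4; // illustrative; real default is 1000

    // A word-count-style combiner: sum the counts for each key.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records)
            merged.merge(r.getKey(), r.getValue(), Integer::sum);
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
        Map<String, Integer> combinedSoFar = new TreeMap<>();

        // Simulated stream of (word, 1) map outputs arriving during shuffle.
        String[] keys = {"a", "b", "a", "a", "b", "c", "a", "c"};
        for (String k : keys) {
            buffer.add(new AbstractMap.SimpleEntry<>(k, 1));
            if (buffer.size() >= THRESHOLD) {         // threshold reached:
                combine(buffer).forEach((key, v) ->    // run the combiner early
                        combinedSoFar.merge(key, v, Integer::sum));
                buffer.clear();
            }
        }
        // Flush whatever is left in the buffer at the end of the shuffle.
        combine(buffer).forEach((key, v) ->
                combinedSoFar.merge(key, v, Integer::sum));

        // The final reduce now sees 3 combined records instead of 8.
        System.out.println(combinedSoFar); // prints {a=4, b=2, c=2}
    }
}
```

In a real job you simply set the combiner class on the job conf (e.g. with setCombinerClass); the framework then applies it both on the map side and, as described above, during the reduce-side merges.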
