It is quickly becoming obvious that understanding MapReduce is critical to understanding Hadoop. It's a term I had heard in the past but had never dug beneath the surface of. So let's jump in.
MapReduce is a programming model designed for processing large, distributed data sets. It originated at Google, and since then it has been implemented in many different ways; it is a core concept within Hadoop.
From Google:
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
MapReduce: Simplified Data Processing on Large Clusters
As the name implies, MapReduce can be thought of as two distinct phases: mapping and reducing.
Mapping refers to taking a set of data and transforming it into another set of data. Individual elements from the input are mapped into key/value pairs. For example, if the data set were a large document, you could iterate through each word, normalize it to lowercase, and map each word (key) to a count (value). At this point each word stands on its own, so multiple occurrences of the same word are separate key/value pairs. Let's look at an example:
It was the best of times, it was the worst of times…
A map of this sentence would look like this (a small code sketch of the step follows the example):
it, 1
was, 1
the, 1
best, 1
of, 1
times, 1
it, 1
was, 1
the, 1
worst, 1
of, 1
times, 1
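To make the step concrete, here is a minimal sketch in plain Python (this is not Hadoop's actual API; the function name, the lowercasing, and the punctuation stripping are my own illustration):

    def map_words(line):
        """Map step: emit a (word, 1) pair for every word in the line."""
        for word in line.lower().split():
            # Strip trailing punctuation so "times," and "times" are the same key.
            yield (word.strip(".,!?"), 1)

    pairs = list(map_words("It was the best of times, it was the worst of times"))
    # [('it', 1), ('was', 1), ('the', 1), ('best', 1), ('of', 1), ('times', 1),
    #  ('it', 1), ('was', 1), ('the', 1), ('worst', 1), ('of', 1), ('times', 1)]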
The reduce portion then groups these pairs by key and sums the values for each key into a single "reduced" key/value pair (again, a sketch follows the example). This leaves us with:
it, 2
was, 2
the, 2
best, 1
of, 2
times, 2
worst, 1
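And a matching sketch of the reduce step, continuing from the pairs produced above. This is again plain Python rather than any framework's API; in a real MapReduce system the grouping of values by key happens for you, during the shuffle between the two phases:

    from collections import defaultdict

    def reduce_counts(pairs):
        """Reduce step: merge all values that share the same key by summing them."""
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    print(reduce_counts(pairs))
    # {'it': 2, 'was': 2, 'the': 2, 'best': 1, 'of': 2, 'times': 2, 'worst': 1}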
This is an extremely simple example, but I believe it captures the core idea.