Hadoop Environment

It seems like one of the easiest ways to get up and running with a Hadoop environment is to use a pre-configured VirtualBox installation from Hortonworks.

The Mac download is 2.4 GB.

It comes with a dozen tutorials, so it should be a good learning tool.

Update: Downloaded and installed in less than 15 minutes.

What is MapReduce?

It is quickly becoming obvious that understanding MapReduce is critical to understanding Hadoop. This is a term I have heard in the past, but I never dug beneath the surface. So let’s jump in.

MapReduce is a programming model designed to work with large sets of distributed data. I didn’t do too much research on the origins, but it was introduced by Google. Since then it has been implemented in many different ways, and it is a core concept within Hadoop.

From Google:

Users specify a map function that processes a
key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key

MapReduce: Simplified Data Processing on Large Clusters

As the name implies, MapReduce can be thought of as two distinct processes, the mapping and the reducing.

Mapping refers to taking a set of data and transforming it into another set of data, in which individual elements are emitted as key/value pairs. For example, if the data set were a large document, you could iterate through each word and map each word (key) to a count of 1 (value). At this point each word stands on its own, so multiple occurrences of the same word are separate key/value pairs. Let’s look at an example:

It was the best of times, it was the worst of times…

A map of this, lowercasing each word so that “It” and “it” end up with the same key, would look like this:

it, 1
was, 1
the, 1
best, 1
of, 1
times, 1
it, 1
was, 1
the, 1
worst, 1
of, 1
times, 1
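
The map step above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Hadoop itself runs a mapper: the function name map_words is made up for this post, and lowercasing plus the simple regular expression used to split words are my own assumptions.

```python
import re

def map_words(text):
    """Emit one (word, 1) pair for every word in the text.

    Assumption: words are lowercased and split on anything that is not
    a letter or apostrophe, so "It" and "it" produce the same key.
    """
    for word in re.findall(r"[a-z']+", text.lower()):
        yield (word, 1)

# Every occurrence becomes its own pair; nothing is summed yet.
pairs = list(map_words("It was the best of times, it was the worst of times"))
for key, value in pairs:
    print(f"{key}, {value}")
```

Note that the mapper never looks at more than one word at a time, which is what makes it easy to spread the work across many machines.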

The reduce portion sums the values of all pairs from the map step that share the same key, producing one “reduced” key/value pair per word. This leaves us with:

it, 2
was, 2
the, 2
best, 1
of, 2
times, 2
worst, 1

This is an extremely simple example, but I believe this is the core idea.
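
The reduce step can be sketched the same way. Again, this is a minimal single-machine illustration under my own assumptions, not Hadoop's API: reduce_counts is a hypothetical name, and the intermediate pairs are simply typed in as a list. The grouping by key is done here with a dictionary; in a real cluster the framework takes care of bringing all values for the same key together.

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Group intermediate (key, value) pairs by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # One "reduced" pair per distinct key.
    return {key: sum(values) for key, values in grouped.items()}

# Intermediate pairs from the map step of the example sentence.
pairs = [
    ("it", 1), ("was", 1), ("the", 1), ("best", 1), ("of", 1), ("times", 1),
    ("it", 1), ("was", 1), ("the", 1), ("worst", 1), ("of", 1), ("times", 1),
]

counts = reduce_counts(pairs)
for key, total in counts.items():
    print(f"{key}, {total}")
```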

What is commodity hardware?

My definition from Wikipedia left me wanting to know more about commodity hardware. I had never heard the term before. It is pretty simple: basically just no-frills but functional machines. The term is usually used when speaking about computer clusters. The machines are typically very similar, meaning parts are interchangeable. They all run widely available, off-the-shelf software. Because none of the machines is specialized, one can go down without bringing down the cluster. This is important with Hadoop, as it is all about distributing the workload across multiple machines.

There are much better definitions out there, but this simple definition works for me.