Wordcount with Pig

Generating a word count using Pig is fairly simple.

First, load the data file into HDFS. Make a note of the location; you will need it in the next step. In the example below, the original data file is a text file (document_text.txt).
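One way to do the upload without leaving Pig is the fs command in the Grunt shell. This is a minimal sketch: it assumes document_text.txt sits in your local working directory and that the /user/hue directory already exists in HDFS.

```pig
-- Run from the Grunt shell (start it with the pig command).
-- Assumption: document_text.txt is in the local working directory
-- and /user/hue already exists in HDFS.
fs -put document_text.txt /user/hue/document_text.txt
-- Confirm the file arrived and note its full path for the load step.
fs -ls /user/hue
```

The same upload can be done from a regular terminal or through the Hue file browser; the only thing the script needs is the final HDFS path.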

Second, write a Pig script that reads the data, counts the words, and stores the result in a new HDFS location (wordcount).

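-- Read each line as one chararray field (the default loader would split lines on tabs).
a = load '/user/hue/document_text.txt' using TextLoader() as (line:chararray);
-- Split each line into words and emit one word per record.
b = foreach a generate flatten(TOKENIZE(line)) as word;
-- Collect identical words, then count the records in each group.
c = group b by word;
d = foreach c generate COUNT(b), group;
-- Write (count, word) records; the output path must not already exist.
store d into '/user/hue/wordcount';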
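TOKENIZE splits only on whitespace and a few characters (double quote, comma, parentheses, star), so "The" and "the", or "word" and "word.", are counted as different words. If that matters for your data, a possible variation is to normalize each line first. The sketch below is one way to do it, not the only one; the relation names and the output path wordcount_normalized are made up for illustration.

```pig
-- Variation: lower-case each line and turn every run of characters
-- other than a-z and 0-9 into a single space before tokenizing.
lines = load '/user/hue/document_text.txt' using TextLoader() as (line:chararray);
clean = foreach lines generate REPLACE(LOWER(line), '[^a-z0-9]+', ' ') as line;
words = foreach clean generate flatten(TOKENIZE(line)) as word;
grouped = group words by word;
counts = foreach grouped generate COUNT(words) as total, group as word;
-- Hypothetical output path, chosen so it does not clash with wordcount.
store counts into '/user/hue/wordcount_normalized';
```

Note that this simple pattern also breaks contractions apart ("don't" becomes "don" and "t") and drops non-ASCII letters, so adjust the regular expression to suit your text.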

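At this point you have tab-delimited output in /user/hue/wordcount, with the count in the first column and the word in the second, that you can manipulate as you please. For example, you could create a table over it using HCatalog and analyze it using Hive.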
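You can also keep working in Pig. As a small illustration, the following sketch reads the stored output back in and prints the ten most frequent words. It assumes the output was written by the script above with the default tab delimiter; the relation and field names are chosen here for readability.

```pig
-- Load the stored output; each record is a tab-separated (count, word) pair.
wc = load '/user/hue/wordcount' as (total:long, word:chararray);
-- Sort by count, highest first, and keep the first ten records.
sorted = order wc by total desc;
top10 = limit sorted 10;
-- Print the result to the console instead of storing it.
dump top10;
```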