Save tweets to file using Python

Below is Python code to search Twitter and export the results to a CSV file. This is a good way to build up a database of text to use with Hadoop. I’ll be writing posts on analyzing Twitter data in the future.

The code contains the following variables:

TWITTER_APP_KEY
TWITTER_APP_KEY_SECRET
TWITTER_ACCESS_TOKEN
TWITTER_ACCESS_TOKEN_SECRET

You must obtain these by going to https://dev.twitter.com/apps

This code also uses the twython Python module, so you must install this as well.

Here is the code. I’m far from a Python expert (or even a novice), so please provide any improvements in the comments.

from twython import Twython, TwythonError
import csv
import sys

# Supply the appropriate values from https://dev.twitter.com/apps
TWITTER_APP_KEY = ''
TWITTER_APP_KEY_SECRET = ''
TWITTER_ACCESS_TOKEN = ''
TWITTER_ACCESS_TOKEN_SECRET = ''

# Put the words you want to search for here
harvest_list = ['#snow']

twitter = Twython(app_key=TWITTER_APP_KEY,
                  app_secret=TWITTER_APP_KEY_SECRET,
                  oauth_token=TWITTER_ACCESS_TOKEN,
                  oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)

with open("tweetfile.csv", "w", newline="", encoding="utf-8") as outfile:
    c = csv.writer(outfile)
    for tweet_keyword in harvest_list:
        try:
            # Search for the current keyword (Twitter caps how many tweets come back per request)
            search_results = twitter.search(q=tweet_keyword, count="500", lang='en')
        except TwythonError as e:
            print(e)
            continue
        for tweet in search_results['statuses']:
            try:
                # Double single quotes and drop semicolons so the text loads cleanly later
                c.writerow([tweet['text'].replace("'", "''").replace(';', '')])
            except Exception:
                print("Unexpected error:", sys.exc_info()[0])

Wordcount with Pig

Generating a word count using Pig is fairly simple.

First, load the data file into HDFS. Make note of the location; you will need it in the next step. In the example below the original data file is a text file (document_text.txt).
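
If you are working from a shell, hdfs dfs -put does this in one step. Just as a sketch, the same copy can be done from Python with the subprocess module (assuming the hdfs command-line client is installed and on the PATH, and that document_text.txt is in the current directory):

import subprocess

# Copy the local file into HDFS so Pig can read it.
# /user/hue/document_text.txt is the location used in the Pig script below.
subprocess.check_call([
    "hdfs", "dfs", "-put",
    "document_text.txt",            # local file
    "/user/hue/document_text.txt"   # destination in HDFS
])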

Second, write a Pig script to read the data, count the words, and store the result in a new data file (wordcount).

a = load '/user/hue/document_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/wordcount';

At this point you have a data file you can manipulate as you please. For example, you could create a table using HCatalog and analyze it using Hive.

Baseball Data Archive

As I work through the Hortonworks tutorials, I have come across a great source of data. Sean Lahman publishes a huge archive of baseball data going back all the way to 1871, for free. I can’t believe I haven’t come across this before. Not only is it great data to use when practicing with Hadoop, it is a great resource for baseball fans. It looks like he has a lot of other great information available on the site as well.

What is Pig?

Pig is a high-level language for producing the MapReduce jobs needed to process data.

It is made up of:

Pig Latin – High level scripting language, translated to MapReduce jobs

Grunt – Interactive shell

Piggybank – Shared repository for User Defined Functions (UDFs)

We use the HCatLoader() function within a Pig script to read data from HCatalog. For example:

a = LOAD 'master_data' USING org.apache.hcatalog.pig.HCatLoader();
b = LOAD 'detail_data' USING org.apache.hcatalog.pig.HCatLoader();

Joins are similarly simple:

c = JOIN a BY id, b BY id;

(where id is the common data field in both tables)

Data is not loaded or transformed until there is a DUMP or STORE command:

DUMP c;

What is HCatalog?

HCatalog is a table and storage management layer that sits between HDFS and the different tools used to process the data (Pig, Hive, MapReduce, etc.). It presents users with a relational view of the data. It can be thought of as a data abstraction layer.

One of the advantages of using HCatalog is that the user does not have to worry about what format the data is stored in. The data can be plain text, RCFile format, etc. The user also does not need to know where the data is stored.

Data in HCatalog is stored in tables, and these tables can be grouped into databases.

HCatalog also has a CLI. For example:

hcat.py -f myscript.hcatalog (Executes the myscript script file)

To create a new table in HCatalog via the GUI (these steps use the Hortonworks Sandbox):

1) Select “Create new table from file”

2) Enter the table name

3) Click “Choose a file” and browse to the data file in HDFS

4) Change the file options (delimiters, encoding, etc), if necessary

5) In the table preview, change the column names or column data types, if necessary

6) Click “Create Table”

What is Apache Hive?

Hive is an Apache project that provides a data warehouse view of the data in HDFS. It allows the data to be queried using a SQL-like language (HiveQL). There are other ways to query HDFS data (e.g., Pig), but Hive is popular because of its resemblance to SQL.

Hive inherits schema and location information from HCatalog, meaning this information does not need to be provided to Hive directly. Without HCatalog, the table definitions and schema information (including location) would need to be created using HiveQL.

Hive can be accessed through a web interface, a command-line interface, and from HDInsight.

Hive statements look very similar to (if not exactly the same as) SQL. For example, in Hive we can write:

select * from table_data

Joins can also be made across multiple tables, and the syntax again looks essentially the same as SQL.
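
For instance, assuming the master_data and detail_data tables from the Pig example above share an id column, and (hypothetically) have name and amount columns respectively, a join could be written as:

select a.name, b.amount
from master_data a join detail_data b on (a.id = b.id)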

What is MapReduce?

It is quickly becoming obvious that understanding MapReduce is critical to understanding Hadoop. It is a term I have heard in the past but never really dug into. So let’s jump in.

MapReduce is a programming framework designed to work with large sets of distributed data. I didn’t do too much research on the origins, but it was conceptually designed by Google. Since then it has been implemented in many different ways, and it is a core concept within Hadoop.

From Google:

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

MapReduce: Simplified Data Processing on Large Clusters

As the name implies, MapReduce can be thought of as two distinct processes: the mapping and the reducing.

Mapping refers to taking a set of data and mapping it into another set of data. Individual elements from the data set are mapped into key/value pairs. For example, if the data set were a large document, you could iterate through each word and map each word (key) to a count (value). At this point each word stands on its own, so multiple occurrences of the same word are separate key/value pairs. Let’s look at an example:

It was the best of times, it was the worst of times…

A map of this would look like this:

It, 1
was, 1
the, 1
best, 1
of, 1
times, 1
it, 1
was, 1
the, 1
worst, 1
of, 1
times, 1

The reduce portion then sums the values for each distinct key (treating “It” and “it” as the same word), collapsing them into one “reduced” key/value pair per key. This leaves us with:

it, 2
was, 2
the, 2
best, 1
of, 2
times, 2
worst, 1

This is an extremely simple example, but I believe this is the core idea.
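
To make the two phases concrete, here is a minimal Python sketch of the same word count. This is plain Python rather than actual Hadoop MapReduce code, but the map step emits (word, 1) pairs and the reduce step sums the values for each key, just as described above:

from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input text.
    for word in text.lower().replace(',', '').split():
        yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the values associated with each distinct key (word).
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

text = "It was the best of times, it was the worst of times"
print(reduce_phase(map_phase(text)))
# {'it': 2, 'was': 2, 'the': 2, 'best': 1, 'of': 2, 'times': 2, 'worst': 1}

In a real Hadoop job the map and reduce functions run in parallel across the cluster, with Hadoop handling the grouping of intermediate keys between the two phases.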