Genre

This page is a record of a work in progress. It describes what's underway and the methods being used, and will include work files along the way. The goal is to produce a record that makes it straightforward to replicate or extend the analysis undertaken here. The work described here involves contributions from D. Ryan, K. Kupu, and S. Sam of Mills College in Oakland, CA.

Description

During fall 2011, many Occupy tweets included shout-outs to, responses to, and direct queries of other Occupy sites by including hashtags like #occupyoakland in the tweet. In this project, we will parse a corpus of tweets, pulling out all such place mentions. This information will be massaged into several data files that can be analyzed in terms of who mentions what places, which places get mentioned together, how mentions trend over time, etc. Later, if we have a tool for coding tweet content, we can match these place mentions to those codes. And if we can triangulate the location of the tweet source (via geotag, querying the API for user location, or the content of the tweet), we can also look at where the mentions are happening.

Research-ish Questions

  1. A common question in the sociology of social movements and the sociology of information is "how does it diffuse or spread?" We tend to be interested in how fast, and via what paths, social behavior moves.
  2. Another common analytical move is to use co-mention data (that is, two things appearing in the same utterance, text, etc.) as a very approximate indicator that the two things are similar or related. Thus if #occupyoakland and #occupywichita frequently appear together in tweets but #occupyoakland almost never appears in a tweet with #occupyabilene then, as a first approximation, we might assume there is more of a connection between Oakland and Wichita than between Oakland and Abilene.
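As a concrete illustration of the co-mention idea, here is a minimal Python sketch (function name and the whitespace-token assumption about hashtags are ours, not part of the project's actual pipeline) that counts how often pairs of tags appear in the same tweet:

```python
from itertools import combinations
from collections import Counter

def comention_counts(tweets):
    """Count how often each pair of hashtags appears in the same tweet."""
    pairs = Counter()
    for text in tweets:
        # Assumption for this sketch: tags are whitespace-delimited tokens starting with '#'
        tags = sorted({tok.lower() for tok in text.split() if tok.startswith("#")})
        for a, b in combinations(tags, 2):
            pairs[(a, b)] += 1
    return pairs

sample = [
    "#occupyoakland stands with #occupywichita",
    "march today #occupyoakland #occupywichita",
    "#occupyabilene general assembly at noon",
]
print(comention_counts(sample)[("#occupyoakland", "#occupywichita")])  # 2
```

A high count for a pair is then read, as above, as a first-approximation indicator of a connection between the two places.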

Data and Tools

  1. Tweet data (some that we collected at Mills College using the Archivist desktop program and some from the R-Shief archive).
  2. Miscellaneous web resources for identifying hashtags and getting geographic coordinates for locations.
  3. Spreadsheet (I'm using Excel but most of what I'm doing should be possible in OpenOffice or Google Docs)
  4. Python
  5. NodeXL
  6. Gephi

Procedure

In fall 2011 we collected a large sample of tweets using the Archivist desktop program. Data collection and initial scrubbing and formatting will be described elsewhere. The end result is about 3.5 million tweet records in tab-delimited format:
user
tweet text
tweetID
yyyy (year)
mm (month)
dd (day)
hh (hour)
mm (minute)
timedata
The R-Shief data is similar though it is comma separated values (csv) format:
Twitter ID
Text
Image URL
mm/d/yyyy
Hour
Minute
Created At (mm/d/yyyy hh:mm:ss PM)
Geo
From User
From User ID
Language
To User
To User ID
Source
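Both formats can be read with Python's csv module. This sketch is ours: the function names are hypothetical, and the column indices are taken from the two field listings above (so they are only as reliable as those listings).

```python
import csv

def read_mills(path):
    """Read the tab-delimited Mills/Archivist file; columns per the listing above."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield {"user": row[0], "text": row[1], "tweet_id": row[2]}

def read_rshief(path):
    """Read the comma-separated R-Shief file; columns per the listing above."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f):
            # "From User" is the 9th column (index 8) in the listing above
            yield {"user": row[8], "text": row[1], "tweet_id": row[0]}
```

Using generators keeps memory use flat, which matters with ~3.5 million records.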

Step One: Create a Catalog of Place Hashtags
A simple Python program was used to read tweet texts and extract all hashtags (several thousand). Here's the pseudo-code version of the program:

  1. open file
  2. while there is still more data
    1. read a line
    2. split the line at the tabs
    3. while there are more hashtags
      1. add hashtag to our list if it's a new one
  3. save the list of hashtags
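The pseudo-code above might look like the following in Python. The function name is ours, and we assume (per the record structure listed earlier) that the tweet text is the second tab-delimited column:

```python
import re

HASHTAG = re.compile(r"#\w+")

def extract_hashtags(path):
    """Return the sorted set of unique hashtags found in a tab-delimited tweet file."""
    tags = set()                                   # "add hashtag to our list if it's a new one"
    with open(path, encoding="utf-8") as f:
        for line in f:                             # "while there is still more data: read a line"
            fields = line.rstrip("\n").split("\t")  # "split the line at the tabs"
            if len(fields) > 1:
                tags.update(t.lower() for t in HASHTAG.findall(fields[1]))
    return sorted(tags)                            # "save the list of hashtags"
```

Writing the returned list to a file, one tag per line, makes the manual inspection step easy.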

The resulting list was then inspected manually, keeping only hashtags that in some way referred to places.

Step Two: Create an "Actual Place" Concordance for the Hashtags

Some hashtags scream their geography loud and clear (e.g., "#occupystlouismissouri"), others require some decoding (e.g., "#occupyfs" for Finsbury Square in London), and many contain typos and/or are just plain cryptic (e.g., #occupoychi). We want our concordance to keep track of all of these. It will be our way of translating misspellings and hashtag synonyms into unambiguous geographic names.

After the hashtags are decoded we add other levels of geographic information (again, mostly manually) so that we have city, state, and country (as appropriate -- sometimes the tag refers to a whole state or country).
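In code, the concordance can be treated as a simple mapping from (possibly misspelled) tags to geographic names. This sketch is illustrative only: the entries shown come from the examples above, and the function name is ours.

```python
# Illustrative fragment of the concordance; the real file has many more entries.
CONCORDANCE = {
    "#occupystlouismissouri": {"city": "St. Louis", "state": "Missouri", "country": "USA"},
    "#occupyfs":              {"city": "London",    "state": None,       "country": "UK"},
}

def place_for(tag):
    """Translate a place hashtag into geographic names; None if not yet decoded."""
    return CONCORDANCE.get(tag.lower())
```

Tags that come back None (like the still-cryptic #occupoychi) are exactly the ones that need further manual decoding.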

Step Three: Grab Longitude and Latitude Data for Places
Here we grab various sources of long/lat data from the web, copy and paste them into a spreadsheet, clean them up a bit, and then use a lookup function to match place names in our concordance to the long/lat data. We are bedeviled by typos and by the fact that there is no single best source of long/lat data for cities, states, and countries. The final step was a manual search for the places we were unable to match automatically. The resulting file has 799 records with this record structure:
tag
note
country
state
lookupname
long
lat
hashtagplaces.csv
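The spreadsheet lookup step can also be sketched in Python. Everything here is an assumption for illustration: the function names, the dict shapes, and the idea of normalizing on a lowercased "lookupname" key.

```python
def normalize(name):
    """Crude normalization to blunt the typo problem described above."""
    return name.strip().lower()

def attach_coords(places, coords):
    """Attach long/lat to each place record that matches; return the rest.

    places: list of dicts each having a 'lookupname' key
    coords: dict mapping normalized place name -> (long, lat)
    """
    unmatched = []
    for p in places:
        key = normalize(p["lookupname"])
        if key in coords:
            p["long"], p["lat"] = coords[key]
        else:
            unmatched.append(p)  # candidates for the manual-search step
    return unmatched
```

The returned unmatched list corresponds to the places that had to be resolved by hand.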

The following tags look like they might refer to locations but have not yet been identified.

Step Four: Read and Parse Tweets

We will read tweets and catch retweets, mentions, hashtags in general, and location hashtags. We'll describe how to detect each of these in turn.

A note on retweets. There is no single standard convention for retweets. We see some with "RT" followed by a user name and a colon, some with "via," some with "MT" (for modified tweet), and various combinations of these. Sometimes a retweet of a retweet is marked as such and sometimes it is not. Finally, there are retweets that do not indicate that they are retweets at all. All told, this means our initial counts are probably undercounts, and we would need several extra processing steps to match tweets with retweets and chain them together. More on this later.
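The conventions just described can be approximated with regular expressions. This is a sketch under our own assumptions (names and patterns are ours), and, as noted above, it will necessarily miss unmarked retweets:

```python
import re

# Approximate patterns for the conventions described above: "RT @user", "MT @user",
# and "via @user". These will undercount, since some retweets carry no marker at all.
RT_RE      = re.compile(r"\b(?:RT|MT)\s+@(\w+)|\bvia\s+@(\w+)", re.IGNORECASE)
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#\w+")

def parse_tweet(text):
    """Pull out the apparent retweeted user, mentions, and hashtags from one tweet."""
    rt = RT_RE.search(text)
    return {
        "retweet_of": next((g for g in rt.groups() if g), None) if rt else None,
        "mentions": MENTION_RE.findall(text),
        "hashtags": [t.lower() for t in HASHTAG_RE.findall(text)],
    }
```

Location hashtags are then just the parsed hashtags that appear in the concordance from Step Two.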