While I was at #scio14 last week I attended a session hosted by David Shiffman, better known as @whysharksmatter. The session was about using social media as a research tool, with the hashtag #scioResearch. In the session people talked about how to access twitter data, and I blithely remarked, "Just use the twitter API, it's pretty easy to access." I said, "Just buy your favorite programmer a beer, or better yet, buy me a beer and I'll do it."
Sadly, no one offered to buy me a beer :( . However, I figured I should eat my own dog food and accept the challenge anyway. Another attendee, Helen Chappell, said she was interested in using basic twitter metrics to understand what people were saying about the museum she works for, the North Carolina Museum of Natural Sciences. After the session I said I would give it a crack. So without further ado....
First, if you're thinking "I don't want to walk through someone else's code", enjoy my gist of it all. A caveat before we begin: this works because we're looking through a relatively small number of tweets. I haven't tried this on a large scale, but you probably won't grab all the tweets if there are 10,000+. I'll show this example in R, but there's no shortage of libraries out there for grabbing twitter data. I've grabbed tweets in both python and R, and while the R package is more user friendly, the python interfaces are more powerful (maybe more on that later). For this, though, we'll use the twitteR package to grab some data. Just a note: you'll need to go through the hassle of creating your own API key for any of this to work. The code below will authorize us to search twitter, and then grab our query of interest. In this case it's @naturalsciences, because we're interested in all tweets that mention the museum.
library(twitteR)

# Set SSL certs globally
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem",
                                                 package = "RCurl")))

#### This is all the code you need to authorize with the twitter API
#### and then grab tweets!
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "YOUR_KEY"
consumerSecret <- "YOUR_SECRET"
twitCred <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = reqURL,
                             accessURL = accessURL,
                             authURL = authURL)
twitCred$handshake()
registerTwitterOAuth(twitCred)

## Do the actual search
scioSearch <- searchTwitter(searchString = "@naturalsciences", n = 1000)

## Check how many tweets there were
length(scioSearch)
Great, so now we have an R list with all of our tweets. If we check the length of our search results, it's 260, so there were 260 tweets that mentioned @naturalsciences in the last week or so. What do we do with it? Well, a pretty quick way to visualize what we got is with a word cloud. Creating word clouds in R is boilerplate stuff, so I wrote a simple function that will extract the full text of all of our search results and make a word cloud.
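A minimal sketch of such a function, assuming the tm and wordcloud packages are installed (the function name and the min.freq cutoff are my choices, not from the gist):

# Build a word cloud from twitteR search results
library(tm)
library(wordcloud)

makeWordCloud <- function(searchResults, extraStopwords = NULL) {
  # Pull the full text out of each status object
  tweetText <- sapply(searchResults, function(x) x$text)
  # Build and clean a corpus: lower case, strip punctuation, numbers, stop words
  corpus <- Corpus(VectorSource(tweetText))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, c(stopwords("english"), extraStopwords))
  # Tally word frequencies and draw the cloud
  tdm <- TermDocumentMatrix(corpus)
  freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freqs), freqs, min.freq = 3, random.order = FALSE)
}

makeWordCloud(scioSearch, extraStopwords = "naturalsciences")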
If we look at this absolute word cloud, we can see that several twitter folks were mentioned a lot (probably via retweets), especially whysharksmatter, DNLee, and ehmee. This makes sense because between the 3 of them, they have ~45k followers. People were also talking about stormfest, scio14, and skinning squirrels. In fact if we were to look at which tweet was retweeted the most, we'd see it's this one.
So it's no wonder this is what comes out in the word cloud. In fact, of the 260 tweets, I calculated that ~50% were retweets, not original tweets.
Now, this gives us a big picture of what is going on, but let's say we want to drill down a bit. When it comes to analyzing tweets, the hard part is often not getting the data, it's deciding what to do with it. One thing we might be interested in is a topic's "velocity" on twitter, that is, how rapidly people are tweeting about something. So let's clean up our data, and then plot the cumulative number of tweets vs. time.
# Create data frame and plot tweets vs time
library(plyr)
library(ggplot2)

## Grab data from search results, put it into a nice data frame
fullText <- ldply(scioSearch, function(x) return(x$text))
retweets <- ldply(scioSearch, function(x) return(x$retweetCount))
wasRetweet <- ldply(scioSearch, function(x) grepl("RT", x$text))

## Extract user names
screenName <- ldply(scioSearch, function(x) return(x$screenName))

## Extract time stamps
dates <- ldply(scioSearch, function(x) return(as.POSIXct(x$created)))

## View the fraction of retweets
print(sum(wasRetweet) / length(scioSearch))

## Create tweet date data frame
tweetDF <- data.frame(screenName, dates, fullText)

## Add accumulation (results come back newest first, so count down)
tweetDF$tweetN <- nrow(tweetDF):1
tweetDF$diffs <- rapidTweets(dates$V1, thresh = 2500)
colnames(tweetDF) <- c("screenName", "date", "fullText", "tweetN", "groups")

ggplot(tweetDF, aes(x = date, y = tweetN)) + geom_point() +
  theme_bw(20) + xlab("Date") + ylab("Cumulative tweet number")
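The rapidTweets helper called above lives in the gist. Here's a rough sketch of the idea (my reconstruction, not the exact gist code): a new group starts whenever the gap between consecutive tweets exceeds a threshold in seconds, and groups that never accumulate enough tweets are lumped together as background traffic.

# Sketch of a rapidTweets-style grouping function (a reconstruction,
# not the exact gist code). Tweets separated by less than `thresh`
# seconds share a group; groups with fewer than `minRun` tweets are
# collapsed into group 0 ("background" traffic).
rapidTweets <- function(dates, thresh = 2500, minRun = 10) {
  # Gaps in seconds between consecutive tweets (searchTwitter returns
  # newest first, so take absolute values)
  gaps <- c(Inf, abs(as.numeric(diff(dates), units = "secs")))
  # Start a new group at every gap that exceeds the threshold
  groups <- cumsum(gaps >= thresh)
  # Collapse small groups into group 0
  counts <- table(groups)
  small <- as.numeric(names(counts)[counts < minRun])
  groups[groups %in% small] <- 0
  groups
}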
Now we can see there are a few areas where a lot of tweets are happening with long periods of not much going on. To quantify this, I wrote a quick little algorithm that will label tweets based on group size, and tweet time interval. I applied that to the search results, grouping by tweet speed, and then I text mined each group. In the results below, groups were formed by tweets where more than 10 tweets in a row happened in less than 2500 seconds (~40 minutes). I then color coded the groups, and plotted the 4 most frequent words associated with each group. I won't put all this code in the post, but you can find it here in the gist. Here's the resulting figure:
Now we have a bit of insight into what was driving traffic when, and how that relates to events at the museum. So on Thursday, Feb. 27th, lots of folks were talking about @whysharksmatter and @DNLee, #scio14, and maybe something to do with sharks. If I examine the time stamps of the next two groups, I can see that the blue group is the morning, and the red is the afternoon of Feb. 28th. I know that the afternoon is when the squirrel skinning happened, and that is reflected in the tweets. It seems like in the morning (blue group), people were really interested in something to do with arthropods and weather. By classifying tweets with this "velocity", we can also look at what's happening when these big traffic-driving events weren't happening. Below is the word cloud of all the non color coded points (tweets probably not spurred by one big event).
In those non big traffic tweets it looks like people were still excited about the squirrel skinning, something happening on Thursday, and an exhibit on storms, or live storms. One last caveat: text mining twitter data is an iterative process. You often have to pull out stop words; for instance, I had to pull out "naturalsciences" because that was always the biggest word. Also, people's screen names often stand out if you have lots of retweets, so it helps to know who some of your biggest retweeters are. So there you have it, a few low-hanging twitter metrics that anyone can grab. If there's enough interest in this, I could write a basic package that would make these a bit easier, and if you have others, feel free to suggest them.
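As an illustration of that stop-word step, here's a minimal, self-contained sketch using the tm package (the toy tweets are made up for the example; the extra stop words are the query term and a big retweeter from this dataset):

library(tm)

# Toy example: two made-up tweets mentioning the museum
tweetText <- c("RT @whysharksmatter: squirrel skinning at @naturalsciences!",
               "Loved the storm exhibit at @naturalsciences today")

corpus <- Corpus(VectorSource(tweetText))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
# Drop standard English stop words plus the query term and big retweeters,
# which would otherwise dominate the word cloud
corpus <- tm_map(corpus, removeWords,
                 c(stopwords("english"), "naturalsciences", "whysharksmatter"))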