Analyzing Hurricane Sandy Related Tweets with Treasure Data
For the full technical write-up, please look at our documentation.
Social Media is an enormous source of information. For example, with more than 350 million Tweets per day, there is a lot of information to be mined. The question is: “where do you start?”
As many of you know, Hurricane Sandy wreaked havoc in Haiti and the East Coast of the United States. We decided to collect Tweets with the word “Hurricane” into Treasure Data and asked a few questions.
Which Hurrican-related Tweet got the most number of eyeballs (at least by one metric)? Here is the query to get the top 10.
$ td query -w -d twitterstream '
SELECT MAX(retweet_count) AS rt_ct, text
FROM hurricane_sandy GROUP BY text
ORDER BY rt_ct DESC LIMIT 510'
And here are the results:
- 61028 RT @justinbieber: everyone dealing with the hurricane up north be safe
- 24868 RT @hurricannesandy: WHAT IF GANGAM STYLE WAS ACTUALLY JSST A GIANT RAIN DANCE AND WE BROUGHT THIS HURRICANE ON OURSELVES?…
- 12549 RT @Seth_Fried: If your apartment is hit by a dolphin, DO NOT GO OUT TO SEE IF THE DOLPHIN IS OKAY. That’s how the hurricane tricks you …
- 11084 RT @AHurricaneSandy: I WENT TO HIGHSCHOOL WIT IRENE. SHE CAN’T EVEN TWERK. SHE AIN’T BOUT DAT HURRICANE LYFE.
- 10779 RT @FillWerrell: Hurricane Sandy wouldn’t be here if Patrick would’ve just stopped making fun of Texas.
It is no surprise that Justin Bieber’s Tweet has been retweeted most. What’s curious is how this Gangnam Style meme ranked the second. The Internet often associates two seemingly disparate entities in creative ways. This is all just for fun until your company or brand is one of the two entities: at that point, you want to identify the association quickly and act on it if it negatively positions your company or brand.
Here is another look at the Tweets. The following word cloud was generated based on the top 30 most popular bigrams. Font sizes are proportional to frequencies with some thresholding to make the layout look nicer.
$td query -w -d twitterstream 'SELECT NGRAMS(SENTENCES(LOWER(text)), 2, 100) FROM hurricane_sandy'
A few observations here:
- Predictably, the most popular bigram was “Hurricane Sandy”.
- Note that “http t.co” is up next. Of course, this comes from Twitter’s official URL shortener. What’s interesting is how other URL shorteners such as “http bit.ly” or “http goo.gl” are absent from the top 30.
What Time Are People Tweeting?
Twitter has become an increasingly important marketing tool. One natural question to ask is, “when are people active on Twitter?”
$td query -w -d twitterstream 'SELECT COUNT(*) AS ct, from_unixtime(3600*ROUND(unix_timestamp(created_at)/3600)) AS time FROM hurricane_sandy GROUP BY ROUND(unix_timestamp(created_at)/3600) ORDER BY time'
The number of Tweets by Hour
At first, there is nothing too surprising here: a gradual increase toward early afternoon, peaking at 2PM. However, note that Twitter has a global presence and people tweet from many timezones. Yet, it seems that this dataset’s timezone dwarfs other timezones. Can you guess which timezone this is? Pacific Standard Time.
After all, this is not too surprising. PDT has Silicon Valley as well as Hollywood, two groups with a number of power Twitter users in them.
Social media provides valuable datasets, but the challenge is getting and analyzing data quickly. With Treasure Data and its versatile td-agent, you can get up and running with your social media analysis in a few hours, not a few weeks or months.