NOTE: If you like this, you'll also like my next project, which performs sentiment analysis on Tweets by keyword!

Twitter AWS Comprehend

I recently learned of Amazon Comprehend and wanted to play around with its sentiment analysis.

So I built this app to download user timelines from Twitter, send them to AWS for analysis, and visualize them in Splunk. The following metrics are reported:

Start and end dates for tweets
Number of tweets
A graph of "Sentiment Over Time"
Number of F-bombs used
Net Happiness Index (percent of happy tweets minus precent of unhappy tweets)
Top Positive and Negative tweets

Screenshots

Additional screenshots are available in the img/ directory.

Requirements

An AWS Account
The AWS Command Line Interface installed and configured
Python 3
Run the command pip install -r requirements.txt to download all required packages
A Twitter app created at https://apps.twitter.com/. Read-only access is fine.
A running Splunk Instance. A free copy of Splunk can be downlaoded from Splunk.com.

Getting started

Downloading Tweets

You'll want to start off by running the script ./0-fetch-tweets -u username -n num_tweets_to_download to download Tweets via Twitter's API. When you first run the script, it will notice the lack of credentials and send you over to Twitter's App page, where you'll need to create an app. Then grab the App Key and App Secret and enter them when the script prompts you. Next, you'll be sent over to Twitter one more time and will receive a PIN to enter in the script. Do so, and you'll be authenticated to Twitter. This is a one-time process, so once you do it, you should not need to do it again.

The maximum number of tweets you can download from Twitter's API is 3200, but the actual number you get will be much lower as RTs are ignored and Twitter's API is really weird about giving you the actual number of tweets that you ask for. I do not understand it.

Analyizing Tweets

WARNING: This costs money! Based on AWS's pricing structure, a tweet will be treated as 3 "units", which will cost you $.0003, or 3 hundredths of a cent to analyze. So 100 tweets will cost 3 cents, while 1,000 tweets will cost 30 cents.

The syntax for the script to analyize sentment is 1-analyze-sentiment -u username -n num_tweets [ --fake ]

I strongly encourage you to run the script with --fake on the first few tries so that you can fake calls to AWS and get comfortable running the script.

Feeding the analyzed tweets to Splunk

The syntax for the script to feed the tweets into Splunk is: 2-ingest-into-splunk -u username [ --splunk-port port ] [ --splunk-host hostname ] Defaults are 9997 and localhost, respectively.

The data is sent to Splunk over a raw TCP connection, so you'll want to configure Splunk accordingly. Here's a screenshot to help with that:

You'll want to have this source saving to the main Index.

Visualization

This is the most interesting part. So far, we are making the following assumptions about Splunk:

Use of the main Index
Use of the Sourcetype twitter
Use of the Splunk app Search

Assuming those are the case, you're good to go! Just copy the file splunk/twitter_activity_sentiment.xml into $SPLUNK_HOME/etc/apps/search/local/data/ui/views, restart Splunk, and you should be all set!

Alternatively, a less convoluted way (which does not require restarting Splunk) would be to create a new dashboard, click Edit, click Source, and paste in the contents of twitter_activity_sentiment.xml.

A Word on Idempotency

I am a HUGE fan of Idempotency. Especially because AWS Comprehend costs money! Once I analyze a tweet, I never want to analyze it again. So I made a conscious choice to build my code that way. So, for example, if a tweet is analyzed and later the script 0-fetch-tweets is run, that code will not overwrite the sentiement fields. And once a tweet is analyzed by 1-analyze-sentiemtn, it will never be analyzed again!

One place where this does break down is with Slplunk, since the data is fed in through raw TCP and Splunk does not seem to give any acknowledgement (don't know why...), running that script twice will result in duplicate events. The way around that is to run a Splunk query like index=main sourcetype=twitter username=dmuth | delete before re-ingesting any data. I'm not thrilled with this particular workflow, and am looking at some alternatives.

Future TODO Items

~~Make tweet ingestion idempotent~~
~~See about using Twitter's search API to get older tweets~~ Seriously, Twitter. Let us get more than 3,200 Tweets through your API!
Come up with a metric to measure profanity on an account, not just f-bombs
Add "username" field to the database schema so we can analyze multiple users at once
Dockerize this to download a user's tweets, analyzes them, exports them, then loads up a Splunk instance to ingest them

Contact

I had fun writing this, and I hope you had enjoy using this. If there are any issues, feel free to file an issue against this project, hit me up on Twitter or Facebook, or drop me a line: dmuth AT dmuth DOT org.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

dmuth / twitter-aws-comprehend

Programming Languages

Labels

Projects that are alternatives of or similar to twitter-aws-comprehend