All Projects → MohamedHmini → tweetsOLAPing

MohamedHmini / tweetsOLAPing

Licence: other
implementing an end-to-end tweets ETL/Analysis pipeline.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to tweetsOLAPing

Twitter
Twitter API for Laravel 5.5+, 6.x, 7.x & 8.x
Stars: ✭ 755 (+3045.83%)
Mutual labels:  tweets, twitter-api, api-client
Twitterdelete
💀 Delete your old, unpopular tweets.
Stars: ✭ 231 (+862.5%)
Mutual labels:  tweets, twitter-api
Birdseed
🐦 🎲 Use Twitter's Search API to get random numbers
Stars: ✭ 81 (+237.5%)
Mutual labels:  tweets, twitter-api
stweet
Advanced python library to scrap Twitter (tweets, users) from unofficial API
Stars: ✭ 287 (+1095.83%)
Mutual labels:  tweets, twitter-api
Linqtotwitter
LINQ Provider for the Twitter API (C# Twitter Library)
Stars: ✭ 401 (+1570.83%)
Mutual labels:  tweets, twitter-api
Twitter Post Fetcher
Fetch your twitter posts without using the new Twitter 1.1 API. Pure JavaScript! By Jason Mayes
Stars: ✭ 886 (+3591.67%)
Mutual labels:  tweets, twitter-api
TwitterAutoReplyBot
This is a tiny Python script that replies to a specified number of tweets containing a specified hashtag.
Stars: ✭ 33 (+37.5%)
Mutual labels:  tweets, twitter-api
Nba api
An API Client package to access the APIs for NBA.com
Stars: ✭ 881 (+3570.83%)
Mutual labels:  analysis, api-client
tweet-delete
Self-destructing Tweets so you too can be cool 😎
Stars: ✭ 68 (+183.33%)
Mutual labels:  tweets, twitter-api
SMMT
Social Media Mining Toolkit (SMMT) main repository
Stars: ✭ 116 (+383.33%)
Mutual labels:  tweets, twitter-api
trumptweets
Download data on all of Donald Trump's (@RealDonaldTrump) tweets
Stars: ✭ 39 (+62.5%)
Mutual labels:  tweets, twitter-api
Tweetie
Simple jQuery Twitter feed plugin
Stars: ✭ 314 (+1208.33%)
Mutual labels:  tweets, twitter-api
Twitter Scraper
Scrape the Twitter Frontend API without authentication.
Stars: ✭ 3,037 (+12554.17%)
Mutual labels:  tweets, twitter-api
archive-explorer-web
Browse your Twitter archive with a friendly, responsive, full experience, and quickly delete the tweets you don't want.
Stars: ✭ 19 (-20.83%)
Mutual labels:  tweets, twitter-api
Archive-Tweets
Archive and Delete Liked and Posted Tweets
Stars: ✭ 28 (+16.67%)
Mutual labels:  tweets, twitter-api
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-16.67%)
Mutual labels:  datawarehouse, etl-pipeline
Twitter Api Client
A user-friendly Node.js / JavaScript client library for interacting with the Twitter API.
Stars: ✭ 517 (+2054.17%)
Mutual labels:  twitter-api, api-client
keen-analysis.js
A light JavaScript client for Keen
Stars: ✭ 40 (+66.67%)
Mutual labels:  analysis, api-client
nitter scraper
Scrape Twitter API without authentication using Nitter.
Stars: ✭ 31 (+29.17%)
Mutual labels:  tweets, twitter-api
discord-twitter-webhooks
🤖 Stream tweets to Discord
Stars: ✭ 47 (+95.83%)
Mutual labels:  tweets, twitter-api

tweetsOLAPing : an end-to-end social-media data-warehousing project :

made-with-python Ask Me Anything !

i'll walk you through an execution example using light-weight (very) data to show you the results.

Table of Contents :

Twitter DataTypes :

TweetDataType :

{
   "created_at":"Sat Jul 01 23:47:16 +0000 2017",
   "id":881298189072072708,
   "id_str":"881298189072072708",
   "text":"seu perfil foi visto por 5 pessoas nas \u00faltimas 4 horas https:\/\/t.co\/cKb35CahC7",
   "source":"\u003ca href=\"http:\/\/www.twitcom.com.br\" rel=\"nofollow\"\u003eTwitcom - Comunidades \u003c\/a\u003e",
   "truncated":false,
   "in_reply_to_status_id":null,
   "in_reply_to_status_id_str":null,
   "in_reply_to_user_id":null,
   "in_reply_to_user_id_str":null,
   "in_reply_to_screen_name":null,
   "user": UserDataType,
   "geo":null,
   "coordinates":null,
   "place":null,
   "contributors":null,
   "is_quote_status":false,
   "retweet_count":0,
   "favorite_count":0,
   "entities":{
      "hashtags":[

      ],
      "urls":[
         {
            "url":"https:\/\/t.co\/cKb35CahC7",
            "expanded_url":"http:\/\/twcm.me\/5bHmW",
            "display_url":"twcm.me\/5bHmW",
            "indices":[
               55,
               78
            ]
         }
      ],
      "user_mentions":[

      ],
      "symbols":[

      ]
   },
   "favorited":false,
   "retweeted":false,
   "possibly_sensitive":false,
   "filter_level":"low",
   "lang":"pt",
   "timestamp_ms":"1498952836660"
}

UserDataType :

{
   "id":2696402179,
   "id_str":"2696402179",
   "name":"$AVAGE",
   "screen_name":"SavageHumor",
   "location":null,
   "url":null,
   "description":"SAVAGE TWEETS \nWARNING: 18+ Content",
   "protected":false,
   "verified":false,
   "followers_count":150201,
   "friends_count":0,
   "listed_count":94,
   "favourites_count":85,
   "statuses_count":10696,
   "created_at":"Thu Jul 31 18:52:37 +0000 2014",
   "utc_offset":-18000,
   "time_zone":"Central Time (US & Canada)",
   "geo_enabled":false,
   "lang":"en",
   "contributors_enabled":false,
   "is_translator":false,
   "profile_background_color":"000000",
   "profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
   "profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
   "profile_background_tile":false,
   "profile_link_color":"DD2E44",
   "profile_sidebar_border_color":"000000",
   "profile_sidebar_fill_color":"000000",
   "profile_text_color":"000000",
   "profile_use_background_image":false,
   "profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/875059551204249601\/J_XlKaiO_normal.jpg",
   "profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/875059551204249601\/J_XlKaiO_normal.jpg",
   "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2696402179\/1416368695",
   "default_profile":false,
   "default_profile_image":false,
   "following":null,
   "follow_request_sent":null,
   "notifications":null
}

0- ENV set-up:

the extraction/transformation steps of the pipeline will need the following environment set-up :

pip install virtualenv
virtualenv tweetsOLAPingENV
source tweetsOLAPingENV/bin/activate
pip install -r requirements.txt

as far as the Loading ETL step and the final analysis, make sure you have the following :

  1. MSSMS (microsoft Sql server managment studio).
  2. SSDT (Sql server data tools).
  3. SSIS (Sql server integration service).
  4. SSAS (Sql server analysis service).
  5. powerBI.

1- ETL pipeline :

a) Extraction :

the very first step is to prepare the tweetsPOOLs.csv file as in https://github.com/MohamedHmini/tweetsOLAPing/blob/master/extraction/archivedTweetsCrawler/tweetsPOOLs.csv.

then we shall execute the Scrapy Spider to crawl the archive needed website pages as follows :

  cd extraction/archivedTweetsCrawler
  scrapy crawl -o tweetsSTREAMs.csv tweets

after that you will get a file like the one in here : https://github.com/MohamedHmini/tweetsOLAPing/blob/master/sample-data/tweetsSTREAMs.csv

the next step is to structure that CSV file into a tree like structure composed of directories and files :

  cd extraction/
  python tweetsPOOLsParser.py tweetsSTREAMs.csv ../root_urls/

the output will be somewhat like this (light-weight) example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/urls_root

next we have to perform a random selection to select only some URLs and not all, note that each URL will bring you up to 5000 tweets :

  cd extraction/
  python urlsRandomSelector.py ../root_urls/ ../chosen_urls.txt 700

again check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/blob/master/sample-data/chosen_urls.txt

now after that we have all the needed URLs in a single file we can start downloading :

  cd extraction/
  python tweetsDownloader.py ../chosen_urls.txt ../downloaded_pools/ ../download_error.txt

again check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/downloaded-pools

after we downloaded the files you will notice that they are compressed with a .bz2 file extension thus you have to decompress them somehow, i won't provide a solution in this stage.

again check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/decompressed-pools

note that i provide a script to lookup tweets from the twitterAPI directly using the downloaded tweets IDs, cause the tweets have been pulled in the stream by the collected you will find that most of them has zero metrics, i solve this solution using a context-aware random generator.

b) Transformation :

as for the transformation it's composed of two parts, first we transform our data from JSON to CSV and create all the needed derived attributes, also we shall remove twitto duplicates using, beware that the cleanUsersCSV.py script will using multi-threading to speed-up the I/O operations and the result will be stored in a directory, you can then merge them on your own.

  cd transformation/
  python prepareTweets.py ../decompressed_pools/ ../tweets.csv ../twittos.csv ../trans-err.txt
  python cleanUsersCSV.py ../twittos.csv ../twittos

the second part consists of performing NLP analysis on the tweets to generate the sentiment-score and the content-classification, you have to provide the projectkey.json file from google NLP APIin the same directory.

  cd transformation/
  python performNLPanalysis.py ../decompressed_pools/ ../tweets_sentiments.csv ../sent-err.txt

again check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/processed

c) Loading :

before starting the SSIS process you have to provide a normalized data in the right path (shall be fixed) :

  cd loading/
  python dataNormalization.py ../twittos.csv ../tweets.csv ../data/normalized

again check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/normalized-data

for SSIS logic i provide the full model : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/loading/tweetsOLAPing_loading

as well as the SSAS logic is fully provided : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/analysis/tweetsOLAPing_analysis

SSIS modeling :

after setting up all the connections ( the normalized data as well as the OLEDB destinations), we now arrive at the integration step or the data loading :

SSAS cube modeling :

2. Analysis :

MDX queries :

SELECT 
  NON EMPTY
  (
    [Twitto Meta Data].[User Category].children, 
    [Measures].[Retweet Count]
  ) ON COLUMNS,
  NON EMPTY 
  (
    [Twitto Meta Data].[User Activity].children
  ) ON ROWS
FROM
  [TweetsOLAPing_cube]
SELECT 
  NON EMPTY
  (
    [Twitto Meta Data].[User Category].children, 
    [Measures].[Retweet Count]
  ) ON COLUMNS,
  NON EMPTY 
  (
    [Twitto Meta Data].[User Activity].children, 
    [Tweet Meta Data].[Sentiment Tag].children
  ) ON ROWS
FROM
  [TweetsOLAPing_cube]
SELECT 
  NON EMPTY
  (
    [Tweet Meta Data].[Media Type].children, 
    [Measures].[Retweet Count]
  ) ON COLUMNS,
  NON EMPTY 
  (
    [Tweet Meta Data].[Has Hashtags].children, 
    [Date].[Weekday].children
  ) ON ROWS
FROM
  [TweetsOLAPing_cube]

powerBI report :

the final report is provided in : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/analysis/powerBI

here are some examples :

alt text alt text alt text alt text alt text

references :

[1] Maha Ben Kraiem, Jamel Feki, Ka¨ıs Khrouf, Franck Ravat, Olivier Teste. OLAP of the tweets: From modeling to exploitation. IEEE International Conference on Research Challenges in Information Science, Marrakesh, Morocco, May 2014.

[2] Maha Ben Kraiem, Jamel Feki, Ka¨ıs Khrouf, Franck Ravat, Olivier Teste. OLAP4Tweets: Multidimensional Modeling of tweets. 19th East-European Conference on Advances in Databases and Information Systems, Poitiers, France, September 2015.

[3] Nafees Ur Rehman, Svetlana Mansmann, Andreas Weiler, Marc H. Scholl. Building a DataWarehouse for Twitter Stream Exploration. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Istanbul, turkey, August 2012.

MOHAMED-HMINI 2020

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].