MatthewWolff / TwitterScraper

License: MIT
Scrape a user's Twitter data! Bypass the 3,200-tweet API limit for a user!


TwitterScraper

Description

Twitter's API limits you to querying a user's 3,200 most recent tweets. This is a pain in the ass. However, we can circumvent this limit using Selenium and some web scraping.

We can query a user's entire Twitter history to find the IDs of each of their tweets. From there, we can use the Tweepy API to retrieve the complete metadata associated with each tweet. You can adjust which metadata are collected by changing the variable METADATA_LIST at the top of scrape.py. Personally, I was just collecting text to train a model, so I only cared about the full_text field in addition to whether the tweet was a retweet.

I've included a list of all available tweet attributes at the top of scrape.py so that you can adjust things as you wish.
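To make the idea concrete, here is a minimal sketch of METADATA_LIST-style filtering. The field names full_text and retweeted come from the description above; created_at and id_str are standard tweet attributes added for illustration, and filter_metadata is a hypothetical helper, not scrape.py's actual code:

```python
# Fields to keep from each tweet's JSON (full_text and retweeted are the
# ones mentioned above; the others are illustrative standard attributes).
METADATA_LIST = ["full_text", "retweeted", "created_at", "id_str"]

def filter_metadata(tweet_json, fields=METADATA_LIST):
    """Keep only the requested top-level attributes of a tweet's JSON."""
    return {field: tweet_json.get(field) for field in fields}
```

Trimming each tweet down to the fields you actually need keeps the output JSON small when you scrape thousands of tweets.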

NOTE: This scraper will notice if a user has fewer than 3,200 tweets. In that case, it will do a "quickscrape" to grab all available tweets at once (significantly faster), storing them in exactly the same manner as a manual scrape.
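The decision behind the quickscrape can be sketched as follows (should_quickscrape is an illustrative helper under my naming, not the repo's actual function):

```python
API_TIMELINE_CAP = 3200  # Twitter's API only serves a user's 3,200 most recent tweets

def should_quickscrape(statuses_count):
    """If the account has fewer tweets than the API cap, everything is
    reachable in one pass, so the slower windowed Selenium crawl is
    unnecessary."""
    return statuses_count < API_TIMELINE_CAP
```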

Requirements (or rather, what I used)

  • python3 (3.7.3)
  • Modules (via pip):
    • selenium (3.141.0)
    • tweepy (3.8.0)
    • requests (2.21.0)
    • requests_oauthlib (1.3.0)
    • beautifulsoup4 (4.7.1)
  • Chrome webdriver (you can use other browsers' drivers too; personally, brew install chromedriver)
  • Twitter API developer credentials
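A setup sketch using the versions listed above (newer releases may work too; the brew line is macOS-specific, as in the list above):

```shell
# Python modules, pinned to the versions the author used
pip3 install selenium==3.141.0 tweepy==3.8.0 requests==2.21.0 \
    requests_oauthlib==1.3.0 beautifulsoup4==4.7.1

# macOS: install a webdriver matching your installed Chrome version
brew install chromedriver
```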

Example:

I'll run the script two times on one of my advisors. By default, the scraper starts from when the user created their Twitter account. I've chosen to look at a one-year window, scraping in two-week intervals. I then go from the beginning of 2019 until the present day, in one-week intervals. The scraped tweets are stored in a JSON file named after the Twitter user's handle.
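The interval-based scraping described above amounts to partitioning the date range into fixed-size windows. A sketch of that partitioning (date_windows is my illustrative name, not a function from scrape.py):

```python
from datetime import date, timedelta

def date_windows(since, until, by):
    """Yield (start, end) date pairs covering [since, until) in `by`-day
    steps -- roughly how a windowed scrape walks the timeline."""
    start = since
    while start < until:
        end = min(start + timedelta(days=by), until)
        yield start, end
        start = end
```

For example, 2018-01-01 to 2019-01-01 at 14-day steps yields 27 windows, the last one truncated to fit the end date.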

$ ./scrape.py --help
usage: python3 scrape.py [options]

scrape.py - Twitter Scraping Tool

optional arguments:
  -h, --help            show this help message and exit
  -u USERNAME, --username USERNAME
                        Scrape this user's Tweets
  --since SINCE         Get Tweets after this date (Example: 2010-01-01).
  --until UNTIL         Get Tweets before this date (Example: 2018-12-07).
  --by BY               Scrape this many days at a time
  --delay DELAY         Time given to load a page before scraping it (seconds)


$ ./scrape.py -u phillipcompeau --by 14 --since 2018-01-01 --until 2019-01-01 
[ scraping user @phillipcompeau... ]
[ 1156 existing tweets in phillipcompeau.json ]
[ searching for tweets... ]
[ found 254 new tweets ]
[ retrieving new tweets (estimated time: 18 seconds)... ]
- batch 1 of 3
- batch 2 of 3
- batch 3 of 3
[ finished scraping ]
[ stored tweets in phillipcompeau.json ]

$ ./scrape.py -u phillipcompeau --since 2019-01-01 --by 7
[ scraping user @phillipcompeau... ]
[ 1410 existing tweets in phillipcompeau.json ]
[ searching for tweets... ]
[ found 541 new tweets ]
[ retrieving new tweets (estimated time: 36 seconds)... ]
- batch 1 of 6
- batch 2 of 6
- batch 3 of 6
- batch 4 of 6
- batch 5 of 6
- batch 6 of 6
[ finished scraping ]
[ stored tweets in phillipcompeau.json ]

$ ./scrape.py -u realwoofy
[ scraping user @realwoofy... ]
[ 149 existing tweets in realwoofy.json ]
[ searching for tweets... ]
[ user has fewer than 3200 tweets, conducting quickscrape... ]
[ found 3 new tweets ]
[ finished scraping ]
[ stored tweets in realwoofy.json ]

Using the Scraper

  • run python3 scrape.py and add the arguments you desire. Try ./scrape.py --help for all options.
    • -u followed by the username [required]
    • --since followed by a date string, e.g., (2017-01-01). Defaults to when the user created their Twitter account
    • --until followed by a date string, e.g., (2018-01-01). Defaults to the current day
    • --by followed by the number of days to scrape at once (default: 7)
      • If someone tweets dozens of times a day, it might be better to use a lower number
    • --delay followed by an integer. This will be the number of seconds to wait on each page load before reading the page
      • if your internet connection is slow, set this higher (default: 3 seconds)
  • a browser window will pop up and begin scraping
  • when the browser window closes, metadata collection begins for all new tweets
  • when collection finishes, it will dump all the data to a .json file that corresponds to the twitter handle
    • don't worry about running two scrapes that have a time overlap; it will only retrieve new tweets!
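The "only retrieve new tweets" behavior on overlapping scrapes comes down to deduplicating on tweet ID. A sketch of that merge (merge_tweets is an illustrative helper, not scrape.py's actual code):

```python
def merge_tweets(existing, new):
    """Combine two scrape results, keeping each tweet ID only once --
    the kind of dedup that makes overlapping date ranges safe."""
    seen = {t["id_str"] for t in existing}
    return existing + [t for t in new if t["id_str"] not in seen]
```

Because tweet IDs are unique and stable, re-scraping a window you have already covered adds nothing to the JSON file.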

Troubleshooting

  • do you get a driver error when you try to execute the script?
    • make sure your browser is up to date and that you have a driver version that matches your browser version
    • you can also open scrape.py and change the driver to use Chrome() or Firefox()
  • does the scraper seem like it's missing tweets that you know should be there?
    • try increasing the --delay parameter; the scraper likely isn't waiting long enough for everything to load
    • try decreasing the --by parameter; there are likely too many tweets showing up in certain windows

Twitter API credentials
