sdelgadoc / download-tweets-ai-text-gen-plus

Licence: MIT License
Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation

download-tweets-ai-text-gen-plus

A small Python 3 script to download public Tweets from Twitter accounts into a format suitable for AI text generation tools (such as gpt-2-simple for finetuning GPT-2).

  • Retrieves all tweets as a simple CSV with a single CLI command
  • Preprocesses tweets to remove URLs, extra spaces, and optionally usertags/hashtags
  • Saves tweets after each collection in case there is an error or you want to end collection early
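The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `clean_tweet` and the regular expressions are assumptions chosen for the example, and the real script may handle edge cases differently.

```python
import re

# Hypothetical patterns for this sketch; the real script may use different ones.
URL_RE = re.compile(r"https?://\S+")
USERTAG_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
SPACES_RE = re.compile(r"\s+")


def clean_tweet(text, strip_usertags=False, strip_hashtags=False):
    """Remove URLs and extra whitespace; optionally drop @tags and #tags."""
    text = URL_RE.sub("", text)
    if strip_usertags:
        text = USERTAG_RE.sub("", text)
    if strip_hashtags:
        text = HASHTAG_RE.sub("", text)
    # Collapse the runs of whitespace left behind by the removals.
    return SPACES_RE.sub(" ", text).strip()


print(clean_tweet("Hello   @bob check https://t.co/abc #fun", strip_usertags=True))
# prints: Hello check #fun
```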

You can view examples of AI-generated tweets from datasets retrieved with this tool in the /examples folder.

Setup

First, clone this repository onto your system and install dependencies with the following commands:

git clone https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus.git
cd download-tweets-ai-text-gen-plus
pip3 install -r requirements.txt

Previous versions of this code used scraping libraries to collect tweets. Since then, Twitter has made scraping harder while providing more robust tweet collection APIs. In response, we ported this code to run only with the Twitter API.

To continue the setup, create a Twitter app so you can obtain access to the Twitter API. Once you create an app, generate access tokens, and input them into the section of the keys.py file shown below.

keys = {'consumer_key': "",
        'consumer_secret': "",
        'access_token': "",
        'access_token_secret': ""}

Finally, go to the Twitter API's Dev environments page, generate a Dev environment for the Full Archive API, and input the environment's name into the label section of the keys.py file shown below.

label = ""
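Before running the script, it can help to confirm that no credential was left empty. The check below is a small sketch that mirrors the structure of keys.py shown above; the helper `missing_settings` is hypothetical and is not part of the repository.

```python
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def missing_settings(keys, label):
    """Return the names of the settings that are still empty (hypothetical helper)."""
    missing = [name for name in REQUIRED_KEYS if not keys.get(name, "").strip()]
    if not label.strip():
        missing.append("label")
    return missing


# Placeholder values for illustration only; never commit real credentials.
example_keys = {
    "consumer_key": "abc",
    "consumer_secret": "",
    "access_token": "xyz",
    "access_token_secret": "",
}
print(missing_settings(example_keys, label=""))
# prints: ['consumer_secret', 'access_token_secret', 'label']
```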

Usage

The script is run via a command line interface. In a terminal, change (cd) into the directory where the script is stored, then run:

python3 download_tweets.py <twitter_username> 100

For example, to download 100 tweets (sans retweets/replies/quote tweets) from Twitter user @santiagodc, run:

python3 download_tweets.py santiagodc 100

NOTE: The Twitter API's free tier has a collection limit of 5,000 tweets per month, so set a tweet limit to avoid reaching it too quickly.

The script can also download tweets from multiple usernames at one time. To do so, first create a text file (.txt) with the list of usernames. Then, run the script referencing the file name:

python3 download_tweets.py <twitter_usernames_file_name> 100

The tweets will be downloaded to a single-column CSV titled <usernames>_tweets.csv.
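The layout of the usernames file is not spelled out here. As a sketch only, assuming one username per line with an optional leading @, the file contents could be parsed like this (the helper `parse_usernames` is hypothetical, and the script's own parsing may differ):

```python
def parse_usernames(lines):
    """Return unique usernames in order, assuming one username per line."""
    names = []
    for line in lines:
        name = line.strip().lstrip("@")
        # Skip blank lines and repeated usernames.
        if name and name not in names:
            names.append(name)
    return names


# In practice the lines would come from the file, e.g. open("usernames.txt").
sample_lines = ["santiagodc\n", "@minimaxir\n", "\n", "santiagodc\n"]
print(parse_usernames(sample_lines))
# prints: ['santiagodc', 'minimaxir']
```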

The parameters you can pass to the command line interface (positionally or explicitly) are:

  • username: Username of the account whose tweets you want to download, or the name of a .txt file listing multiple usernames [required]
  • limit: Number of tweets to download [default: all tweets possible]
  • include_replies: Include replies from the user in the dataset [default: False]
  • strip_usertags: Strips out @ user tags in the tweet text [default: False]
  • strip_hashtags: Strips out # hashtags in the tweet text [default: False]
  • sentiment: Adds the specified number of sentiment categories to the output so you can then generate positive/negative tweets by changing a parameter [default: 0, possible values: 0, 3, 5, 7]
  • text_format: Specifies the format in which tweets will be returned. The 'simple' format only returns the tweet text. The 'reply' format returns information on preceding tweets to train an AI that can reply to tweets [default: 'simple', possible values: 'simple', 'reply']
  • timeframe: Specifies when to start grabbing tweets from [default: March 22nd, 2006]

How does the sentiment functionality work

The sentiment parameter adds a sentiment category to the tweet text. This information allows the user to train and generate text with different sentiments by changing a parameter.

The output format using the 'simple' text format is the following:

[Sentiment category]
[Tweet text for the tweet that was collected]

The sentiment parameter accepts an integer that specifies the number of sentiment categories that are returned. The sentiment categories for the different possible parameters are the following:

  • 0: No sentiment category is returned
  • 3: POSITIVE, NEUTRAL, NEGATIVE
  • 5: VERY POSITIVE, POSITIVE, NEUTRAL, NEGATIVE, VERY NEGATIVE
  • 7: EXTREMELY POSITIVE, VERY POSITIVE, POSITIVE, NEUTRAL, NEGATIVE, VERY NEGATIVE, EXTREMELY NEGATIVE
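How a tweet is assigned to one of these categories is not described here. Purely as an illustration, assuming some sentiment model yields a polarity score between -1 and 1, equal-width buckets would map scores to the labels above. The function `sentiment_label` and the bucketing rule are assumptions, not the script's method.

```python
LABELS = {
    3: ["NEGATIVE", "NEUTRAL", "POSITIVE"],
    5: ["VERY NEGATIVE", "NEGATIVE", "NEUTRAL", "POSITIVE", "VERY POSITIVE"],
    7: ["EXTREMELY NEGATIVE", "VERY NEGATIVE", "NEGATIVE", "NEUTRAL",
        "POSITIVE", "VERY POSITIVE", "EXTREMELY POSITIVE"],
}


def sentiment_label(score, categories):
    """Map a polarity score in [-1, 1] to a label (assumed equal-width buckets)."""
    if categories == 0:
        return None  # sentiment output disabled
    labels = LABELS[categories]
    score = max(-1.0, min(1.0, score))
    # Scale [-1, 1] onto bucket indices 0..n-1; a score of 1.0 goes in the last bucket.
    index = min(int((score + 1.0) / 2.0 * len(labels)), len(labels) - 1)
    return labels[index]


print(sentiment_label(0.0, 3))   # prints: NEUTRAL
print(sentiment_label(-1.0, 7))  # prints: EXTREMELY NEGATIVE
```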

How does the text_format functionality work

The code supports collecting tweets in a format for training an AI that can reply to other tweets. The output format is based on the format used to train the Subreddit Simulator Reddit community.

The output format is the following:

****ARGUMENTS
ORIGINAL or REPLY: Whether the tweet is an original tweet or a reply
SENTIMENT: If the sentiment parameter is used, text describing the tweet text's sentiment
****PARENT
[Tweet text for the topmost tweet in a reply thread]
****IN_REPLY_TO
[Tweet text for the tweet that is being responded to]
****TWEET
[Tweet text for the tweet that was collected]
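To make the layout concrete, here is a rough sketch that assembles one record in this format. The function `format_reply_record` is hypothetical, and how the ARGUMENTS values are written (here, one bare value per line) is an assumption; the script's exact output may differ.

```python
def format_reply_record(tweet, parent=None, in_reply_to=None, sentiment=None):
    """Build one training record in the reply format (assumed field layout)."""
    kind = "REPLY" if in_reply_to else "ORIGINAL"
    lines = ["****ARGUMENTS", kind]
    if sentiment:
        lines.append(sentiment)
    lines += [
        "****PARENT", parent or "",
        "****IN_REPLY_TO", in_reply_to or "",
        "****TWEET", tweet,
    ]
    return "\n".join(lines)


print(format_reply_record(
    "Thanks!",
    parent="Big news today",
    in_reply_to="Congrats on the launch",
    sentiment="POSITIVE",
))
```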

To collect tweets in this reply format, run the following command:

python3 download_tweets.py <twitter_username> None True False False False 3 reply

How does the timeframe functionality work

By specifying a date, the script will download tweets from the timeframe value to the present. By default, it downloads every tweet from a given user (or users) starting from March 22nd, 2006, around the time the first tweet was sent. The timeframe parameter is precise to the minute: it takes a year, month, day, hour, and minute, in that order, in the format YYYYMMDDHHMM.
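As a quick illustration of the YYYYMMDDHHMM format, the snippet below parses such a value with Python's standard library. The helper `parse_timeframe` and the default string are assumptions made for this sketch; they are not taken from the script.

```python
from datetime import datetime

# Assumed encoding of the default start date, March 22nd, 2006 at 00:00.
DEFAULT_TIMEFRAME = "200603220000"


def parse_timeframe(value=DEFAULT_TIMEFRAME):
    """Parse a YYYYMMDDHHMM string into a datetime (hypothetical helper)."""
    value = str(value)
    # Require exactly 12 digits so shorter inputs are rejected, not misread.
    if len(value) != 12 or not value.isdigit():
        raise ValueError("timeframe must be 12 digits in the form YYYYMMDDHHMM")
    return datetime.strptime(value, "%Y%m%d%H%M")


print(parse_timeframe("202001151230"))  # prints: 2020-01-15 12:30:00
print(parse_timeframe())                # prints: 2006-03-22 00:00:00
```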

How to Train an AI on the downloaded tweets

gpt-2-simple has a special case for single-column CSVs, where it will automatically process the text for best training and generation (i.e. by adding <|startoftext|> and <|endoftext|> to each tweet, allowing independent generation of tweets).
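The effect of those tokens can be pictured with a small sketch. This only illustrates the idea and is not gpt-2-simple's own code; the helper `wrap_tweets` is hypothetical.

```python
START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_tweets(tweets):
    """Delimit each tweet so a model can learn where one starts and ends."""
    return "\n".join(f"{START_TOKEN}{tweet}{END_TOKEN}" for tweet in tweets)


print(wrap_tweets(["first tweet", "second tweet"]))
# prints:
# <|startoftext|>first tweet<|endoftext|>
# <|startoftext|>second tweet<|endoftext|>
```

Because every tweet is delimited, generation can start from the start token and be cut at the end token, which is what the prefix and truncate parameters rely on.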

You can use this Colaboratory notebook (optimized from the original notebook for this use case) to train the model on your downloaded tweets, and generate massive amounts of Tweets from it. Note that without a lot of data, the model might easily overfit; you may want to train for fewer steps (e.g. 500).

When generating, you'll always need to include certain parameters to decode the tweets, e.g.:

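import gpt_2_simple as gpt2

# Assumes a model has already been finetuned and saved under the default run name.
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

# The prefix/truncate tokens make each sample a single, cleanly delimited tweet.
gpt2.generate(sess,
              length=200,
              temperature=0.7,
              prefix='<|startoftext|>',
              truncate='<|endoftext|>',
              include_prefix=False
              )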

Helpful Notes

  • You'll need thousands of tweets at minimum to feed to the input model for good generation results (ideally 1 MB of input text data, although with tweets that is hard to achieve).
  • To help you reach the 1 MB of input text data, you can load data from multiple similar Twitter usernames
  • The download will likely end much earlier than the theoretical limit (inferred from the user profile) as the limit includes retweets/replies/whatever cache shenanigans Twitter is employing.
  • The legalities of distributing downloaded tweets are ambiguous, so it's recommended to avoid committing raw Twitter data to GitHub; this is also why examples of such data are not included in this repo. (AI-generated tweets themselves likely fall under derivative work/parody protected by Fair Use.)
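To see how close a dataset is to the 1 MB guideline, a rough size check like the one below may help. It is a sketch under stated assumptions: "1 MB" is read as one million bytes of UTF-8 text, and the helper `dataset_progress` is hypothetical.

```python
TARGET_BYTES = 1_000_000  # assumption: "1 MB" read as one million bytes


def dataset_progress(tweets):
    """Return (size in bytes, fraction of the 1 MB target) for a list of tweets."""
    size = sum(len(tweet.encode("utf-8")) for tweet in tweets)
    return size, size / TARGET_BYTES


size, fraction = dataset_progress(["a" * 100] * 5000)
print(f"{size} bytes, {fraction:.0%} of the target")
# prints: 500000 bytes, 50% of the target
```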

Maintainer

Santiago Delgado (@santiagodc) based on download-tweets-ai-text-gen by @minimaxir

License

MIT

Disclaimer

This repo has no affiliation with Twitter Inc.
