
markowanga / stweet

License: MIT
Advanced Python library to scrape Twitter (tweets, users) via the unofficial API

Programming Languages

python
139335 projects - #7 most used programming language
julia
2034 projects

Projects that are alternatives to or similar to stweet

diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (-81.53%)
Mutual labels:  scraper, crawl, scrape
Twint
An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
Stars: ✭ 12,102 (+4116.72%)
Mutual labels:  tweets, scrape, twint
twitter-d
TypeScript types for Twitter API objects
Stars: ✭ 54 (-81.18%)
Mutual labels:  twitter-api, user, tweet
TrollHunter
Twitter Troll & Fake News Hunter - Crawls news websites and twitter to identify fake news
Stars: ✭ 38 (-86.76%)
Mutual labels:  scraper, twint
Instagram-to-discord
Monitors an Instagram user account and automatically posts new images to a Discord channel via a webhook. Working as of 2022!
Stars: ✭ 113 (-60.63%)
Mutual labels:  scraper, scrapper
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (-81.88%)
Mutual labels:  scraper, crawl
Linkedin scraper
A library that scrapes Linkedin for user data
Stars: ✭ 413 (+43.9%)
Mutual labels:  scraper, users
fiction-dl
A content downloader, capable of retrieving works of (fan)fiction from the web and saving them in a few common file formats.
Stars: ✭ 22 (-92.33%)
Mutual labels:  scraper, scrapper
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+86.76%)
Mutual labels:  scraper, crawl
Scrape Twitter
🐦 Access Twitter data without an API key. [DEPRECATED]
Stars: ✭ 166 (-42.16%)
Mutual labels:  scraper, tweets
Instagram Scraper
Scrapes media, likes, followers, tags and all metadata. Inspired by instagram-php-scraper, bot
Stars: ✭ 2,209 (+669.69%)
Mutual labels:  scraper, scrape
crawler-chrome-extensions
Chrome extensions commonly used by crawler engineers
Stars: ✭ 53 (-81.53%)
Mutual labels:  scraper, crawl
ha-multiscrape
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
Stars: ✭ 103 (-64.11%)
Mutual labels:  scraper, scrape
scrapman
Retrieves real HTML code (with JavaScript executed) from a URL; ultra fast, with support for loading multiple pages in parallel
Stars: ✭ 21 (-92.68%)
Mutual labels:  scraper, scrap
InstagramLocationScraper
No description or website provided.
Stars: ✭ 13 (-95.47%)
Mutual labels:  scraper, scrape
GChan
Scrape boards and threads from 4chan (8kun WIP). Downloads images, videos and HTML if desired.
Stars: ✭ 31 (-89.2%)
Mutual labels:  scraper, scrape
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+1320.56%)
Mutual labels:  scraper, scrape
TwitterScraper
Scrape a User's Twitter data! Bypass the 3,200 tweet API limit for a User!
Stars: ✭ 80 (-72.13%)
Mutual labels:  scraper, twitter-api
fansly
Simply scrape / download all the media from a Fansly account
Stars: ✭ 351 (+22.3%)
Mutual labels:  scraper, scrape
twpy
High-level Twitter scraper for humans.
Stars: ✭ 58 (-79.79%)
Mutual labels:  scraper, twitter-api

stweet


A modern, fast Python library to scrape tweets and users from Twitter's unofficial API.

This tool helps you scrape tweets by search phrase, tweets by ID, and users by username. It uses the unofficial Twitter API, the same API the website itself uses.

Inspiration for the creation of the library

I used twint to scrape tweets, but it has many errors and does not work properly. The code was not simple to understand. All tasks share one config, so the user has to know the exact parameters. The last important thing is that the API can change at any time: Twitter owns the API, and changes depend on it alone. It is annoying when something stops working and users must report bugs as issues.

Main advantages of the library

  • Simple code – the code is not only mine; every user can contribute to the library
  • Domain objects and interfaces – the main parts of the functionality (e.g. making web requests) can be replaced; the library ships a basic solution, and if you want to extend it you can do so without problems or forks
  • 100% coverage with integration tests – the tests can detect API changes; they run every week, and when a task fails the source of the change can be found easily
  • Custom tweets and users output – output is part of the interface, so saving tweets and users in a custom format takes only a brief moment (see the Export format section below)

Installation

pip install -U stweet

Donate

If you want to sponsor me as thanks for the project, please send me some crypto 😁:

Coin      Wallet address
Bitcoin   3EajE9DbLvEmBHLRzjDfG86LyZB4jzsZyg
Ethereum  0xE43d8C2c7a9af286bc2fc0568e2812151AF9b1FD

Basic usage

To make a simple request, a scrape task must be prepared. The task is then processed by a runner.

import stweet as st


def try_search():
    # Search for tweets containing all the given words.
    search_tweets_task = st.SearchTweetsTask(all_words='#covid19')
    output_jl_tweets = st.JsonLineFileRawOutput('output_raw_search_tweets.jl')
    output_jl_users = st.JsonLineFileRawOutput('output_raw_search_users.jl')
    output_print = st.PrintRawOutput()
    # Each runner accepts a list of outputs, so results can be
    # printed and written to a file at the same time.
    st.TweetSearchRunner(search_tweets_task=search_tweets_task,
                         tweet_raw_data_outputs=[output_print, output_jl_tweets],
                         user_raw_data_outputs=[output_print, output_jl_users]).run()


def try_user_scrap():
    # Scrape user profiles by username.
    user_task = st.GetUsersTask(['iga_swiatek'])
    output_json = st.JsonLineFileRawOutput('output_raw_user.jl')
    output_print = st.PrintRawOutput()
    st.GetUsersRunner(get_user_task=user_task, raw_data_outputs=[output_print, output_json]).run()


def try_tweet_by_id_scrap():
    # Scrape a single tweet by its id.
    id_task = st.TweetsByIdTask('1447348840164564994')
    output_json = st.JsonLineFileRawOutput('output_raw_id.jl')
    output_print = st.PrintRawOutput()
    st.TweetsByIdRunner(tweets_by_id_task=id_task,
                        raw_data_outputs=[output_print, output_json]).run()


if __name__ == '__main__':
    try_search()
    try_user_scrap()
    try_tweet_by_id_scrap()

The example above shows that only a few lines of code are required to scrape tweets.

Export format

Stweet uses the API of the Twitter website, so there is no official documentation of the response format. Responses are saved raw, and the user must parse them on their own. A parser may be added in the future.
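
For a first look at the records, it is usually enough to load them and inspect their shape. Below is a minimal sketch, assuming the default JSON-lines output where every line holds one JSON object; the exact schema of each record depends on the current Twitter API and is not guaranteed by the library.

import json

# Load every JSON-lines record written by JsonLineFileRawOutput.
# Each record mirrors the raw unofficial API response, so inspect
# a sample before writing a full parser.
with open('output_raw_search_tweets.jl') as file:
    records = [json.loads(line) for line in file if line.strip()]

print(len(records))
if records:
    print(sorted(records[0].keys()))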

Scraped data can be exported in different ways by implementing the RawDataOutput abstract class. A list of these outputs can be passed to every runner, so it is possible to export in several ways at once.

Currently, stweet has implemented:

  • CollectorRawOutput – saves data in memory and returns it as a list of objects
  • JsonLineFileRawOutput – exports data as JSON lines
  • PrintEveryNRawOutput – prints every N-th item
  • PrintFirstInBatchRawOutput – prints the first item in each batch
  • PrintRawOutput – prints all items (not recommended for large scraping jobs)
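
Because the output is part of the interface, adding a custom export format is a matter of one small class. The sketch below is only an illustration under assumptions: it presumes that RawDataOutput exposes a single export_raw_data method receiving a batch of raw items, and that each item can serialize itself to a JSON line (as JsonLineFileRawOutput suggests); verify both against the stweet source.

import stweet as st


class AppendFileRawOutput(st.RawDataOutput):
    # Hypothetical custom output appending each raw item to one file.

    def __init__(self, file_name: str):
        self.file_name = file_name

    # Assumed method name and signature, modeled on stweet 2.x;
    # check the RawDataOutput source for the exact contract.
    def export_raw_data(self, raw_data_list):
        with open(self.file_name, 'a') as file:
            for raw_data in raw_data_list:
                # to_json_line() is assumed here; replace with any
                # custom serialization of the raw item.
                file.write(raw_data.to_json_line() + '\n')

Such an output can then be passed in the output lists of any runner, next to the built-in ones.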

Using a Tor proxy

The library is integrated with tor-python-easy. It allows using a Tor proxy with an exposed control port, so the IP address can be changed when needed.

If you want to use the Tor proxy client, you need to prepare a custom web client and use it in the runner.

You need to run a Tor proxy: you can run it on your local OS, or you can use this docker-compose.

The code snippet below shows how to use the proxy:

import stweet as st

if __name__ == '__main__':
    # Web client routed through a local Tor proxy; the control port
    # and password allow requesting a new IP when needed.
    web_client = st.DefaultTwitterWebClientProvider.get_web_client_preconfigured_for_tor_proxy(
        socks_proxy_url='socks5://localhost:9050',
        control_host='localhost',
        control_port=9051,
        control_password='test1234'
    )

    search_tweets_task = st.SearchTweetsTask(all_words='#covid19')
    output_jl_tweets = st.JsonLineFileRawOutput('output_raw_search_tweets.jl')
    output_jl_users = st.JsonLineFileRawOutput('output_raw_search_users.jl')
    output_print = st.PrintRawOutput()
    # Same search as in Basic usage, but routed through the Tor-backed client.
    st.TweetSearchRunner(search_tweets_task=search_tweets_task,
                         tweet_raw_data_outputs=[output_print, output_jl_tweets],
                         user_raw_data_outputs=[output_print, output_jl_users],
                         web_client=web_client).run()

Dividing scraping periods (recommended)

Twitter blocks deep pagination for guest clients; sometimes only about 3 pages of results can be fetched for a single query. To avoid this limitation, divide the scraping period into smaller parts, as in the sketch below.

In 2023 Twitter also started rejecting time ranges given as full timestamps in the API: only the YYYY-MM-DD format is accepted, so Arrow values may carry a date but no hour component.
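
A minimal sketch of such a split, assuming SearchTweetsTask accepts since and until Arrow arguments (these parameter names are an assumption; verify them against the current release):

import arrow
import stweet as st


def scrap_day_by_day(phrase, since, until):
    # One task per day keeps every query inside the pagination limit.
    day = since
    while day < until:
        task = st.SearchTweetsTask(
            all_words=phrase,
            since=day,                # assumed parameter name
            until=day.shift(days=1),  # assumed parameter name
        )
        date_label = day.format('YYYY-MM-DD')
        tweets_output = st.JsonLineFileRawOutput('tweets_{}.jl'.format(date_label))
        users_output = st.JsonLineFileRawOutput('users_{}.jl'.format(date_label))
        st.TweetSearchRunner(search_tweets_task=task,
                             tweet_raw_data_outputs=[tweets_output],
                             user_raw_data_outputs=[users_output]).run()
        day = day.shift(days=1)


if __name__ == '__main__':
    scrap_day_by_day('#covid19', arrow.get('2023-01-01'), arrow.get('2023-01-08'))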

Twint inspiration

A small part of the library uses code from twint. Twint was also the main inspiration for creating stweet.
