
schedutron / chirps

License: MIT
Twitter bot powering @arichduvet

Programming Languages

Python
TSQL

Projects that are alternatives of or similar to chirps

Socialreaper
Social media scraping / data collection library for Facebook, Twitter, Reddit, YouTube, Pinterest, and Tumblr APIs
Stars: ✭ 338 (+865.71%)
Mutual labels:  twitter, scraping
Social Media Profiles Regexs
📇 Extract social media profiles and more with regular expressions
Stars: ✭ 324 (+825.71%)
Mutual labels:  twitter, scraping
schedule-tweet
Schedules tweets using TweetDeck
Stars: ✭ 14 (-60%)
Mutual labels:  twitter, scraping
Reaper
Social media scraping / data collection tool for the Facebook, Twitter, Reddit, YouTube, Pinterest, and Tumblr APIs
Stars: ✭ 240 (+585.71%)
Mutual labels:  twitter, scraping
feedsearch-crawler
Crawl sites for RSS, Atom, and JSON feeds.
Stars: ✭ 23 (-34.29%)
Mutual labels:  scraping
chesf
CHeSF is the Chrome Headless Scraping Framework, very early-stage (alpha) code for scraping JavaScript-intensive web pages
Stars: ✭ 18 (-48.57%)
Mutual labels:  scraping
rubium
Rubium is a lightweight alternative to Selenium/Capybara/Watir if you need to perform some operations (like web scraping) using Headless Chromium and Ruby
Stars: ✭ 65 (+85.71%)
Mutual labels:  scraping
ogpParser
Open Graph Protocol Parser for Node.js
Stars: ✭ 43 (+22.86%)
Mutual labels:  scraping
Scraper-Projects
🕸 List of mini projects that involve web scraping 🕸
Stars: ✭ 25 (-28.57%)
Mutual labels:  scraping
kuwala
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, and Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data into data sc…
Stars: ✭ 474 (+1254.29%)
Mutual labels:  scraping
ferenda
Transform unstructured document collections to structured Linked Data
Stars: ✭ 22 (-37.14%)
Mutual labels:  scraping
document-dl
Command line program to download documents from web portals.
Stars: ✭ 14 (-60%)
Mutual labels:  scraping
subscene scraper
Library to download subtitles from subscene.com
Stars: ✭ 14 (-60%)
Mutual labels:  scraping
go-scrapy
Web crawling and scraping framework for Golang
Stars: ✭ 17 (-51.43%)
Mutual labels:  scraping
web-clipper
Easily download the main content of a web page in html, markdown, and/or epub format from command line.
Stars: ✭ 15 (-57.14%)
Mutual labels:  scraping
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+48.57%)
Mutual labels:  scraping
proxi
Proxy pool that finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.
Stars: ✭ 32 (-8.57%)
Mutual labels:  scraping
AngleParse
HTML parsing and processing tool for PowerShell.
Stars: ✭ 35 (+0%)
Mutual labels:  scraping
internet-affordability
🌍 Dataset that shows the Internet affordability by country (a shocking reality!)
Stars: ✭ 13 (-62.86%)
Mutual labels:  scraping
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (+8.57%)
Mutual labels:  scraping

chirps

Twitter bot powering www.twitter.com/arichduvet

Uses @sixohsix's Python Twitter Tools library for posting and other actions. Scraping is done with the help of Kenneth Reitz's requests module and some rudimentary regular expressions.
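
For a flavor of that approach, here is a minimal sketch of a requests-plus-regex scraper (the URL and pattern are hypothetical, not the project's actual code):

import re
import requests

def scrape_headlines(url="https://example.com/news"):
    """Fetch a page and yield headline strings found via a crude regex."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Rudimentary pattern: grab the text inside <h2>...</h2> tags.
    for match in re.finditer(r"<h2[^>]*>(.*?)</h2>", response.text, re.DOTALL):
        yield match.group(1).strip()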

I presented a poster on this project at PyCon US 2019; my friend Parth presented it at EuroPython 2019, as I could not attend due to an ongoing internship:

Chirps: A Twitter Bot Framework Written in Python


Prerequisites

This bot framework is built in Python, so make sure Python 3.x is installed on your system. Once Python is installed, create a virtual environment in the root directory of this repo using the following command:

$ python3 -m venv bot

Then activate this virtual environment using:

$ source bot/bin/activate

(On Windows, run bot\Scripts\activate instead; see the venv documentation for details.)

Now install the dependencies using the following command:

$ pip install -r requirements.txt

You will also need a PostgreSQL database; a free hosted option is ElephantSQL. Once you've set up an empty database, save its URL (it will be needed while running init_script below).
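
The URL will look something like this (all values hypothetical):

postgres://username:password@hostname:5432/dbname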

For bot deployment, this framework uses Heroku, so you'll also need a Heroku account.

Setting It Up

After creating a new app on the Heroku dashboard, install the Heroku CLI on your machine. Then use the following commands to add Heroku as a remote for this repository:

$ heroku login
<enter your Heroku credentials>
...

$ heroku git:remote -a <your Heroku app name>

(You can set up a GitHub-based pipeline on Heroku instead, but that is beyond the scope of this README.)

Now create a new branch named "deploy" and check it out:

$ git checkout -b deploy

Remove the chirps/credentials.py and chirps/screen_name.py entries from the .gitignore file. The file should now look like:

[.gitignore]
.DS_Store
.env
bot/
.vscode/
chirps/__pycache__/

Next, run the bot initialization script and enter the required information very carefully:

$ python -m chirps.init_script <your database URL>

The bot setup is essentially complete once this script runs successfully. Now you just need to "tune" the bot's options in the Heroku Procfile. Create a file named "Procfile" in the root of this repository and set the configuration options to suit your needs (the bracketed label below marks the file and is not part of its contents):

[Procfile]
worker: python3 -m chirps.main --rate=300 --fav --retweet --follow --follow_limit=6000 --scrape scrape_thenewstack get_tech_news

For example, the above Procfile says: tweet every 5 minutes (300 seconds); like (favorite), retweet, and follow tweets that contain the keywords specified in init_script.py; keep following people who tweet about those keywords until your following count reaches 6000; and use the scraper functions scrape_thenewstack() and get_tech_news() to aggregate content for the bot to tweet. You can build your own tweeter functions (usually scrapers) in scrapers.py; they should return or yield strings, which your bot will tweet. Other parameters can be tuned to your requirements.
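
As a hypothetical illustration (the function name, endpoint, and response shape below are made up; the framework only requires that the function return or yield strings):

[scrapers.py] (hypothetical addition)
import requests

def get_quotes():
    """Yield tweet-ready strings from a (made-up) quotes API."""
    data = requests.get("https://api.example.com/quotes", timeout=10).json()
    for item in data:
        tweet = f'"{item["text"]}" - {item["author"]}'
        if len(tweet) <= 280:  # stay within Twitter's character limit
            yield tweet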

Finally, deploy your bot using the following command:

$ git push heroku deploy:master

Once the deployment completes, "switch on" the bot as follows:

$ heroku ps:scale worker=1

Now your bot should be up and running!
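
To confirm the worker started cleanly, you can tail the Heroku logs:

$ heroku logs --tail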


If you want to dig deeper into the codebase and learn more about the implementation of the "generator-of-generators" function in chirps/functions.py, see my tutorial on DigitalOcean, which explains that part in detail.
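
In rough terms, the pattern round-robins over several tweet-yielding generators so that content sources take turns. A simplified sketch of the idea (not the project's actual code):

def round_robin(*tweeter_funcs):
    """Take one tweet from each tweet-yielding generator in turn,
    dropping a source once it is exhausted."""
    generators = [func() for func in tweeter_funcs]
    while generators:
        for gen in list(generators):  # iterate over a copy so removal is safe
            try:
                yield next(gen)
            except StopIteration:
                generators.remove(gen)  # this source ran dry

# e.g. round_robin(scrape_thenewstack, get_tech_news) would interleave
# items from both scrapers.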
