fake-name / Readablewebproxy

Licence: bsd-3-clause
Rewriting web proxy and archival tool. At this point, it just tries to download all the things.


Readable-Web Proxy

Reading long-form content on the internet is a shitty experience.
This is a web-proxy that tries to make it better.

This is a rewriting proxy: it proxies arbitrary web content and rewrites the remote content according to a set of rule files. The goal is to allow complete customization of any existing website, driven by predefined rules.

In practice, it's used to extract just the actual content body of a site and reproduce it in a clean layout. It also rewrites all links on the page to point to internal addresses, so following a link leads to the proxied version of the page rather than the original.
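
As a rough illustration of the link rewriting described above, here is a minimal sketch using only the standard library. The `/view?url=` route, the `PROXY_PREFIX` name, and both function names are assumptions made up for this example, not the project's actual internals, and the regex only handles double-quoted `href` attributes (real pages would need a proper HTML parser).

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical internal route; the real project may use a different scheme.
PROXY_PREFIX = "/view?url="

# Naive matcher for double-quoted href attributes only.
HREF_RE = re.compile(r'href="([^"]*)"')


def to_proxied(page_url, href, prefix=PROXY_PREFIX):
    """Resolve href against the page URL and wrap it in the internal address."""
    absolute = urljoin(page_url, href)
    # Leave non-HTTP targets (mailto:, javascript:, ...) untouched.
    if not absolute.startswith(("http://", "https://")):
        return href
    return prefix + quote(absolute, safe="")


def rewrite_links(page_url, html):
    """Point every matched href in the HTML at the proxied version."""
    return HREF_RE.sub(
        lambda m: 'href="%s"' % to_proxied(page_url, m.group(1)), html
    )
```

Resolving against the page URL first means relative links such as `../c.html` end up pointing at the correct remote page once proxied.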


While the above was the original scope, the project has mutated heavily. At this point, it includes a complete web spider and archives entire websites to local storage. Additionally, multiple versions of each page are kept, with an overall rolling refresh of the entire database at configurable intervals (configurable per domain or globally).
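
To make the per-domain versus global refresh interval concrete, here is a small sketch of how such a check could look. The names `GLOBAL_INTERVAL`, `PER_DOMAIN_INTERVAL`, `refresh_interval`, and `is_due`, as well as the interval values, are hypothetical and chosen only for illustration; the project's real scheduling logic is not shown in this README.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit

# Hypothetical global default refresh interval.
GLOBAL_INTERVAL = timedelta(days=30)

# Hypothetical per-domain overrides; anything not listed uses the default.
PER_DOMAIN_INTERVAL = {
    "example.com": timedelta(days=7),
}


def refresh_interval(url):
    """Return the per-domain interval if one is set, else the global one."""
    host = urlsplit(url).hostname or ""
    return PER_DOMAIN_INTERVAL.get(host, GLOBAL_INTERVAL)


def is_due(url, last_fetched, now):
    """True when the page is older than its refresh interval."""
    return now - last_fetched >= refresh_interval(url)
```

With these example values, a page on example.com fetched 11 days ago would be due for a refetch, while a page on any other domain would not be until 30 days have passed.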

There are also a lot of facilities responsible for feeding the releases/RSS views as part of wlnupdates.com.


Quick installation overview:

  • Install Redis
  • (optional) install InfluxDB
  • (optional) install Graphite
  • Install Postgresql >= 10.
  • Build the community extensions for Postgresql.
  • Create a database for the project.
  • In the project database, install the pg_trgm and citext extensions from the community extensions modules.
  • Copy settings.example.py to settings.py.
  • Fill in all settings in settings.py
  • Set up the virtualenv by running build-venv.sh
  • Activate the virtualenv: source flask/bin/activate
  • Bootstrap DB: alembic upgrade head
  • (on another machine/session) Run the local fetch RPC server (run_local.sh) from https://github.com/fake-name/AutoTriever
  • Run server: python3 run.py
  • If you want to run the spider, it has a LOT more complicated components:
    • Main scraper is started by python runScrape.py
    • Raw scraper is started by python runScrape.py raw
    • Scraper periodic scheduler is started by python runScrape.py scheduler
    • The scraper requires substantial RPC infrastructure. You will need:
      • A RabbitMQ instance with a public DNS address
      • A machine running saltstack + salt-master with a public DNS address. On the salt machine, run https://github.com/fake-name/AutoTriever/tree/master/marshaller/salt_scheduler.py
      • A variable number of RPC workers to execute fetch tasks. The AutoTriever project can be used to manage these.
      • A machine to run the RPC local demultiplexing agent (run_agent.sh). The RPC agent allows multiple projects to use the RPC system simultaneously. The RPC system basically allows executing either predefined jobs or arbitrary code on the worker swarm, which is fairly useful in general, so I've implemented it as a service that several of my projects use.
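
The pg_trgm and citext step in the list above comes down to running `CREATE EXTENSION` in the project database. A minimal sketch, assuming you have a DB-API cursor (for example from psycopg2, which is an assumption here and not something this README states) connected as a role allowed to create extensions; the function name `install_extensions` is made up for this example:

```python
# The two extensions named in the installation steps.
EXTENSIONS = ("pg_trgm", "citext")


def install_extensions(cursor, names=EXTENSIONS):
    """Issue CREATE EXTENSION for each name through a DB-API cursor."""
    for name in names:
        # Names are fixed identifiers from this module, not user input,
        # so plain string formatting is acceptable here.
        cursor.execute('CREATE EXTENSION IF NOT EXISTS "%s";' % name)
```

`IF NOT EXISTS` makes the call safe to repeat. Run it before `alembic upgrade head`, and remember to commit on the connection if your driver does not autocommit.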

Ubuntu dependencies

  • postgresql-common libpq-dev libenchant-dev
  • probably more I've forgotten