
tokahuke / lopez

Licence: AGPLv3 (Affero GPL)
Crawling and scraping the Web for fun and profit

Programming Languages: Rust, PLpgSQL, Shell

Projects that are alternatives of or similar to lopez

papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-25%)
Mutual labels:  scraper, web-scraping
Serp
Google Search SERP Scraper
Stars: ✭ 40 (+100%)
Mutual labels:  scraper, seo
SearchScraperAPI
Aiohttp web server API that scrapes Google and returns the scrape results as the response. Supports proxies, multiple geos, and setting the number of results.
Stars: ✭ 31 (+55%)
Mutual labels:  scraper, seo
site-audit-seo
Web service and CLI tool for SEO site audit: crawls a site, runs Lighthouse on all pages, and shows public reports in the browser. Also outputs to console, JSON, CSV, XLSX, and Google Drive.
Stars: ✭ 91 (+355%)
Mutual labels:  scraper, seo
Zillow
Zillow Scraper for Python using Selenium
Stars: ✭ 141 (+605%)
Mutual labels:  scraper, web-scraping
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recent ads posted for the requested product and dumps them into NoSQL MongoDB.
Stars: ✭ 15 (-25%)
Mutual labels:  scraper, web-scraping
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+3180%)
Mutual labels:  scraper, web-scraping
saveddit
Bulk Downloader for Reddit
Stars: ✭ 130 (+550%)
Mutual labels:  scraper, web-scraping
Rod
A Devtools driver for web automation and scraping
Stars: ✭ 1,392 (+6860%)
Mutual labels:  scraper, web-scraping
Sillynium
Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements
Stars: ✭ 100 (+400%)
Mutual labels:  scraper, web-scraping
Linkedin-Client
Web scraper for grabbing data from LinkedIn profiles or company pages (personal project)
Stars: ✭ 42 (+110%)
Mutual labels:  scraper, web-scraping
Serpscrap
SEO Python scraper to extract data from major search engine result pages. Extracts data like URL, title, snippet, rich snippet and type from search results for given keywords. Detects ads or takes automated screenshots. You can also fetch the text content of URLs provided in search results or your own. It's useful for SEO and business-related research tasks.
Stars: ✭ 153 (+665%)
Mutual labels:  scraper, seo
rymscraper
Python API to extract data from rateyourmusic.com.
Stars: ✭ 63 (+215%)
Mutual labels:  scraper, web-scraping
sp-subway-scraper
🚆This web scraper builds a dataset for São Paulo subway operation status
Stars: ✭ 24 (+20%)
Mutual labels:  scraper, web-scraping
TikTokDownloader PyWebIO
🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance, asynchronous Douyin/TikTok data-scraping tool that supports API calls, online batch parsing, and downloading.
Stars: ✭ 919 (+4495%)
Mutual labels:  scraper, web-scraping
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+20285%)
Mutual labels:  scraper, web-scraping
BookingScraper
🌎 🏨 Scrape Booking.com 🏨 🌎
Stars: ✭ 68 (+240%)
Mutual labels:  scraper, web-scraping
Hockey Scraper
Python Package for scraping NHL Play-by-Play and Shift data
Stars: ✭ 93 (+365%)
Mutual labels:  scraper, web-scraping
Phpscraper
PHP Scraper - a highly opinionated web interface for PHP
Stars: ✭ 148 (+640%)
Mutual labels:  scraper, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+1095%)
Mutual labels:  scraper, web-scraping

Welcome to "the Lopez"


Crawling and scraping the Web for fun and profit.

A word of caution

There is a very tenuous line between a crawl and a DoS attack. Please be mindful of the crawling speed you inflict on websites! For your convenience, crawling is limited to 2.5 hits per second per origin by default, which is a sensible limit for most sites. You can override this value using the set max_hits_per_sec directive in your configuration, but make sure that you will not overload the server (or that you have permission to do so). Remember: some people's livelihoods depend on these websites, and not every site has good DoS mitigation.
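
For example, a directives file could raise the limit like so (the directive name comes from this README, but the exact assignment syntax is an assumption; check the directives documentation for the real grammar):

// Assumed syntax: only the directive name is documented in this README.
set max_hits_per_sec = 4;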

Also, some people may get angry that you are scraping their website and may start annoying you because of that. If they are crazy enough or money is involved, they may even try to prosecute you. And the judicial system is just crazy nowadays, so who knows?

In either case, I have nothing to do with that. Use this program at your own risk.

Installing the damn thing

If you are feeling particularly lazy today, just copy and paste the following into your favorite command line (Unix-like only):

# Download the installer to a temporary location:
curl -L "https://github.com/tokahuke/lopez/releases/latest/download/entalator" \
    > /tmp/entalator
# Make it executable, run it, and keep a copy around for uninstalling later:
chmod +x /tmp/entalator
sudo /tmp/entalator &&
    sudo cp /tmp/entalator /usr/share/lopez

You will get the latest Lopez experience: lopez installed for all users on your computer, with full access to lopez-std out of the box. If you ever wish to get rid of the installation, just use the following one-liner:

sudo /usr/share/lopez/entalator --uninstall

but remember there is no turning back.

This method should work on any Unix-based system; there is an open issue for porting it to Windows. With a bit more setup, however, you can run lopez on most architectures: compiling from the source code in the repository using Cargo (the Rust package manager) should be quite simple.
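
In that case, something along these lines should do the trick (a sketch, assuming you have git and a Rust toolchain installed; see the note on the nightly requirement below):

# Clone the repository and build an optimized binary with Cargo:
git clone https://github.com/tokahuke/lopez.git
cd lopez
cargo +nightly build --release
# The binary ends up in target/release/.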

Running the damn thing

If you installed from the entalator, you will have the binary lopez available globally on your machine. To get started, run

lopez --help

to get a friendly help dialog. This will list your options while running Lopez. To really get started running lopez, see our Quickstart guide.

Lopez Crawl Directives

You will need a Crawl Directives file to run the crawl. This file describes what you want to scrape from web pages, as well as where and how you want to crawl the Web. For more information on the syntax and semantics, see this link. Either way, here is a nice example (yes, syntax highlighting is supported for VSCode!):

[Sample code example for Lopez Crawl Directives]
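
Since the example above is only an image, here is a rough textual sketch of what such a file might look like. Only the set max_hits_per_sec directive appears elsewhere in this README; the allow and select forms are guesses at the grammar, so defer to the documentation linked above:

// Hypothetical sketch of a Crawl Directives file; see the linked docs
// for the real grammar.
set max_hits_per_sec = 2.5;

// Assumed directive: restrict the crawl to a single origin.
allow "https://example.com";

// Assumed selector syntax: extract the text of every <h1> on each page.
select h1 {
    title: text;
}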

Backends

Lopez supports the idea of backends, which are where the data comes from and where it goes to. The implementation is completely generic, so you may write your own if you so wish. For now, lopez ships with a nice PostgreSQL backend for your convenience. Support for other popular databases (and unpopular ones as well) is greatly appreciated.

For more information on backends, see the documentation for the lib_lopez::backend module.
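
To give a flavor of what writing your own entails, here is a purely hypothetical sketch of the shape such an abstraction could take; the real trait lives in lib_lopez::backend, and none of the names or signatures below are taken from it:

// Hypothetical sketch only: the actual trait is defined in
// `lib_lopez::backend` and will differ from this.
use std::error::Error;

pub trait Backend {
    /// Persist one scraped item found at a given page URL.
    fn store(&mut self, page_url: &str, item: &str) -> Result<(), Box<dyn Error>>;

    /// Report whether a URL was already visited, so the crawl can skip it.
    fn was_visited(&self, page_url: &str) -> Result<bool, Box<dyn Error>>;
}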

Minimum Rust Version

For now, Lopez only compiles on Rust Nightly. Unfortunately, we are waiting on the following features to be stabilized:

Good news: stabilization is due in a few days!
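
In the meantime, if you are compiling from source, you can pin a nightly toolchain with rustup:

# Install a nightly toolchain and use it for this checkout only:
rustup toolchain install nightly
rustup override set nightly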

Features

Let's brag a little!

  • The beast is fast compared with similar programs I have written in the past using the Python ecosystem (BeautifulSoup, asyncio, etc.). It's in Rust; what were you expecting?

  • It uses very little memory. Crawling, if not done carefully, can gobble up your memory and still ask for more; by keeping its state in a database (PostgreSQL), Lopez averts all that evil!

  • It is polite. Yes, it obeys robots.txt and no, you can't turn that off.

Limitations and future plans

  • Lopez is still limited to a single machine. No distributed programming yet. However, what are you scraping that requires so much firepower?

  • No JavaScript execution. This is pretty standard, since JavaScript is heavy to run. If it is ever to be supported, it should be opt-in.

  • This crate needs more docs and more support for other backends. Sorry, I have a full-time job.

  • See the open issues for more scary (and interesting) stuff.

Licensing

All the work in this repository is licensed under the Affero GPLv3 (aka AGPL) license. See the license file for more detailed information.
