
tokahuke / lopez

Licence: AGPLv3 (Affero GPL)
Crawling and scraping the Web for fun and profit

Programming Languages: Rust, PLpgSQL, Shell

Projects that are alternatives of or similar to lopez

papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-25%)
Mutual labels:  scraper, web-scraping
Serp
Google Search SERP Scraper
Stars: ✭ 40 (+100%)
Mutual labels:  scraper, seo
SearchScraperAPI
Aiohttp web server API that scrapes Google and returns the scrape results as the response. Supports proxies, multiple geos, and setting the number of results.
Stars: ✭ 31 (+55%)
Mutual labels:  scraper, seo
site-audit-seo
Web service and CLI tool for SEO site audit: crawls a site, runs Lighthouse on all pages, and shows public reports in the browser. Also outputs to console, JSON, CSV, XLSX, and Google Drive.
Stars: ✭ 91 (+355%)
Mutual labels:  scraper, seo
Zillow
Zillow Scraper for Python using Selenium
Stars: ✭ 141 (+605%)
Mutual labels:  scraper, web-scraping
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recent ads posted for the requested product and dumps them into NoSQL MongoDB.
Stars: ✭ 15 (-25%)
Mutual labels:  scraper, web-scraping
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+3180%)
Mutual labels:  scraper, web-scraping
saveddit
Bulk Downloader for Reddit
Stars: ✭ 130 (+550%)
Mutual labels:  scraper, web-scraping
Rod
A Devtools driver for web automation and scraping
Stars: ✭ 1,392 (+6860%)
Mutual labels:  scraper, web-scraping
Sillynium
Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements
Stars: ✭ 100 (+400%)
Mutual labels:  scraper, web-scraping
Linkedin-Client
Web scraper for grabbing data from LinkedIn profiles or company pages (personal project)
Stars: ✭ 42 (+110%)
Mutual labels:  scraper, web-scraping
Serpscrap
SEO Python scraper to extract data from major search engine result pages. Extracts data like URL, title, snippet, rich snippet and type from search results for given keywords. Detects ads or takes automated screenshots. You can also fetch the text content of URLs provided in search results or your own. It's useful for SEO and business-related research tasks.
Stars: ✭ 153 (+665%)
Mutual labels:  scraper, seo
rymscraper
Python API to extract data from rateyourmusic.com.
Stars: ✭ 63 (+215%)
Mutual labels:  scraper, web-scraping
sp-subway-scraper
🚆This web scraper builds a dataset for São Paulo subway operation status
Stars: ✭ 24 (+20%)
Mutual labels:  scraper, web-scraping
TikTokDownloader PyWebIO
🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance, asynchronous Douyin/TikTok data-scraping tool that supports API calls, online batch parsing, and downloading.
Stars: ✭ 919 (+4495%)
Mutual labels:  scraper, web-scraping
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+20285%)
Mutual labels:  scraper, web-scraping
BookingScraper
🌎 🏨 Scrape Booking.com 🏨 🌎
Stars: ✭ 68 (+240%)
Mutual labels:  scraper, web-scraping
Hockey Scraper
Python Package for scraping NHL Play-by-Play and Shift data
Stars: ✭ 93 (+365%)
Mutual labels:  scraper, web-scraping
Phpscraper
PHP Scraper - a highly opinionated web interface for PHP
Stars: ✭ 148 (+640%)
Mutual labels:  scraper, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+1095%)
Mutual labels:  scraper, web-scraping

Welcome to "the Lopez"


Crawling and scraping the Web for fun and profit.

A word of caution

There is a very tenuous line between a crawl and a DoS attack. Please be mindful of the crawling speed you inflict on websites! For your convenience, crawling is limited to 2.5 hits per second per origin by default, which is a sensible limit for most sites. You can override this value using the set max_hits_per_sec directive in your configuration, but make sure that you will not overload the server (or that you have permission to do so). Remember: some people's livelihoods depend on these websites, and not every site has good DoS mitigation.
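
For example, a directives file could raise the limit like so (the directive name comes from this README, but the exact assignment syntax is an assumption; check the directives documentation for the real grammar):

// Assumed syntax: only the directive name is documented in this README.
set max_hits_per_sec = 4;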

Also, some people may get angry that you are scraping their website and may start annoying you because of that. If they are crazy enough or money is involved, they may even try to prosecute you. And the judicial system is just crazy nowadays, so who knows?

In either case, I have nothing to do with that. Use this program at your own risk.

Installing the damn thing

If you are feeling particularly lazy today, just copy and paste the following into your favorite command line (Unix-like only):

# Download the installer to a temporary location:
curl -L "https://github.com/tokahuke/lopez/releases/latest/download/entalator" \
    > /tmp/entalator
# Make it executable, run it, and keep a copy around for uninstalling later:
chmod +x /tmp/entalator
sudo /tmp/entalator &&
    sudo cp /tmp/entalator /usr/share/lopez

You will get the latest Lopez experience: lopez installed for all users on your computer, with full access to lopez-std out of the box. If you ever wish to get rid of the installation, just use the following one-liner:

sudo /usr/share/lopez/entalator --uninstall

but remember there is no turning back.

This method should work on any Unix-based system; there is an open issue for porting it to Windows. With a bit more setup, however, you can run lopez on most architectures: compiling from the source code in the repository using Cargo (the Rust package manager) should be quite simple.
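
In that case, something along these lines should do the trick (a sketch, assuming you have git and a Rust toolchain installed; see the note on the nightly requirement below):

# Clone the repository and build an optimized binary with Cargo:
git clone https://github.com/tokahuke/lopez.git
cd lopez
cargo +nightly build --release
# The binary ends up in target/release/.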

Running the damn thing

If you installed from the entalator, you will have the binary lopez available globally on your machine. To get started, run

lopez --help

to get a friendly help dialog. This will list your options while running Lopez. To really get started running lopez, see our Quickstart guide.

Lopez Crawl Directives

You will need a Crawl Directives file to run the crawl. This file describes what you want to scrape from web pages, as well as where and how you want to crawl the Web. For more information on the syntax and semantics, see this link. Either way, here is a nice example (yes, syntax highlighting is supported for VSCode!):

[Sample code example for Lopez Crawl Directives]
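
Since the example above is only an image, here is a rough textual sketch of what such a file might look like. Only the set max_hits_per_sec directive appears elsewhere in this README; the allow and select forms are guesses at the grammar, so defer to the documentation linked above:

// Hypothetical sketch of a Crawl Directives file; see the linked docs
// for the real grammar.
set max_hits_per_sec = 2.5;

// Assumed directive: restrict the crawl to a single origin.
allow "https://example.com";

// Assumed selector syntax: extract the text of every <h1> on each page.
select h1 {
    title: text;
}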

Backends

Lopez supports the idea of backends, which are where the data comes from and where it goes to. The implementation is completely generic, so you may write your own if you so wish. For now, lopez ships with a nice PostgreSQL backend for your convenience. Support for other popular databases (and unpopular ones as well) is greatly appreciated.

For more information on backends, see the documentation for the lib_lopez::backend module.
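
To give a flavor of what writing your own entails, here is a purely hypothetical sketch of the shape such an abstraction could take; the real trait lives in lib_lopez::backend, and none of the names or signatures below are taken from it:

// Hypothetical sketch only: the actual trait is defined in
// `lib_lopez::backend` and will differ from this.
use std::error::Error;

pub trait Backend {
    /// Persist one scraped item found at a given page URL.
    fn store(&mut self, page_url: &str, item: &str) -> Result<(), Box<dyn Error>>;

    /// Report whether a URL was already visited, so the crawl can skip it.
    fn was_visited(&self, page_url: &str) -> Result<bool, Box<dyn Error>>;
}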

Minimum Rust Version

For now, Lopez only compiles on Rust Nightly. Unfortunately, we are waiting on the following features to be stabilized:

Good news: stabilization is due in a few days!
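
In the meantime, if you are compiling from source, you can pin a nightly toolchain with rustup:

# Install a nightly toolchain and use it for this checkout only:
rustup toolchain install nightly
rustup override set nightly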

Features

Let's brag a little!

  • The beast is fast compared with similar programs I have written in the past using the Python ecosystem (BeautifulSoup, asyncio, etc.). It's in Rust; what were you expecting?

  • It uses very little memory. Crawling, if not done carefully, can gobble up your memory and still ask for more; by keeping its state in a database (PostgreSQL), Lopez averts all that evil!

  • It is polite. Yes, it obeys robots.txt and no, you can't turn that off.

Limitations and future plans

  • Lopez is still limited to a single machine. No distributed programming yet. However, what are you scraping that requires so much firepower?

  • No JavaScript execution. This is pretty standard, since JavaScript is heavy to run. If it is ever to be supported, it should be opt-in.

  • This crate needs more docs and more support for other backends. Sorry, I have a full-time job.

  • See the open issues for more scary (and interesting) stuff.

Licensing

All the work in this repository is licensed under the Affero GPLv3 (aka AGPL) license. See the license file for more detailed information.
