grafted-in / web-scraping-engine

Licence: BSD-3-Clause
A simple web scraping engine supporting concurrent and anonymous scraping

Programming languages: Haskell, Nix, Shell

Web Scraping Engine

Usage

To run:

stack exec example -- --cache-dir cache -a user-agents.txt -o output.csv

During testing/development, you can run the scraper from within GHCi:

  • cd example
  • stack ghci
  • mainTest "--cache-dir cache --cache-only -a user-agents.txt -o output.csv"

To run the scraper with anonymization:

  1. cd example
  2. bash build-proxies.sh > torrc-file
  3. tor -f torrc-file & (wait until logs report success)
  4. stack exec example -- --cache-dir cache -a user-agents.txt --torrc torrc-file -o outdata.csv -m 8111 +RTS -N15
     where
       • 8111 is the port of an EKG monitor on localhost
       • -N15 is how many cores to use
  5. After a long time you will need to kill the process manually.
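
Step 3 asks you to wait until the Tor logs report success. A minimal sketch of automating that wait is shown below. It is not part of this project: the function name wait_for_tor and the log file name tor.log are hypothetical, and it assumes that Tor's output has been redirected to a file and that Tor announces readiness with a "Bootstrapped 100%" notice line. If your torrc sends logs elsewhere, adjust accordingly.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with web-scraping-engine.
# Assumption: Tor's log output is being written to LOGFILE and contains
# a "Bootstrapped 100%" line once Tor is ready to accept connections.

# wait_for_tor LOGFILE [TIMEOUT_SECONDS]
# Polls LOGFILE once per second.
# Returns 0 as soon as the readiness line appears, 1 if the timeout expires.
wait_for_tor() {
  local logfile=$1
  local timeout=${2:-120}
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # A missing log file is treated the same as "not ready yet".
    if grep -q 'Bootstrapped 100%' "$logfile" 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example wiring for steps 3 and 4 (left commented out on purpose):
# tor -f torrc-file > tor.log 2>&1 &
# wait_for_tor tor.log 300 && \
#   stack exec example -- --cache-dir cache -a user-agents.txt \
#     --torrc torrc-file -o outdata.csv -m 8111 +RTS -N15
```

Polling a log file keeps the sketch independent of Tor's control port, at the cost of relying on the exact wording of the log message.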

Development

Develop with one of:

  • stack ghci
  • nix-shell --run 'cabal repl'

Build with one of:

  • stack build
  • nix-shell --run 'cabal build'
  • nix-build