Top 9 webcrawling open source projects

Heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

✭ 2,104

java, javascript, HTML, Rich Text Format, FreeMarker, PostScript, warc, heritrix, webcrawling

url-frontier

API definition, resources and reference implementation of URL Frontiers

✭ 16

java, Dockerfile, grpc, webcrawling, web-crawlers, url-frontier

Stock-Fundamental-data-scraping-and-analysis

A project that builds a web crawler to collect stock fundamentals and review the stocks' performance in one pass.

✭ 40

Jupyter Notebook, automation, selenium, web-scraping, webcrawling, datacollection, stock-fundamentalplots

ARGUS

ARGUS is an easy-to-use web scraping tool built on the Scrapy Python framework that can crawl a broad range of websites. On each site, ARGUS can perform tasks such as scraping text or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9

✭ 68

python, Jupyter Notebook, Batchfile, scraping, crawling, scrapy, webscraping, scrapyd, webcrawling

newspaperjs

News extraction, scraping, and article parsing.

✭ 59

HTML, javascript, nodejs, crawler, scraper, news, news-aggregator, webscraping, webcrawling

gotor

This program provides efficient web scraping services for Tor and non-Tor sites. It offers both a CLI and a REST API.

✭ 97