scrapinghub / arche

Licence: MIT license

Analyze scraped data

Programming Languages

python


HTML


Projects that are alternatives of or similar to arche

Goribot

[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.

Stars: ✭ 190 (+287.76%)

Mutual labels: scrapy

Sourcecodeofbook

Companion source code for the book 《Python爬虫开发从入门到实战》 (Python Crawler Development: From Beginner to Practice).

Stars: ✭ 226 (+361.22%)

Mutual labels: scrapy

Awesome crawl

Crawlers for Tencent News, Zhihu topics, Weibo followers, Tumblr, Douyu bullet comments, and Meizitu, plus distributed design and more

Stars: ✭ 246 (+402.04%)

Mutual labels: scrapy

Github Spider

Crawler for analyzing GitHub repositories and users

Stars: ✭ 190 (+287.76%)

Mutual labels: scrapy

Ruiji.net

crawler framework, distributed crawler extractor

Stars: ✭ 220 (+348.98%)

Mutual labels: scrapy

Scrapy Splash

Scrapy+Splash for JavaScript integration

Stars: ✭ 2,666 (+5340.82%)

Mutual labels: scrapy

Scrapydweb

Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.

Stars: ✭ 2,385 (+4767.35%)

Mutual labels: scrapy

pagser

Pagser is a simple, extensible, configurable library for golang crawlers that parses and deserializes HTML pages into structs, based on goquery and struct tags

Stars: ✭ 82 (+67.35%)

Mutual labels: scrapy

City Scrapers

Scrape, standardize and share public meetings from local government websites

Stars: ✭ 220 (+348.98%)

Mutual labels: scrapy

Spider job

Crawler for job-listing site data

Stars: ✭ 234 (+377.55%)

Mutual labels: scrapy

Py Elasticsearch Django

A ten-million-scale search engine developed in Python

Stars: ✭ 207 (+322.45%)

Mutual labels: scrapy

Stealer

Video watermark remover for Douyin, Kuaishou, Huoshan, and Pipixia

Stars: ✭ 217 (+342.86%)

Mutual labels: scrapy

Filesensor

Crawler-based dynamic detection tool for sensitive files

Stars: ✭ 227 (+363.27%)

Mutual labels: scrapy

News spider

News scraping (WeChat, Weibo, Toutiao...)

Stars: ✭ 190 (+287.76%)

Mutual labels: scrapy

estate-crawler

Scraping the real estate agencies for up-to-date house listings as soon as they arrive!

Stars: ✭ 20 (-59.18%)

Mutual labels: scrapy

Livetv mining

Data collection from live-streaming websites

Stars: ✭ 188 (+283.67%)

Mutual labels: scrapy

Spiderkeeper

admin ui for scrapy/open source scrapinghub

Stars: ✭ 2,562 (+5128.57%)

Mutual labels: scrapy

lgcrawl

python+scrapy+splash crawler for all job listings on Lagou

Stars: ✭ 22 (-55.1%)

Mutual labels: scrapy

domains

World’s single largest Internet domains dataset

Stars: ✭ 461 (+840.82%)

Mutual labels: scrapy

Ecommercecrawlers

Gitee repository: AJay13/ECommerceCrawlers. GitHub repository: DropsDevopsOrg/ECommerceCrawlers. Project showcase: http://wechat.doonsec.com

Stars: ✭ 3,073 (+6171.43%)

Mutual labels: scrapy


Arche

pip install arche

Arche (pronounced Arkey) helps verify scraped data using a set of defined rules, for example:

Validation with JSON schema
Coverage (items, fields, categorical data, including booleans and enums)
Duplicates
Garbage symbols
Comparison of two jobs

We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.
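To make two of the rules above concrete, here is a minimal standard-library sketch of a field coverage check and a duplicates check. It is not Arche's API or implementation; the sample items, the field names, and the helpers `field_coverage` and `find_duplicates` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical sample of scraped items; field names are illustrative only.
items = [
    {"name": "Widget", "price": 9.99, "url": "https://example.com/a"},
    {"name": "Gadget", "price": None, "url": "https://example.com/b"},
    {"name": "Widget", "price": 9.99, "url": "https://example.com/a"},
    {"name": "Gizmo", "url": "https://example.com/c"},
]


def field_coverage(items):
    """Return, per field, the share of items where it is present and not None."""
    if not items:
        return {}
    fields = {field for item in items for field in item}
    return {
        field: sum(item.get(field) is not None for item in items) / len(items)
        for field in sorted(fields)
    }


def find_duplicates(items, key_fields):
    """Return key tuples that occur more than once, mapped to their counts."""
    keys = Counter(tuple(item.get(f) for f in key_fields) for item in items)
    return {key: n for key, n in keys.items() if n > 1}


coverage = field_coverage(items)
duplicates = find_duplicates(items, ["name", "url"])
print(coverage)
print(duplicates)
```

A low coverage value for a field, or any entry in the duplicates result, would be a signal to inspect the spider that produced the data.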

Installation

Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.

For JupyterLab, you will need to install the plotly extensions properly.

Then run pip install arche

Why

To check the quality of scraped data continuously. For example, after scraping a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up Spidermon.
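As a rough illustration of what such a schema might express, the sketch below defines a small JSON-Schema-style dictionary and checks items against two of its keywords, `required` and `type`. It is an assumption-laden toy, not Arche's or Spidermon's validator and not a complete JSON Schema implementation; the schema contents and the `check_item` helper are hypothetical.

```python
# Maps a few JSON Schema type names to Python types (subset, for illustration).
# Note: Python treats bool as a subclass of int, so this sketch would accept
# True as a "number", which a real JSON Schema validator would not.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool}

# Hypothetical schema for a scraped product item.
schema = {
    "type": "object",
    "required": ["name", "url"],
    "properties": {
        "name": {"type": "string"},
        "url": {"type": "string"},
        "price": {"type": "number"},
    },
}


def check_item(item, schema):
    """Return a list of error messages for one item; empty means it passed."""
    errors = []
    # Every field listed under "required" must be present.
    for field in schema.get("required", []):
        if field not in item:
            errors.append(f"missing required field: {field}")
    # Present fields must match the declared type, when we know that type.
    for field, rules in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(rules.get("type"))
        if field in item and expected and not isinstance(item[field], expected):
            errors.append(f"{field} is not of type {rules['type']}")
    return errors


good = check_item(
    {"name": "Widget", "url": "https://example.com/a", "price": 9.99}, schema
)
bad = check_item({"name": "Gadget", "price": "cheap"}, schema)
print(good)
print(bad)
```

In practice a full JSON Schema validator would replace `check_item`; the point here is only that a schema turns expectations about the data into checks that can run on every scrape.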

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.

