scrapinghub / arche

Licence: MIT license
Analyze scraped data

Projects that are alternatives of or similar to arche

Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (+287.76%)
Mutual labels:  scrapy
Sourcecodeofbook
Companion source code for the book 《Python爬虫开发 从入门到实战》 (Python Crawler Development: From Beginner to Practice).
Stars: ✭ 226 (+361.22%)
Mutual labels:  scrapy
Awesome crawl
Crawlers for Tencent News, Zhihu topics, Weibo followers, Tumblr, Douyu bullet comments, and Meizitu, plus distributed design and more
Stars: ✭ 246 (+402.04%)
Mutual labels:  scrapy
Github Spider
Crawler for analyzing GitHub repositories and users
Stars: ✭ 190 (+287.76%)
Mutual labels:  scrapy
Ruiji.net
crawler framework, distributed crawler extractor
Stars: ✭ 220 (+348.98%)
Mutual labels:  scrapy
Scrapy Splash
Scrapy+Splash for JavaScript integration
Stars: ✭ 2,666 (+5340.82%)
Mutual labels:  scrapy
Scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
Stars: ✭ 2,385 (+4767.35%)
Mutual labels:  scrapy
pagser
Pagser is a simple, extensible, configurable library that parses and deserializes HTML pages into structs, based on goquery and struct tags, for Golang crawlers
Stars: ✭ 82 (+67.35%)
Mutual labels:  scrapy
City Scrapers
Scrape, standardize and share public meetings from local government websites
Stars: ✭ 220 (+348.98%)
Mutual labels:  scrapy
Spider job
Crawler for job-listing website data
Stars: ✭ 234 (+377.55%)
Mutual labels:  scrapy
Py Elasticsearch Django
A ten-million-record-scale search engine developed in Python
Stars: ✭ 207 (+322.45%)
Mutual labels:  scrapy
Stealer
Video watermark removal tool for Douyin, Kuaishou, Huoshan, and Pipixia
Stars: ✭ 217 (+342.86%)
Mutual labels:  scrapy
Filesensor
Crawler-based tool for dynamic detection of sensitive files
Stars: ✭ 227 (+363.27%)
Mutual labels:  scrapy
News spider
News scraping (WeChat, Weibo, Toutiao...)
Stars: ✭ 190 (+287.76%)
Mutual labels:  scrapy
estate-crawler
Scraping the real estate agencies for up-to-date house listings as soon as they arrive!
Stars: ✭ 20 (-59.18%)
Mutual labels:  scrapy
Livetv mining
Data collection from live-streaming websites
Stars: ✭ 188 (+283.67%)
Mutual labels:  scrapy
Spiderkeeper
admin ui for scrapy/open source scrapinghub
Stars: ✭ 2,562 (+5128.57%)
Mutual labels:  scrapy
lgcrawl
python+scrapy+splash crawler for job listings across the whole Lagou site
Stars: ✭ 22 (-55.1%)
Mutual labels:  scrapy
domains
World’s single largest Internet domains dataset
Stars: ✭ 461 (+840.82%)
Mutual labels:  scrapy
Ecommercecrawlers
Gitee repository: AJay13/ECommerceCrawlers. GitHub repository: DropsDevopsOrg/ECommerceCrawlers. Project showcase: http://wechat.doonsec.com
Stars: ✭ 3,073 (+6171.43%)
Mutual labels:  scrapy

Arche

pip install arche

Arche (pronounced Arkey) helps verify scraped data using a set of defined rules, for example:

  • Validation with JSON schema
  • Coverage (items, fields, categorical data, including booleans and enums)
  • Duplicates
  • Garbage symbols
  • Comparison of two jobs

We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.
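
To make the rule categories above more concrete, here is a minimal, library-agnostic sketch of three of them (coverage, duplicates, garbage symbols) in plain Python. It does not use or reproduce Arche's API: the sample items, the field names, the function names and the regular expression that defines "garbage" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample of scraped items; field names are illustrative only.
items = [
    {"_key": "0", "name": "Blue mug", "price": "4.50", "in_stock": True},
    {"_key": "1", "name": "Red mug", "price": None, "in_stock": False},
    {"_key": "2", "name": "Blue mug", "price": "4.50", "in_stock": True},
    {"_key": "3", "name": "<b>Green mug</b>", "price": "5.00", "in_stock": True},
]


def field_coverage(items, fields):
    """Return, per field, the share of items holding a non-empty value."""
    total = len(items)  # assumes a non-empty list of items
    return {
        field: sum(1 for it in items if it.get(field) not in (None, "")) / total
        for field in fields
    }


def find_duplicates(items, fields):
    """Return groups of item keys whose values match on all given fields."""
    groups = {}
    for it in items:
        signature = tuple(it.get(field) for field in fields)
        groups.setdefault(signature, []).append(it["_key"])
    return [keys for keys in groups.values() if len(keys) > 1]


# Assumed definition of "garbage": leftover HTML tags or HTML entities.
GARBAGE = re.compile(r"<[^>]+>|&[a-z]+;|&#\d+;")


def find_garbage(items, field):
    """Return keys of items whose string value in `field` matches GARBAGE."""
    return [
        it["_key"]
        for it in items
        if isinstance(it.get(field), str) and GARBAGE.search(it[field])
    ]


coverage = field_coverage(items, ["name", "price", "in_stock"])
duplicates = find_duplicates(items, ["name", "price"])
garbage = find_garbage(items, "name")
# Distribution of a categorical (here boolean) field.
categories = Counter(it["in_stock"] for it in items)

print(coverage, duplicates, garbage, categories)
```

Each check returns plain data (ratios or lists of item keys), so the results can be inspected in a notebook cell or turned into pass/fail thresholds.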

Installation

Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.

For JupyterLab, you will need to properly install the plotly extensions.

Then just pip install arche

Why

Arche exists to check the quality of scraped data continuously. For example, after scraping a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up Spidermon.
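
One way to picture a continuous check is to compare field coverage between an earlier job and the latest one and flag fields that got worse. The sketch below is plain Python and is not Arche's or Spidermon's API; the function names, the sample data and the 5% threshold are assumptions chosen for illustration.

```python
def coverage(items, field):
    """Share of items with a non-empty value for `field` (items must be non-empty)."""
    return sum(1 for it in items if it.get(field) not in (None, "")) / len(items)


def coverage_drops(previous_items, latest_items, fields, threshold=0.05):
    """Return fields whose coverage fell by more than `threshold` between two jobs."""
    drops = {}
    for field in fields:
        diff = coverage(previous_items, field) - coverage(latest_items, field)
        if diff > threshold:
            drops[field] = round(diff, 4)
    return drops


# Hypothetical items from two runs of the same spider.
previous_job = [{"name": "A", "price": "1"}, {"name": "B", "price": "2"}]
latest_job = [{"name": "A", "price": None}, {"name": "B", "price": "2"}]

drops = coverage_drops(previous_job, latest_job, ["name", "price"])
print(drops)
```

An empty result means no field lost more coverage than the threshold allows; a non-empty one points at the fields to investigate first.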

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.