
cocrawler / Cocrawler

License: Apache-2.0
CoCrawler is a versatile web crawler built using modern tools and concurrency.

Programming Languages

python
139,335 projects (#7 most used programming language)
python3
1,442 projects

Projects that are alternatives of or similar to Cocrawler

Ok ip proxy pool
🍿 A crawler proxy IP pool (proxy pool) in Python 🍟: a pretty decent IP proxy pool
Stars: ✭ 196 (+32.43%)
Mutual labels:  crawler, aiohttp
Gain
Web crawling framework based on asyncio.
Stars: ✭ 2,002 (+1252.7%)
Mutual labels:  crawler, aiohttp
Fooproxy
A robust, efficient, scored and targeted IP proxy pool with an API service. You can plug in your own collectors to crawl proxy IPs, building a separate database of working proxies for each of your crawler's target sites. Supports MongoDB 4.0; uses Python 3.7. (Scored IP proxy pool; custom proxy data crawlers can be added at any time.)
Stars: ✭ 195 (+31.76%)
Mutual labels:  crawler, aiohttp
Crawler
An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
Stars: ✭ 2,055 (+1288.51%)
Mutual labels:  concurrency, crawler
python3-concurrency
Theory validation for a Python 3 crawler series. First it studies I/O models, implementing TCP servers and clients in Python under blocking I/O, nonblocking I/O, and I/O multiplexing. It then compares the efficiency of synchronous I/O (sequential download, multi-process concurrency, multi-thread concurrency) against asynchronous I/O (asyncio).
Stars: ✭ 49 (-66.89%)
Mutual labels:  concurrency, aiohttp
Python3 Concurrency Pics 02
Crawls every image on www.mzitu.com: 5,162 galleries and more than 165,000 photos to date. The asynchronous version, built with asyncio and aiohttp, finishes the whole crawl in under 2 hours. Galleries are saved into directories by date; the console shows only a download progress bar, with details written to a log file. Exceptions are handled without terminating the crawler, and failed requests are retried automatically on the next run.
Stars: ✭ 275 (+85.81%)
Mutual labels:  concurrency, aiohttp
snapcrawl
Crawl a website and take screenshots
Stars: ✭ 37 (-75%)
Mutual labels:  screenshot, crawler
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (+822.97%)
Mutual labels:  crawler, aiohttp
Crawler China Mainland Universities
A crawler for the list of universities in mainland China
Stars: ✭ 143 (-3.38%)
Mutual labels:  crawler
Tascalate Concurrent
Implementation of blocking (IO-Bound) cancellable java.util.concurrent.CompletionStage and related extensions to java.util.concurrent.ExecutorService-s
Stars: ✭ 144 (-2.7%)
Mutual labels:  concurrency
Chymyst Core
Declarative concurrency in Scala - The implementation of the chemical machine
Stars: ✭ 142 (-4.05%)
Mutual labels:  concurrency
Python Simple Rest Client
Simple REST client for Python 3.6+
Stars: ✭ 143 (-3.38%)
Mutual labels:  aiohttp
Asyncninja
A complete set of primitives for concurrency and reactive programming on Swift
Stars: ✭ 146 (-1.35%)
Mutual labels:  concurrency
Google Play Scraper
Google Play scraper for Python inspired by facundoolano/google-play-scraper
Stars: ✭ 143 (-3.38%)
Mutual labels:  crawler
Th Music Video Generator
Touhou Project random music video generator/player, crawling image and video from websites to generate MV.
Stars: ✭ 146 (-1.35%)
Mutual labels:  crawler
Aioinflux
Asynchronous Python client for InfluxDB
Stars: ✭ 142 (-4.05%)
Mutual labels:  aiohttp
Robots Txt
Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
Stars: ✭ 142 (-4.05%)
Mutual labels:  crawler
Pachong
Assorted crawler code
Stars: ✭ 147 (-0.68%)
Mutual labels:  crawler
Javpy
Enjoy driving on a Javascriptive (originally Pythonic) way to Japanese AV!
Stars: ✭ 147 (-0.68%)
Mutual labels:  crawler
Indonesian Nlp Resources
Data resources for Indonesian NLP
Stars: ✭ 143 (-3.38%)
Mutual labels:  crawler

CoCrawler


CoCrawler is a versatile web crawler built using modern tools and concurrency.

Crawling the web can be easy or hard, depending upon the details. Mature crawlers like Nutch and Heritrix work great in many situations, and fall short in others. Some of the most demanding crawl situations include open-ended crawling of the whole web.

The goal of this project is to create a modular crawler with pluggable modules, capable of working well for a wide variety of crawl tasks. The core of the crawler is written in Python 3.5+ using coroutines.
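
To give a feel for that coroutine style, here is a minimal sketch of bounded-concurrency fetching with asyncio and aiohttp. It is illustrative only, not CoCrawler's actual code, and it uses asyncio.run() from Python 3.7+ for brevity:

import asyncio
import aiohttp

async def fetch(session, url):
    # Fetch one page; catch client errors so one bad URL doesn't kill the batch.
    try:
        async with session.get(url) as resp:
            body = await resp.text()
            return url, resp.status, len(body)
    except aiohttp.ClientError as exc:
        return url, None, str(exc)

async def crawl(urls, concurrency=10):
    # A semaphore bounds the number of simultaneous requests.
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session, url):
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded_fetch(session, u) for u in urls))

if __name__ == '__main__':
    for result in asyncio.run(crawl(['https://example.com/'])):
        print(result)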

Status

CoCrawler is pre-release software undergoing major restructuring. It can currently crawl at around 170 megabits per second, roughly 170 pages per second, on a 4-core machine.


Installing

We recommend using pyenv, because (1) CoCrawler requires Python 3.5+, and (2) requirements.txt pins exact module versions.

git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
make init  # will install requirements using pip
make pytest
make test_coverage

Pluggable Modules

Pluggable modules make policy decisions, and use utility routines to keep policy modules short and sweet.

An additional set of pluggable modules provide support for a variety of databases. These databases are mostly used to orchestrate the cooperation of multiple crawl processes, enabling the horizontal scalability of the crawler over many cores and many nodes.

Crawled web assets are intended to be stored as WARC files, although this interface should also be pluggable.
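
As a rough illustration of what a policy module might look like, here is a hypothetical sketch; the class and method names (UrlAllowedPolicy, should_crawl) are invented for this example and are not CoCrawler's API:

from urllib.parse import urlparse

class UrlAllowedPolicy:
    # Hypothetical policy module: decides whether a discovered URL
    # should enter the crawl frontier. Not CoCrawler's actual interface.
    def __init__(self, allowed_schemes=('http', 'https'), banned_hosts=()):
        self.allowed_schemes = set(allowed_schemes)
        self.banned_hosts = set(banned_hosts)

    def should_crawl(self, url):
        parts = urlparse(url)
        if parts.scheme not in self.allowed_schemes:
            return False
        if parts.hostname in self.banned_hosts:
            return False
        return True

Keeping each decision behind a small interface like this is what lets the same crawler core serve very different crawl tasks.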

Ranking

Everyone knows that ranking is extremely important to search queries, but it's also important to crawling. Crawling the most important stuff is one of the best ways to avoid crawling too much webspam, soft 404s, and crawler trap pages.

SEO is a multi-billion-dollar industry created to game search engine ranking, and any crawl of a wide swath of the web will run into poor-quality content trying to pass itself off as high quality. There's little chance that CoCrawler's algorithms will beat the most sophisticated SEO techniques, but a little ranking goes a long way.
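
As a toy example of how a little ranking can steer a frontier, here is a hypothetical priority heuristic; the function and its weights are invented for illustration and are not CoCrawler's algorithm:

from urllib.parse import urlparse

def url_priority(url, inlink_count):
    # Toy heuristic: lower score means crawl sooner. Prefer well-linked,
    # shallow URLs; penalize query strings, which often mark crawler traps.
    path_depth = urlparse(url).path.count('/')
    query_penalty = 5 if '?' in url else 0
    return path_depth + query_penalty - min(inlink_count, 10)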

Credits

CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or Less", which can be found at https://github.com/aosabook/500lines. It is also heavily influenced by the experiences that Greg acquired while working at blekko and the Internet Archive.

License

Apache 2.0

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].