
cocrawler / Cocrawler

License: Apache-2.0
CoCrawler is a versatile web crawler built using modern tools and concurrency.

Programming Languages

python
139,335 projects (#7 most used programming language)
python3
1,442 projects

Projects that are alternatives of or similar to Cocrawler

Ok ip proxy pool
🍿 A crawler proxy IP pool (proxy pool) in Python 🍟: a pretty decent IP proxy pool
Stars: ✭ 196 (+32.43%)
Mutual labels:  crawler, aiohttp
Gain
Web crawling framework based on asyncio.
Stars: ✭ 2,002 (+1252.7%)
Mutual labels:  crawler, aiohttp
Fooproxy
A robust, efficient, scored and targeted IP proxy pool with an API service. You can plug in your own collectors to crawl proxy IPs, building a separate database of working proxies for each of your crawler's target sites. Supports MongoDB 4.0; uses Python 3.7. (Scored IP proxy pool; custom proxy data crawlers can be added at any time.)
Stars: ✭ 195 (+31.76%)
Mutual labels:  crawler, aiohttp
Crawler
An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
Stars: ✭ 2,055 (+1288.51%)
Mutual labels:  concurrency, crawler
python3-concurrency
Theory validation for a Python 3 crawler series. First it studies I/O models, implementing TCP servers and clients in Python under blocking I/O, nonblocking I/O, and I/O multiplexing. It then compares the efficiency of synchronous I/O (sequential download, multi-process concurrency, multi-thread concurrency) against asynchronous I/O (asyncio).
Stars: ✭ 49 (-66.89%)
Mutual labels:  concurrency, aiohttp
Python3 Concurrency Pics 02
Crawls every image on www.mzitu.com: 5,162 galleries and more than 165,000 photos to date. The asynchronous version, built with asyncio and aiohttp, finishes the whole crawl in under 2 hours. Galleries are saved into directories by date; the console shows only a download progress bar, with details written to a log file. Exceptions are handled without terminating the crawler, and failed requests are retried automatically on the next run.
Stars: ✭ 275 (+85.81%)
Mutual labels:  concurrency, aiohttp
snapcrawl
Crawl a website and take screenshots
Stars: ✭ 37 (-75%)
Mutual labels:  screenshot, crawler
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (+822.97%)
Mutual labels:  crawler, aiohttp
Crawler China Mainland Universities
A crawler for the list of universities in mainland China
Stars: ✭ 143 (-3.38%)
Mutual labels:  crawler
Tascalate Concurrent
Implementation of blocking (IO-Bound) cancellable java.util.concurrent.CompletionStage and related extensions to java.util.concurrent.ExecutorService-s
Stars: ✭ 144 (-2.7%)
Mutual labels:  concurrency
Chymyst Core
Declarative concurrency in Scala - The implementation of the chemical machine
Stars: ✭ 142 (-4.05%)
Mutual labels:  concurrency
Python Simple Rest Client
Simple REST client for Python 3.6+
Stars: ✭ 143 (-3.38%)
Mutual labels:  aiohttp
Asyncninja
A complete set of primitives for concurrency and reactive programming on Swift
Stars: ✭ 146 (-1.35%)
Mutual labels:  concurrency
Google Play Scraper
Google Play scraper for Python inspired by facundoolano/google-play-scraper
Stars: ✭ 143 (-3.38%)
Mutual labels:  crawler
Th Music Video Generator
Touhou Project random music video generator/player, crawling image and video from websites to generate MV.
Stars: ✭ 146 (-1.35%)
Mutual labels:  crawler
Aioinflux
Asynchronous Python client for InfluxDB
Stars: ✭ 142 (-4.05%)
Mutual labels:  aiohttp
Robots Txt
Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
Stars: ✭ 142 (-4.05%)
Mutual labels:  crawler
Pachong
Assorted crawler code
Stars: ✭ 147 (-0.68%)
Mutual labels:  crawler
Javpy
Enjoy driving on a Javascriptive (originally Pythonic) way to Japanese AV!
Stars: ✭ 147 (-0.68%)
Mutual labels:  crawler
Indonesian Nlp Resources
Data resources for Indonesian NLP
Stars: ✭ 143 (-3.38%)
Mutual labels:  crawler

CoCrawler


CoCrawler is a versatile web crawler built using modern tools and concurrency.

Crawling the web can be easy or hard, depending upon the details. Mature crawlers like Nutch and Heritrix work great in many situations, and fall short in others. Some of the most demanding crawl situations include open-ended crawling of the whole web.

The goal of this project is to create a modular crawler with pluggable modules, capable of working well for a wide variety of crawl tasks. The core of the crawler is written in Python 3.5+ using coroutines.
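
To give a feel for that coroutine style, here is a minimal sketch of bounded-concurrency fetching with asyncio and aiohttp. It is illustrative only, not CoCrawler's actual code, and it uses asyncio.run() from Python 3.7+ for brevity:

import asyncio
import aiohttp

async def fetch(session, url):
    # Fetch one page; catch client errors so one bad URL doesn't kill the batch.
    try:
        async with session.get(url) as resp:
            body = await resp.text()
            return url, resp.status, len(body)
    except aiohttp.ClientError as exc:
        return url, None, str(exc)

async def crawl(urls, concurrency=10):
    # A semaphore bounds the number of simultaneous requests.
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session, url):
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded_fetch(session, u) for u in urls))

if __name__ == '__main__':
    for result in asyncio.run(crawl(['https://example.com/'])):
        print(result)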

Status

CoCrawler is pre-release software undergoing major restructuring. It can currently crawl at around 170 megabits per second, roughly 170 pages per second, on a 4-core machine.


Installing

We recommend using pyenv, because (1) CoCrawler requires Python 3.5+, and (2) requirements.txt pins exact module versions.

git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
make init  # will install requirements using pip
make pytest
make test_coverage

Pluggable Modules

Pluggable modules make policy decisions, and use utility routines to keep policy modules short and sweet.

An additional set of pluggable modules provide support for a variety of databases. These databases are mostly used to orchestrate the cooperation of multiple crawl processes, enabling the horizontal scalability of the crawler over many cores and many nodes.

Crawled web assets are intended to be stored as WARC files, although this interface should also be pluggable.
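
As a rough illustration of what a policy module might look like, here is a hypothetical sketch; the class and method names (UrlAllowedPolicy, should_crawl) are invented for this example and are not CoCrawler's API:

from urllib.parse import urlparse

class UrlAllowedPolicy:
    # Hypothetical policy module: decides whether a discovered URL
    # should enter the crawl frontier. Not CoCrawler's actual interface.
    def __init__(self, allowed_schemes=('http', 'https'), banned_hosts=()):
        self.allowed_schemes = set(allowed_schemes)
        self.banned_hosts = set(banned_hosts)

    def should_crawl(self, url):
        parts = urlparse(url)
        if parts.scheme not in self.allowed_schemes:
            return False
        if parts.hostname in self.banned_hosts:
            return False
        return True

Keeping each decision behind a small interface like this is what lets the same crawler core serve very different crawl tasks.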

Ranking

Everyone knows that ranking is extremely important to search queries, but it's also important to crawling. Crawling the most important stuff is one of the best ways to avoid crawling too much webspam, soft 404s, and crawler trap pages.

SEO is a multi-billion-dollar industry created to game search engine ranking, and any crawl of a wide swath of the web will run into poor-quality content trying to pass itself off as high quality. There's little chance that CoCrawler's algorithms will beat the most sophisticated SEO techniques, but a little ranking goes a long way.
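
As a toy example of how a little ranking can steer a frontier, here is a hypothetical priority heuristic; the function and its weights are invented for illustration and are not CoCrawler's algorithm:

from urllib.parse import urlparse

def url_priority(url, inlink_count):
    # Toy heuristic: lower score means crawl sooner. Prefer well-linked,
    # shallow URLs; penalize query strings, which often mark crawler traps.
    path_depth = urlparse(url).path.count('/')
    query_penalty = 5 if '?' in url else 0
    return path_depth + query_penalty - min(inlink_count, 10)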

Credits

CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or Less", which can be found at https://github.com/aosabook/500lines. It is also heavily influenced by the experiences that Greg acquired while working at blekko and the Internet Archive.

License

Apache 2.0

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].