
gaojiuli / Gain

License: GPL-3.0
Web crawling framework based on asyncio.

Programming Languages

python

Projects that are alternatives of or similar to Gain

Owllook
owllook - a novel search engine
Stars: ✭ 2,163 (+8.04%)
Mutual labels:  asyncio, spider, aiohttp, uvloop
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (-31.77%)
Mutual labels:  asyncio, crawler, spider, aiohttp
Fooproxy
A robust, efficient, scored and targeted IP proxy pool with an API service. You can plug in your own collectors to crawl proxy IPs, building a database of proxies validated separately against each of your crawler's target sites. Supports MongoDB 4.0; uses Python 3.7. (Scored IP proxy pool; custom proxy data crawlers can be added at any time.)
Stars: ✭ 195 (-90.26%)
Mutual labels:  asyncio, crawler, spider, aiohttp
Ok ip proxy pool
🍿 A crawler proxy IP pool (proxy pool) in Python 🍟 - a decent IP proxy pool.
Stars: ✭ 196 (-90.21%)
Mutual labels:  crawler, spider, aiohttp
yutto
🧊 A cute and capricious bilibili video downloader (bilili V2)
Stars: ✭ 383 (-80.87%)
Mutual labels:  spider, aiohttp, asyncio
Js Reverse
JavaScript reverse-engineering research
Stars: ✭ 159 (-92.06%)
Mutual labels:  crawler, spider
Go spider
[Crawler framework (golang)] An awesome Go concurrent crawler (spider) framework. The crawler is flexible and modular; it can easily be extended into an individualized crawler, or you can use just the default crawl components.
Stars: ✭ 1,745 (-12.84%)
Mutual labels:  crawler, spider
Amazonbigspider
😱 Fully automatic distributed Amazon spider | collects and selects products from four Amazon international sites | account: admin, password: adminadmin
Stars: ✭ 140 (-93.01%)
Mutual labels:  crawler, spider
Python Simple Rest Client
Simple REST client for Python 3.6+
Stars: ✭ 143 (-92.86%)
Mutual labels:  asyncio, aiohttp
Backendschool2019
A companion application for a hands-on guide to building backend services in Python (based on the entrance test for the Yandex Backend Development School)
Stars: ✭ 129 (-93.56%)
Mutual labels:  asyncio, aiohttp
Aioinflux
Asynchronous Python client for InfluxDB
Stars: ✭ 142 (-92.91%)
Mutual labels:  asyncio, aiohttp
Cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Stars: ✭ 148 (-92.61%)
Mutual labels:  crawler, aiohttp
Pymxget
A Python implementation of mxget
Stars: ✭ 136 (-93.21%)
Mutual labels:  asyncio, aiohttp
Aiohttp
Asynchronous HTTP client/server framework for asyncio and Python
Stars: ✭ 11,972 (+498%)
Mutual labels:  asyncio, aiohttp
Yispider
A distributed crawler platform that helps you manage and develop crawlers. It ships with a set of crawler definition rules (templates) for defining crawlers quickly, and it can also be used as a framework for developing crawlers by hand. (A hobby project, updated whenever it gets annoying to use.)
Stars: ✭ 158 (-92.11%)
Mutual labels:  crawler, spider
Mm131
Image crawler for the MM131 website 🚨
Stars: ✭ 129 (-93.56%)
Mutual labels:  crawler, spider
Crawler China Mainland Universities
Crawler for the list of universities in mainland China
Stars: ✭ 143 (-92.86%)
Mutual labels:  crawler, spider
Python3 Spider
Python crawling in practice - simulated login for major websites, including but not limited to slider CAPTCHAs, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star it ❤️
Stars: ✭ 2,129 (+6.34%)
Mutual labels:  crawler, spider
Jlitespider
A lite distributed Java spider framework :-)
Stars: ✭ 151 (-92.46%)
Mutual labels:  crawler, spider
Aiozipkin
Distributed tracing instrumentation for asyncio with zipkin
Stars: ✭ 161 (-91.96%)
Mutual labels:  asyncio, aiohttp


Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.
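For orientation, the core idiom the framework builds on is concurrent page fetching with aiohttp on an asyncio event loop. A minimal sketch of that idiom (not gain's actual internals, just the pattern it automates for you):

import asyncio
import aiohttp

async def fetch(session, url):
    # Fetch one page's HTML; the framework schedules many of these at once.
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # gather() runs all the fetches concurrently on a single event loop.
        pages = await asyncio.gather(*[fetch(session, u) for u in urls])
        for url, page in zip(urls, pages):
            print(url, len(page))

loop = asyncio.get_event_loop()
loop.run_until_complete(main(['https://blog.scrapinghub.com/']))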

Requirements

  • Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only, for a faster event loop)
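uvloop is a drop-in replacement event loop that speeds up asyncio. Whether gain switches to it automatically when it is installed is not stated here; in your own asyncio code, the standard way to enable it is the event-loop policy:

import asyncio
import uvloop  # Linux only, per the note above

# Make asyncio create uvloop event loops from now on.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())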

Usage

  1. Write spider.py:
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # CSS selectors naming the fields to extract from each matched page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append each extracted title to a file; the newline keeps titles from running together.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'] + '\n')


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings so the regex escapes (\d) are not treated as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
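The Parser rules above are regular expressions matched against URLs found while crawling. They can be sanity-checked on their own with the standard re module (the sample URLs below are made up):

import re

# The same rules passed to Parser above.
page_rule = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
post_rule = re.compile(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')

assert page_rule.match('https://blog.scrapinghub.com/page/2/')
assert post_rule.match('https://blog.scrapinghub.com/2018/01/15/some-post-title/')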

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        # Access the extracted value through results, as in the first example.
        print(self.results['title'])


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
               # Follow category links.
               XPathParser('//span[@class="category-name"]/a/@href'),
               # Follow pagination links.
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               # Parse each linked page into a Post item.
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'


MySpider.run()
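The XPath expressions can be tried out independently with lxml (whether gain uses lxml internally is not stated here, so this assumes standard XPath semantics):

from lxml import html

snippet = '<span class="category-name"><a href="/category/us-drama/">US Drama</a></span>'
doc = html.fromstring(snippet)
# Same expression as the first XPathParser above; @href returns the link targets.
print(doc.xpath('//span[@class="category-name"]/a/@href'))  # ['/category/us-drama/']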

You can route the spider's requests through a proxy by setting the proxy class attribute, as shown above.
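Since gain is built on aiohttp, the proxy URL is presumably handed to aiohttp's per-request proxy argument (an assumption about gain's internals). For reference, the equivalent raw aiohttp call looks like this; note that aiohttp's own client accepts only http:// proxy URLs, so the sketch uses one:

import asyncio
import aiohttp

async def fetch_via_proxy(url, proxy_url):
    async with aiohttp.ClientSession() as session:
        # Route a single request through the given proxy.
        async with session.get(url, proxy=proxy_url) as resp:
            return resp.status

# Hypothetical local proxy for illustration.
loop = asyncio.get_event_loop()
print(loop.run_until_complete(
    fetch_via_proxy('https://blog.scrapinghub.com/', 'http://localhost:1234')))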

  2. Run python spider.py.

  3. Check the results: the first spider appends post titles to scrapinghub.txt; the second prints titles to stdout.

Example

The examples are in the /example/ directory.

Contribution

  • Open a pull request.
  • Open an issue.