cwjokaka / Ok_ip_proxy_pool
License: MIT
🍿 Web-crawler proxy IP pool (proxy pool) in Python 🍟: a decent IP proxy pool
Stars: ✭ 196
Projects that are alternatives to, or similar to, Ok_ip_proxy_pool
Proxy pool
Python web-crawler proxy IP pool (proxy pool)
Stars: ✭ 13,964 (+7024.49%)
Mutual labels: crawler, spider, flask, proxy, proxypool
Fooproxy
A robust and efficient scored, target-specific IP proxy pool + API service. You can plug in your own collectors to crawl proxy IPs and build a separate database of proxies validated against each target site of your crawler. Supports MongoDB 4.0; uses Python 3.7. (Scored IP proxy pool; custom proxy data crawlers can be added at any time.)
Stars: ✭ 195 (-0.51%)
Mutual labels: async, crawler, spider, aiohttp, proxypool
Spoon
🥄 A package for building specific Proxy Pool for different Sites.
Stars: ✭ 173 (-11.73%)
Mutual labels: crawler, spider, proxy, ip, proxypool
Free proxy website
A collection of websites offering free SOCKS/HTTPS/HTTP proxies
Stars: ✭ 119 (-39.29%)
Mutual labels: crawler, spider, proxy, ip
Proxybroker
Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
Stars: ✭ 2,767 (+1311.73%)
Mutual labels: crawler, proxy, proxypool
Gain
Web crawling framework based on asyncio.
Stars: ✭ 2,002 (+921.43%)
Mutual labels: crawler, spider, aiohttp
Ppspider
A web spider built on Puppeteer, with decorator-based task queues and task scheduling, nedb/mongodb storage, and data visualization with user interaction
Stars: ✭ 237 (+20.92%)
Mutual labels: crawler, spider, proxy
Proxypool
An Efficient ProxyPool with Getter, Tester and Server
Stars: ✭ 3,050 (+1456.12%)
Mutual labels: flask, proxy, proxypool
Weixin Spider
WeChat Official Account crawler: fetches an account's historical articles, article comments, read counts and "Wow" (在看) counts, with a web visualization page; deployable on a Windows server. Built on Python 3 with Flask/MySQL/Redis/mitmproxy/pywin32 for efficient WeChat crawling with continuous data updates.
Stars: ✭ 287 (+46.43%)
Mutual labels: crawler, spider, flask
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (+596.94%)
Mutual labels: crawler, spider, aiohttp
Awesome Python Primer
A curated index of quality Chinese-language resources for self-learning Python, including books, documentation, and videos, covering crawlers, web development, data analysis, and machine learning
Stars: ✭ 57 (-70.92%)
Mutual labels: crawler, spider, flask
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (-5.1%)
Mutual labels: crawler, spider, proxy
Nodespider
[DEPRECATED] Simple, flexible, delightful web crawler/spider package
Stars: ✭ 33 (-83.16%)
Mutual labels: async, crawler, spider
Jianso movie
🎬 Movie resource crawler and movie poster scraping scripts; Flask|Nginx|wsgi
Stars: ✭ 114 (-41.84%)
Mutual labels: crawler, sqlite, flask
Fp Server
Free proxy server that continuously crawls and serves proxies, based on Tornado and Scrapy; run your own local proxy pool
Stars: ✭ 154 (-21.43%)
Mutual labels: spider, proxy, proxypool
Scrapingoutsourcing
ScrapingOutsourcing focuses on sharing crawler code, aiming for one update per week
Stars: ✭ 164 (-16.33%)
Mutual labels: crawler, spider
Linkedin Profile Scraper
🕵️♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-12.76%)
Mutual labels: crawler, spider
ok_ip_proxy_pool 😁
A decent IP proxy pool, built for my own use first~
Requirements
- Python 3.7
Features
- Crawls and validates proxies asynchronously 🚀
- Scores each proxy's availability by weight (+1 when it passes validation, -1 when it fails) 🎭
- Uses SQLite, so no database server needs to be installed 🛴
- Free proxy sources currently supported: 免费代理 / 全网 / 66 / 西刺 / 快代理 / 云代理 / IP海
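The weighted scoring above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`Proxy`, `update_weight`); the project's actual fields and thresholds may differ:

```python
# Minimal sketch of the +1/-1 availability scoring; `Proxy` and
# `update_weight` are hypothetical names, not the project's real API.
class Proxy:
    def __init__(self, ip: str, port: int, weight: int = 0):
        self.ip = ip
        self.port = port
        self.weight = weight  # availability score

def update_weight(proxy: Proxy, passed: bool) -> int:
    # +1 when the proxy passes validation, -1 when it fails
    proxy.weight += 1 if passed else -1
    return proxy.weight

p = Proxy('1.2.3.4', 8080)
update_weight(p, True)   # score rises to 1
update_weight(p, False)  # score falls back to 0
```

A pool built this way can simply prefer proxies with the highest weight and evict those whose weight drops below some floor.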
Download & Install
- Get the source:
git clone git@github.com:cwjokaka/ok_ip_proxy_pool.git
- Install the dependencies:
pip install -r requirements.txt
Configuration

```python
# Proxy spider settings
SPIDER = {
    'crawl_interval': 120,  # interval between crawl rounds, in seconds
    'list': [               # spider classes to use (class names)
        'Spider66Ip',
        'SpiderQuanWangIp',
        'SpiderXiciIp',
        'SpiderKuaiDaiLiIp',
        'SpiderYunDaiLiIp',
        'SpiderIpHaiIp',
        'SpiderMianFeiDaiLiIp'
    ]
}

# Validator settings
VALIDATOR = {
    'test_url': 'http://www.baidu.com',  # URL used to check availability
    'request_timeout': 4,                # validation timeout, in seconds
    'validate_interval': 60              # interval between validation rounds, in seconds
}

# Anonymity-check settings
ANONYMITY_VALIDATOR = {
    'http_test_url': 'http://httpbin.org/get',   # anonymity-check URL
    'https_test_url': 'https://httpbin.org/get',
    'request_timeout': 4,                        # maximum timeout, in seconds
    'interval': 180                              # interval between checks, in seconds
}

# Database settings
DB = {
    'db_name': 'proxy.db',
    'table_name': 'proxy'
}

# Web server settings (Flask)
WEB_SERVER = {
    'host': '0.0.0.0',
    'port': '8080'
}

# Request headers used by the spiders
HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
}
```
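To illustrate how `test_url` and `request_timeout` come into play, here is a simplified synchronous sketch using only the standard library. The project itself validates asynchronously, and `check_proxy` is a hypothetical name, not the project's actual API:

```python
import urllib.request

def check_proxy(proxy_url: str,
                test_url: str = 'http://www.baidu.com',
                timeout: int = 4) -> bool:
    """Return True if `test_url` is reachable through the given proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # timeout, refused connection, unreachable or misbehaving proxy, etc.
        return False
```

A validator loop would call something like this for every stored proxy each `validate_interval` seconds and adjust the proxy's weight accordingly.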
Run
python main.py
API Usage
API | Method | Description |
---|---|---|
/ | GET | Index page |
/get | GET | Fetch one proxy |
/get_all | GET | Fetch all proxies |
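With the service running, a client can pull a proxy over plain HTTP. A hedged sketch follows: the host and port come from the WEB_SERVER settings, and the plain-text `ip:port` response format is an assumption, not documented behavior:

```python
import urllib.request

def api_url(host: str, port: str, path: str) -> str:
    # Build an endpoint URL from the WEB_SERVER settings.
    return f'http://{host}:{port}{path}'

url = api_url('127.0.0.1', '8080', '/get')
# With the service running locally, something like this would fetch a proxy
# (assuming a plain-text "ip:port" response body):
# with urllib.request.urlopen(url, timeout=4) as resp:
#     proxy = resp.read().decode()
```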
Extending the Proxy Spiders
To add a custom proxy spider:
- Open src/spider/spiders.py
- Add your spider class: inherit from AbsSpider, implement its do_crawl, get_page_range, and get_urls methods, and override other methods as needed.
- Decorate the class with @spider_register
- Add the class name to SPIDER['list'] in the setting.py configuration file
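The steps above can be sketched as follows. This is a self-contained illustration: `AbsSpider` and `spider_register` here are minimal stand-ins for the project's real base class and decorator in src/spider/spiders.py, and the site, URLs, and parsing result are hypothetical:

```python
# Stand-ins for the project's real base class and registration decorator.
SPIDER_REGISTRY = []

def spider_register(cls):
    SPIDER_REGISTRY.append(cls)
    return cls

class AbsSpider:
    def do_crawl(self):
        raise NotImplementedError
    def get_page_range(self):
        raise NotImplementedError
    def get_urls(self):
        raise NotImplementedError

@spider_register
class SpiderMyFreeIp(AbsSpider):
    """Hypothetical spider for a made-up free-proxy site."""
    def get_page_range(self):
        return range(1, 3)  # pages 1 and 2
    def get_urls(self):
        return [f'http://example.com/free/{p}' for p in self.get_page_range()]
    def do_crawl(self):
        # Real code would fetch each URL and parse ip:port pairs out of the page.
        return [('127.0.0.1', 8080)]
```

After registration, adding the class name `'SpiderMyFreeIp'` to SPIDER['list'] in setting.py would let the crawl loop pick it up.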
Finally
Fork | Star | Issue, all three are welcome 😘