
A1014280203 / Ugly-Distributed-Crawler

License: MPL-2.0
A dead-simple distributed crawler built on Redis

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Ugly-Distributed-Crawler

sfm
simple file manager
Stars: ✭ 163 (+262.22%)
Mutual labels:  simple
plain-modal
The simple library for customizable modal window.
Stars: ✭ 21 (-53.33%)
Mutual labels:  simple
perf
PERF is an Exhaustive Repeat Finder
Stars: ✭ 26 (-42.22%)
Mutual labels:  simple
degiro-trading-tracker
Simplified tracking of your investments
Stars: ✭ 16 (-64.44%)
Mutual labels:  simple
CleanUI
Android library to create beautiful, clean and minimal UIs.
Stars: ✭ 19 (-57.78%)
Mutual labels:  simple
HTML-Crypto-Currency-Chart-Snippets
💹 Simple HTML Snippets to create Tickers / Charts of Cryptocurrencies with the TradingView API 💹
Stars: ✭ 89 (+97.78%)
Mutual labels:  simple
touchMyRipple
A simple library for applying the ripple effect wherever you want
Stars: ✭ 19 (-57.78%)
Mutual labels:  simple
hascal
Hascal is a general purpose and open source programming language designed to build optimal, maintainable, reliable and efficient software.
Stars: ✭ 56 (+24.44%)
Mutual labels:  simple
DM-BOT
📧 DM-BOT is a Discord bot that can record direct messages. One of us! You can also reply to those messages! DM-BOT is easy to use & understand! I decided to use Discord.js, it's literally the best.
Stars: ✭ 31 (-31.11%)
Mutual labels:  simple
elcalc
➗ Cross-Platform calculator built with Electron!
Stars: ✭ 88 (+95.56%)
Mutual labels:  simple
react-native-panel
A Customizable React Native Panel for Android and iOS
Stars: ✭ 35 (-22.22%)
Mutual labels:  simple
simple json
Simple way to dynamically convert from and to JSON using build-time generators given a type.
Stars: ✭ 15 (-66.67%)
Mutual labels:  simple
Creamy
A simple CMS in the style of Perch.
Stars: ✭ 32 (-28.89%)
Mutual labels:  simple
add-to-calendar-button
A convenient JavaScript snippet, which lets you create beautiful buttons, where people can add events to their calendars.
Stars: ✭ 697 (+1448.89%)
Mutual labels:  simple
simple-debug.css
Debug your layouts with one line of CSS
Stars: ✭ 32 (-28.89%)
Mutual labels:  simple
wasm-joey
Serverless Wasm - A lightweight Node.js application for deploying and executing WebAssembly(Wasm) binary-code via HTTP
Stars: ✭ 48 (+6.67%)
Mutual labels:  simple
double-sdk
A simple way to write CS:GO cheats!
Stars: ✭ 15 (-66.67%)
Mutual labels:  simple
Simple-YouTube-Downloader
YouTube download client with focus on simplicity
Stars: ✭ 31 (-31.11%)
Mutual labels:  simple
django-menu-generator
A straightforward menu generator for Django
Stars: ✭ 24 (-46.67%)
Mutual labels:  simple
untheme
A blank WordPress theme for developers.
Stars: ✭ 82 (+82.22%)
Mutual labels:  simple

Ugly-Distributed-Crawler

A bare-bones distributed crawler

A beginner-oriented distributed crawler built on Redis. As a working example it crawls posts from the Chinese postgraduate-exam forum (考研网), parses them with PyQuery and lxml, and stores the text of qualifying articles in a MySQL database.

Architecture overview

cooperator

Cooperation module; supplies proxy IP support to the master & worker modules

master

Extracts article URLs that meet the filter conditions and hands them off to the worker for further processing

worker

Parses article content and stores qualifying articles in the database

Dependencies

sqlalchemy >= 1.0.13
pyquery >= 1.2.17
requests >= 2.12.3
redis >= 2.10.5
lxml >= 3.6.0

  1. MySQL server and Redis server must be installed in advance.
  2. MySQL should contain a database named kybsrc with a table named posts that has two columns: num (INT AUTO_INCREMENT) and post (TEXT); see the SQLAlchemy sketch below.
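A minimal sketch of how that schema could be created with SQLAlchemy (already a project dependency); the Post class name, the pymysql driver, and the connection string are assumptions, not taken from the repo.

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Post(Base):
    __tablename__ = 'posts'
    num = sa.Column(sa.Integer, primary_key=True, autoincrement=True)  # AUTO_INCREMENT key
    post = sa.Column(sa.Text)                                          # article text

# assumes the pymysql driver and placeholder credentials;
# the kybsrc database itself must already exist
engine = sa.create_engine('mysql+pymysql://user:password@127.0.0.1:3306/kybsrc')
Base.metadata.create_all(engine)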

How to start

0. First, set up the configuration files that each module reads

In particular the IP addresses and ports of the Redis and MySQL servers, and the usernames and passwords used to log in; a hypothetical example follows.
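A hypothetical sketch of what such a config module might hold; the field names mirror the r_server dict used in the snippets further down, but every value here is a placeholder, not taken from the repo.

# Redis connection info, consumed as r_server[...] by the code below
r_server = {
    'ip': '127.0.0.1',          # Redis server address
    'port': 6379,               # Redis port
    'passwd': 'redis-password', # Redis auth password
    's_name': 'proxy_pool',     # name of the set holding validated proxy IPs
}

# MySQL connection string for SQLAlchemy (placeholder credentials)
mysql_uri = 'mysql+pymysql://user:password@127.0.0.1:3306/kybsrc'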

1. For best results, cooperator/start.py should be started first and allowed to complete one run of its work function

After the first run, the work function is executed again every five minutes.
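A minimal sketch of that schedule, assuming a work() callable that refreshes the proxy pool; the real cooperator may schedule itself differently.

import time

def run_cooperator(work, interval=300):
    # run once immediately, then again every five minutes
    while True:
        work()               # e.g. fetch, validate and save a fresh batch of proxy IPs
        time.sleep(interval)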

2. Start master/start.py

By default it runs only once.

3. Start worker/start.py

By default it loops, listening for new URLs waiting to be parsed.
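A hypothetical sketch of that listen loop; the set name 'url_set' and the parse_and_store() helper are placeholders for whatever worker/start.py actually uses.

import time

def listen_for_urls(redis_handler, url_set='url_set'):
    # poll the shared Redis set for URLs published by the master
    while True:
        url = redis_handler.spop(url_set)   # returns None when the set is empty
        if url is None:
            time.sleep(1)                   # back off briefly, then poll again
            continue
        parse_and_store(url)                # placeholder: PyQuery/lxml parse + MySQL insert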

Key implementation notes

1. Proxy IPs and URLs are passed around through Redis sets

# Summary Reference
# ---------
import redis

# Create a Redis handle (r_server comes from the config file)
def make_redis_handler():
    pool = redis.ConnectionPool(host=r_server['ip'], port=r_server['port'], password=r_server['passwd'])
    return redis.Redis(connection_pool=pool)

# Obtain a handle for the proxy pool (reuses the Redis handle)
def make_proxy_handler():
    return make_redis_handler()

# Save a validated proxy into the designated set (method of the cooperator class)
def check_and_save(self, proxy):
    # ... validation details omitted ...
    self.redis_handler.sadd(r_server['s_name'], proxy)
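On the consuming side, a minimal sketch (assumed, not taken from the repo) of how a caller could draw a random validated proxy from the same set:

def get_random_proxy(redis_handler):
    # srandmember picks a random member without removing it from the set
    proxy = redis_handler.srandmember(r_server['s_name'])
    return proxy.decode() if proxy else None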

2. Validating proxy IPs and calling the wrapped get_url() function both involve a lot of network IO, so multithreading is used (the speed-up is quite noticeable).

# Summary Reference
# ---------
from threading import Thread
from requests import request

def save_proxy_ip(self):
    # ... setup omitted ...
    # validate each candidate proxy in its own thread; the work is network-bound
    for proxy in self.proxy_ip:
        Thread(target=self.check_and_save, args=(proxy,)).start()

def get_url(url):
    # ... proxy selection and headers omitted ...
    while True:
        resp = request('get', url, headers=headers, proxies={'http': proxy})
        # ... retry / return logic omitted ...
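Threads work well here despite Python's GIL because the bottleneck is network IO rather than CPU: while one thread waits on a socket, the others can keep making progress.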

Project repository

https://github.com/A1014280203/Ugly-Distributed-Crawler
