yeungsk / douban-spider

Licence: other

A Douban movie crawler built on the Scrapy framework

Projects that are alternatives of or similar to douban-spider

A 163 Music spider built with Scrapy.

Stars: ✭ 60 (+140%)

Mutual labels: spider, scrapy

An open source webapp for scraping: towards a public service for webscraping

Stars: ✭ 80 (+220%)

Mutual labels: spider, scrapy

🎊 Design and implementation of a lightweight crawler framework.

Stars: ✭ 322 (+1188%)

Mutual labels: spider, scrapy

small-spider-project

Everyday crawlers

Stars: ✭ 14 (-44%)

Mutual labels: spider, scrapy

A code library of Scrapy-based data collection crawlers

Stars: ✭ 34 (+36%)

Mutual labels: spider, scrapy

NScrapy is a cross-platform distributed spider framework for .NET Core that provides an easy way to write your own spider

Stars: ✭ 88 (+252%)

Mutual labels: spider, scrapy

photo-spider-scrapy

10 photo website spiders: Scrapy crawler code for 10 overseas stock photo sites

Stars: ✭ 17 (-32%)

Mutual labels: spider, scrapy

Admin UI for Scrapy; an open source Scrapinghub

Stars: ✭ 2,562 (+10148%)

Mutual labels: spider, scrapy

python-fxxk-spider

A collection of free Python crawler projects

Stars: ✭ 184 (+636%)

Mutual labels: spider, scrapy

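Small Python crawler projects [continuously updated] [Biquge novel downloads, Tweet data scraping, weather lookup, NetEase Cloud Music reverse engineering, Tiantian Fund lookup, Weibo data scraping (cookie generation), Youdao Translate reverse engineering, Qichacha crawler without login, Dianping SVG encryption cracking, Bilibili user crawler, Lagou crawler without login, Ziroom rental font encryption, Zhihu Q&A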

Stars: ✭ 45 (+80%)

Mutual labels: spider, scrapy

Iota is a web scraper which can find all of the images and links/suburls on a webpage

Stars: ✭ 60 (+140%)

Mutual labels: spider, scrapy

scrapy facebooker

Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.

Stars: ✭ 22 (-12%)

Mutual labels: spider, scrapy

Dynamically configurable crawler

Stars: ✭ 84 (+236%)

Mutual labels: spider, scrapy

A web search engine built with Python which uses TF-IDF and PageRank to sort search results.

Stars: ✭ 52 (+108%)

Mutual labels: spider, scrapy

Crawler for recruitment site data

Stars: ✭ 234 (+836%)

Mutual labels: spider, scrapy

Scrapy IPProxyPool

A free IP proxy pool. A plugin for the Scrapy crawler framework

Stars: ✭ 100 (+300%)

Mutual labels: spider, scrapy

Py Elasticsearch Django

A search engine handling tens of millions of records, developed in Python

Stars: ✭ 207 (+728%)

Mutual labels: spider, scrapy

Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Stars: ✭ 2,601 (+10304%)

Mutual labels: spider, scrapy

scrapy-distributed

A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.

Stars: ✭ 38 (+52%)

Mutual labels: spider, scrapy

V2EX crawler

Stars: ✭ 21 (-16%)

Mutual labels: spider, scrapy

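Douban Movie Crawler

Crawl Douban movies with the Scrapy framework

Project Overview

The spider filters Douban's film and TV selection page by region (mainland China, Hong Kong, and Taiwan; other regions can be substituted) and builds Ajax requests to obtain film ids. It then constructs each film's URL from its id and parses the page to extract detailed data such as title, year, director, cast, and genre. For details, see my blog post: "Crawler in Practice (1): Scraping Chinese-Language Films from Douban with Scrapy".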
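The id-then-detail flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual spider: the search endpoint, its query parameters, and the `{"data": [{"id": ...}]}` response shape are assumptions that should be checked against the spider code, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint and URL pattern; verify against the project's spider code.
SEARCH_URL = "https://movie.douban.com/j/new_search_subjects"
DETAIL_URL = "https://movie.douban.com/subject/{id}/"


def build_search_url(region, start=0):
    """Build the Ajax URL for one page of results filtered by region."""
    # Parameter names and values here are assumptions for illustration.
    params = {
        "sort": "U",
        "range": "0,10",
        "tags": "电影",
        "start": start,
        "countries": region,
    }
    return SEARCH_URL + "?" + urlencode(params)


def extract_ids(payload):
    """Pull film ids out of a decoded JSON response (assumed shape)."""
    return [str(item["id"]) for item in payload.get("data", [])]


def build_detail_urls(ids):
    """Turn film ids into detail-page URLs."""
    return [DETAIL_URL.format(id=film_id) for film_id in ids]
```

In a Scrapy spider, the first URL would typically be requested with `scrapy.Request`, and each detail URL would be scheduled from the callback that decodes the JSON response.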

Installation

Install Python

Python 3.5 or later is required.

Install Redis and MongoDB

After installation, start the Redis and MongoDB services.

Install dependencies

pip3 install -r requirements.txt

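Running

Configure the proxy pool

cd ProxyPool
cd proxypool

Enter the proxypool directory inside ProxyPool and edit settings.py.

PASSWORD is the Redis password; set it to None if Redis has no password.

The default proxies are free ones. To add a proxy source, add a method whose name starts with crawl_ to the Crawler class in crawler.py.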
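As a rough sketch of the crawl_ naming convention, the class below discovers its proxy sources by method-name prefix. It is a hypothetical stand-in, not the project's Crawler class: the `get_proxies` helper, the `crawl_example` method, and the "host:port" string format are all assumptions, and the static addresses stand in for a real page fetch.

```python
class Crawler:
    """Hypothetical stand-in for the Crawler class in crawler.py."""

    def crawl_example(self):
        # A real source would fetch and parse a proxy listing page;
        # static values keep this sketch runnable offline.
        for host, port in [("127.0.0.1", 8080), ("127.0.0.1", 8081)]:
            yield "{}:{}".format(host, port)

    def get_proxies(self):
        """Collect proxies from every method whose name starts with crawl_."""
        proxies = []
        for name in sorted(dir(self)):
            if name.startswith("crawl_"):
                proxies.extend(getattr(self, name)())
        return proxies
```

With this layout, adding a new source only means adding another crawl_ method; nothing else has to be registered by hand.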

Start the proxy pool and API

cd ProxyPool
python3 run.py

Run Scrapy

cd douban
python3 run.py

Getting the results

Film data is stored in the film collection of the MongoDB database named douban. A sample record looks like this:

{
    "_id" : ObjectId("5bb96351fd21815bdbe90124"),
    "id" : "24719063",
    "title" : "烈日灼心",
    "year" : "2015",
    "region" : [ "中国大陆"],
    "language" : [ "汉语普通话"],
    "director" : [ "曹保平"],
    "type" : [ "剧情", "悬疑", "犯罪"],
    "actor" : [ "邓超", "段奕宏", "郭涛", "王珞丹", "吕颂贤", "高虎", "白柳汐", "杜志国"],
    "date" : [ "2015-08-27(中国大陆)", "2015-06-15(上海电影节)"],
    "runtime" : [ "139分钟"],
    "rate" : "7.9",
    "rating_num" : "290209"
}
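Because the record above stores numbers as strings, a small post-processing step can help before sorting or aggregating. The sketch below assumes the field layout shown in the sample; `normalize_film`, `parse_minutes`, and the `runtime_minutes` key are hypothetical names introduced for illustration.

```python
import re


def parse_minutes(text):
    """Return the leading integer of a runtime string such as '139分钟'."""
    match = re.match(r"\d+", text)
    return int(match.group()) if match else None


def normalize_film(doc):
    """Return a copy of a film document with numeric fields converted."""
    film = dict(doc)
    film["year"] = int(doc["year"])
    film["rate"] = float(doc["rate"])
    film["rating_num"] = int(doc["rating_num"])
    film["runtime_minutes"] = [parse_minutes(r) for r in doc["runtime"]]
    return film


# Subset of the sample record shown above.
sample = {
    "id": "24719063",
    "title": "烈日灼心",
    "year": "2015",
    "runtime": ["139分钟"],
    "rate": "7.9",
    "rating_num": "290209",
}
film = normalize_film(sample)
```

Assuming pymongo is installed and MongoDB runs on the default local port, a record could be fetched with `MongoClient()["douban"]["film"].find_one({"id": "24719063"})` and passed to `normalize_film`.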

Stars: ✭ 25