All Projects → yeungsk → douban-spider

yeungsk / douban-spider

Licence: other
基于Scrapy框架的豆瓣电影爬虫

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to douban-spider

163Music
163music spider by scrapy.
Stars: ✭ 60 (+140%)
Mutual labels:  spider, scrapy
OpenScraper
An open source webapp for scraping: towards a public service for webscraping
Stars: ✭ 80 (+220%)
Mutual labels:  spider, scrapy
elves
🎊 Design and implement of lightweight crawler framework.
Stars: ✭ 322 (+1188%)
Mutual labels:  spider, scrapy
small-spider-project
日常爬虫
Stars: ✭ 14 (-44%)
Mutual labels:  spider, scrapy
Scrapy-Spiders
一个基于Scrapy的数据采集爬虫代码库
Stars: ✭ 34 (+36%)
Mutual labels:  spider, scrapy
NScrapy
NScrapy is a .net core corss platform Distributed Spider Framework which provide an easy way to write your own Spider
Stars: ✭ 88 (+252%)
Mutual labels:  spider, scrapy
photo-spider-scrapy
10 photo website spiders, 10 个国外图库的 scrapy 爬虫代码
Stars: ✭ 17 (-32%)
Mutual labels:  spider, scrapy
Spiderkeeper
admin ui for scrapy/open source scrapinghub
Stars: ✭ 2,562 (+10148%)
Mutual labels:  spider, scrapy
python-fxxk-spider
收集各种免费的 Python 爬虫项目
Stars: ✭ 184 (+636%)
Mutual labels:  spider, scrapy
python-spider
python爬虫小项目【持续更新】【笔趣阁小说下载、Tweet数据抓取、天气查询、网易云音乐逆向、天天基金网查询、微博数据抓取(生成cookie)、有道翻译逆向、企查查免登陆爬虫、大众点评svg加密破解、B站用户爬虫、拉钩免登录爬虫、自如租房字体加密、知乎问答
Stars: ✭ 45 (+80%)
Mutual labels:  spider, scrapy
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (+140%)
Mutual labels:  spider, scrapy
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-12%)
Mutual labels:  spider, scrapy
scrapy helper
Dynamic configurable crawl (动态可配置化爬虫)
Stars: ✭ 84 (+236%)
Mutual labels:  spider, scrapy
devsearch
A web search engine built with Python which uses TF-IDF and PageRank to sort search results.
Stars: ✭ 52 (+108%)
Mutual labels:  spider, scrapy
Spider job
招聘网数据爬虫
Stars: ✭ 234 (+836%)
Mutual labels:  spider, scrapy
Scrapy IPProxyPool
免费 IP 代理池。Scrapy 爬虫框架插件
Stars: ✭ 100 (+300%)
Mutual labels:  spider, scrapy
Py Elasticsearch Django
基于python语言开发的千万级别搜索引擎
Stars: ✭ 207 (+728%)
Mutual labels:  spider, scrapy
Gerapy
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Stars: ✭ 2,601 (+10304%)
Mutual labels:  spider, scrapy
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (+52%)
Mutual labels:  spider, scrapy
V2EX Spider
V2EX爬虫
Stars: ✭ 21 (-16%)
Mutual labels:  spider, scrapy

豆瓣电影爬虫

使用Scrapy框架爬取豆瓣电影

项目介绍

豆瓣选影视页面分别筛选地区为中国大陆、香港、台湾(可更换为其他地区),构造Ajax请求,获取电影id,再通过id构造电影链接,解析页面后获得电影详细数据,如名称、年份、导演、主演、类型等。具体可见我的博文:爬虫实战(一)利用scrapy爬取豆瓣华语电影

安装

安装Python

至少Python3.5以上

安装Redis和Mongo

安装好之后将Redis和Mongo服务开启

安装依赖

pip3 install -r requirements.txt

运行

配置代理池

cd ProxyPool
cd proxypool

进入ProxyPool的proxypool目录,修改settings.py文件

PASSWORD为Redis密码,如果为空,则设置为None

目前默认的代理为免费代理,如需添加代理,请在crawler.py的Crawler下添加以crawl_开头的函数。

打开代理池和API

cd ProxyPool
python3 run.py

运行scrapy

cd douban
python3 run.py

获取结果

电影数据存储在MongoDB中名为douban数据库的film表中,数据结果如下:

{
    "_id" : ObjectId("5bb96351fd21815bdbe90124"),
    "id" : "24719063",
    "title" : "烈日灼心",
    "year" : "2015",
    "region" : [ "中国大陆"],
    "language" : [ "汉语普通话"],
    "director" : [ "曹保平"],
    "type" : [ "剧情", "悬疑", "犯罪"],
    "actor" : [ "邓超", "段奕宏", "郭涛", "王珞丹", "吕颂贤", "高虎", "白柳汐", "杜志国"],
    "date" : [ "2015-08-27(中国大陆)", "2015-06-15(上海电影节)"],
    "runtime" : [ "139分钟"],
    "rate" : "7.9",
    "rating_num" : "290209"
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].