
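Source code for the book 《数据采集从入门到放弃》 ("Data Collection: From Getting Started to Giving Up"). Contents: introduction to crawlers, the job market, and crawler-engineer interview questions; introduction to the HTTP protocol; using Requests; introduction to the XPath parser; MongoDB and MySQL; multithreaded crawlers; introduction to Scrapy; introduction to Scrapy-redis; deploying with docker; managing a docker cluster with nomad; querying docker logs with EFK.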

Stars: ✭ 118 (-84.01%)

Mutual labels: scrapy, mongodb

Scrapy demo

all kinds of scrapy demo

Stars: ✭ 128 (-82.66%)

Mutual labels: scrapy, mongodb

Python Spider

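Douban Movie Top 250; scraping JSON data and pictures of women from Douyu; Taobao; Youyuan; using CrawlSpider to scrape some basic profile information from the Hongniang matchmaking site, plus distributed crawling of Hongniang with storage in redis; small crawler demos; Selenium; scraping Duodian (Dmall); building an API with django; scraping Youyuan profile information; simulated login to Zhihu, GitHub and Tuchong; scraping the entire Duodian store site; scraping the article history of WeChat official accounts; scraping articles shared in WeChat groups or by WeChat friends; using itchat to monitor articles shared by specified WeChat official accounts.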

Stars: ✭ 615 (-16.67%)

Mutual labels: scrapy, mongodb

Restheart

RESTHeart - The REST API for MongoDB

Stars: ✭ 659 (-10.7%)

Mutual labels: mongodb

Spring Boot Examples

About learning Spring Boot via examples: Spring Boot tutorials, sample code for the technology stack, and quick, easy getting-started guides.

Stars: ✭ 26,812 (+3533.06%)

Mutual labels: mongodb

Laravel Mongodb

A MongoDB based Eloquent model and Query builder for Laravel (Moloquent)

Stars: ✭ 5,860 (+694.04%)

Mutual labels: mongodb

Faster Than Requests

Faster requests on Python 3

Stars: ✭ 639 (-13.41%)

Mutual labels: scrapy

Rest Api Nodejs Mongodb

A boilerplate for REST API Development with Node.js, Express, and MongoDB

Stars: ✭ 672 (-8.94%)

Mutual labels: mongodb

Nextjs Mongodb App

A Next.js and MongoDB web application, designed with simplicity for learning and real-world applicability in mind.

Stars: ✭ 694 (-5.96%)

Mutual labels: mongodb

Microservices Event Sourcing

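Microservices Event Sourcing is an online shopping site with a microservice architecture, built with Spring Boot, Spring Cloud, Spring Reactor, OAuth2 and CQRS. It implements eventual consistency based on Event Sourcing and offers best practices for building end-to-end microservices.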

Stars: ✭ 657 (-10.98%)

Mutual labels: mongodb

Go Carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

Stars: ✭ 713 (-3.39%)

Mutual labels: graphite

Vchat

💘🍦🙈Vchat: building a social chat system from head to toe (vue + node + mongodb)

Stars: ✭ 724 (-1.9%)

Mutual labels: mongodb

React Vue Koa

Vue, React, WeChat mini programs, quick apps, TS, Koa and JS, all in one go

Stars: ✭ 710 (-3.79%)

Mutual labels: mongodb

Mevn Cli

Light speed setup for MEVN(Mongo Express Vue Node) Apps

Stars: ✭ 696 (-5.69%)

Mutual labels: mongodb

Zxw.framework.netcore

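A rapid development framework for DotNetCore based on the Code First mode of EF Core. It includes DBContext, the IoC components autofac and AspectCore.Injector, a code generator (DB First is also supported), AspectCore-based memcache and Redis caching components, an ICanPay-based payment library, and some everyday methods and extensions such as bulk insert, update and delete and trigger support, plus a demo, of course. Suggestions, feedback and PRs are all welcome.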

Stars: ✭ 691 (-6.37%)

Mutual labels: mongodb

Bifrost

Bifrost: heterogeneous middleware for production environments that syncs MySQL to Redis, MongoDB, ClickHouse, MySQL and other services

Stars: ✭ 701 (-5.01%)

Mutual labels: mongodb


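* Announcement
Because JD has updated its anti-crawling measures, the spiders in this repo may no longer be able to fetch content. I wrote this crawler in my third year of university. More than two years later I am working full time and have neither the time nor the energy to keep updating the counter-anti-crawling strategies, so I have stopped maintaining the project.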
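* Overview
A distributed JD crawler built with scrapy, scrapy-redis and graphite, with mongodb as the underlying storage. The distributed design removes bandwidth and performance bottlenecks and improves crawl efficiency. scrapy-redis handles URL deduplication and scheduling. Because redis is fast and easy to scale, high download throughput is easy to reach: when redis storage or access speed becomes the bottleneck, add more redis cluster nodes and more crawler nodes.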
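As a rough illustration of how scrapy-redis takes over scheduling and deduplication, the fragment below lists the settings that library documents for this purpose. It is a sketch, not this project's actual ~settings.py~; the redis URL in particular is an assumed local default.

```python
# Sketch of a scrapy-redis setup (not copied from this project's settings.py).

# Store the request queue in redis so every crawler node shares one schedule.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through a redis set shared by all nodes.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the seen-set between runs instead of clearing them.
SCHEDULER_PERSIST = True

# Assumption: redis runs locally on the default port.
REDIS_URL = "redis://localhost:6379"
```

Every crawler node that points at the same redis instance then pulls from one shared queue, which is what lets the crawl scale by adding nodes.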
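* Version support
Both Py2 and Py3 are now supported. Note that, to stay compatible with Py2, Graphite is disabled by default. To enable it you need Py3 and must change the ~ENABLE_GRAPHITE~ field in settings.py, which defaults to False.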
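* Crawl strategy
The spider takes the url values found in a page's tags and crawls them iteratively, restricting urls to the ~xxx.jd.com~ range to prevent unbounded breadth. When crawling a product page, it also crawls the other variants of the same product, for example 32G iPhone, 64G iPhone and 128G iPhone.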
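The domain restriction can be sketched with the standard library alone. The function names here are illustrative, and the sketch assumes that "xxx.jd.com" means jd.com plus any of its subdomains; in Scrapy itself the same effect is usually achieved with the spider's allowed_domains attribute.

```python
from urllib.parse import urlparse

# Assumption: "xxx.jd.com" in the text means jd.com and any of its subdomains.
ALLOWED_DOMAIN = "jd.com"


def is_allowed(url, allowed_domain=ALLOWED_DOMAIN):
    """Return True if url is an http(s) URL on allowed_domain or a subdomain."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Compare on a label boundary so "notjd.com" is not accepted.
    return host == allowed_domain or host.endswith("." + allowed_domain)


def filter_links(urls, allowed_domain=ALLOWED_DOMAIN):
    """Keep only the URLs that stay inside the allowed domain."""
    return [u for u in urls if is_allowed(u, allowed_domain)]
```

Matching on ".jd.com" rather than a plain substring keeps look-alike hosts such as jd.com.evil.example out of the crawl.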
* Request deduplication
Requests are deduplicated with scrapy_redis.dupefilter.RFPDupeFilter. The logic that puts requests on the queue is [[https://github.com/rmax/scrapy-redis/blob/31c022dd145654cb4ea1429f09852a82afa0a01c/src/scrapy_redis/scheduler.py#L153][enqueue_request]], and the actual deduplication logic calls [[https://github.com/scrapy/scrapy/blob/acd2b8d43b5ebec7ffd364b6f335427041a0b98d/scrapy/utils/request.py#L19][scrapy.utils.request.request_fingerprint]].
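The idea behind fingerprint-based deduplication can be shown without Scrapy or redis. The sketch below is a simplified stand-in, not Scrapy's real fingerprint function: it hashes the method, a canonicalized URL and the body, and uses an in-memory set where the real filter uses a redis set shared by all nodes. The names fingerprint and SeenFilter are made up for this example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url, body=b""):
    """Hash method + canonical URL + body (simplified stand-in for Scrapy's fingerprint)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    canonical_query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    # Drop the fragment: it never reaches the server.
    canonical_url = urlunsplit((scheme, netloc.lower(), path, canonical_query, ""))
    digest = hashlib.sha1()
    digest.update(method.upper().encode())
    digest.update(canonical_url.encode())
    digest.update(body)
    return digest.hexdigest()


class SeenFilter:
    """In-memory dupefilter; the set plays the role of the shared redis set."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, method, url, body=b""):
        """Return True if an equivalent request was already recorded."""
        fp = fingerprint(method, url, body)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

A scheduler would call request_seen before enqueueing and drop the request when it returns True.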
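* Product deduplication
Products are deduplicated with Redis. Each product's sku-id is stored in Redis, and before a complete product record is inserted into Mongodb, the crawler first checks whether its sku-id already exists in Redis.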
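A minimal sketch of that check, under stated assumptions: the key name jd:sku_ids and the item field sku_id are invented for the example, an in-memory class stands in for the Redis client, and a plain list stands in for the Mongodb collection. It relies on the fact that the Redis SADD command reports how many members were newly added, so one call both checks and records the sku-id.

```python
class InMemorySet:
    """Stand-in for a Redis client; mimics SADD's return value (1 = new, 0 = present)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def store_if_new(client, collection, item, key="jd:sku_ids"):
    """Store item only if its sku-id has not been seen before.

    `key` and the "sku_id" field are hypothetical names for this sketch.
    """
    if client.sadd(key, item["sku_id"]) == 0:
        return False  # duplicate: skip the database insert
    collection.append(item)  # with pymongo this would be collection.insert_one(item)
    return True
```

Using the return value of SADD instead of a separate existence check avoids a race between two crawler nodes that see the same product at the same time.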
* Counter-anti-crawling strategies
** Disable cookies
With cookies disabled, the server cannot use cookies to tell whether the crawler has visited the site before.
** Pose as a search engine
The crawler can pose as a search engine by changing its user-agent:
#+BEGIN_SRC
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
'Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)',
'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
'DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)',
'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)',
'ia_archiver (+http://www.alexa.com/site/help/webmasters; [email protected])',
#+END_SRC
** Rotate the user-agent
To improve the odds of getting past anti-crawling measures, several user-agents are defined and one is chosen at random for every request. This crawler implements a ~RotateUserAgentMiddleware~ class to rotate the user-agent.
** Proxy IPs
Use proxy IPs to keep your own IP from being banned.
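User-agent rotation fits naturally into a Scrapy downloader middleware. The class below is a hypothetical sketch of the idea, not the project's own ~RotateUserAgentMiddleware~ (whose code is not shown here). DummyRequest exists only so the sketch runs without Scrapy installed.

```python
import random

# A few of the search-engine user-agents listed above.
USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: sets a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request


class DummyRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url
        self.headers = {}
```

In a real project the class would be registered in the DOWNLOADER_MIDDLEWARES setting under its own module path.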
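* Crawler status monitoring
Crawler stats (number of requests, number of items downloaded, number of dropped items, logs) are saved to redis by a stats collector written for the distributed setup, and the results are displayed dynamically, in real time, as graphite charts.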
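To make the graphite side concrete, the sketch below turns a stats dictionary into Carbon's plaintext format, one "path value timestamp" line per metric, and sends it over TCP. It is an illustration only: the metric prefix scrapy.jd and both function names are assumptions, and the project's real collector also keeps the stats in redis, which is left out here.

```python
import socket
import time


def format_metrics(stats, prefix="scrapy.jd", timestamp=None):
    """Render numeric stats as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time() if timestamp is None else timestamp)
    lines = []
    for key, value in sorted(stats.items()):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # Graphite stores numbers only; skip strings, dates, etc.
        # Scrapy stat keys use "/" as a separator; Graphite paths use ".".
        path = "{}.{}".format(prefix, key.replace("/", "."))
        lines.append("{} {} {}".format(path, value, ts))
    return "\n".join(lines) + "\n" if lines else ""


def send_metrics(payload, host="localhost", port=2003):
    """Send the payload to Carbon's plaintext listener (2003 is its usual port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))
```

Calling format_metrics on the collector's stats at a fixed interval and passing the result to send_metrics is enough for graphite to draw the series.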
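* Concurrency and depth control
The ~CONCURRENT_REQUESTS = 32~ setting in ~settings.py~ controls the number of concurrent requests, and the ~DEPTH_LIMIT=max~ parameter of the ~DepthMiddleware~ class controls the crawler's recursion depth.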
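The depth rule itself is small enough to state as code. This is a sketch of the check as Scrapy's documentation describes it (a limit of 0 means no limit), written as a standalone function with a made-up name rather than as the middleware itself.

```python
def should_follow(parent_depth, depth_limit):
    """Return True if a link found on a page at parent_depth may be followed."""
    if depth_limit <= 0:
        return True  # Scrapy treats DEPTH_LIMIT = 0 as "no limit"
    # The child page sits one level deeper than the page that linked to it.
    return parent_depth + 1 <= depth_limit
```

With a limit of 5, links found on a depth-4 page are still followed, while links found on a depth-5 page are dropped.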
* Dependencies
- python 3.5+
- scrapy
- scrapy-redis
- pymongo
- graphite (optional)
* How to run
#+BEGIN_SRC sh
git clone https://github.com/samrayleung/jd_spider.git
#+END_SRC
Then install the python dependencies:
#+BEGIN_SRC sh
(sudo) pip install -r requirements.txt
#+END_SRC
** Install Graphite (optional)
*** Docker install
Install and configure graphite. Note that graphite only runs on Linux and is very tedious to install, so installing it with docker is strongly recommended. I modified a few configuration files of the [[https://github.com/hopsoft/docker-graphite-statsd][docker-graphite-statsd]] graphite image to suit scrapy. Run the following command to pull and run the image:
#+BEGIN_SRC sh
sudo docker run -d \
  --name graphite \
  --restart=always \
  -p 80:80 \
  -p 2003-2004:2003-2004 \
  -p 2023-2024:2023-2024 \
  -p 8125:8125/udp \
  -p 8126:8126 \
  samrayleung/graphite-statsd
#+END_SRC
You can then open the [[http://localhost/dashboard][dashboard]] in a browser, or log in to the admin interface at [[http://localhost/account/login]]. The default credentials are:
- username: root
- password: root
*** Manual install
Of course, you can also set up graphite yourself. Once graphite is configured successfully, a few settings need to be changed:
- In the file ~/opt/graphite/webapp/content/js/composer_widgets.js~, change the ~interval~ variable in the ~toggleAutoRefresh~ function from 60 to 1.
- Add the following to the ~storage-aggregation.conf~ configuration file:
#+BEGIN_SRC
[scrapy_min]
pattern = ^scrapy..*_min$
xFilesFactor = 0.1
aggregationMethod = min

[scrapy_max]
pattern = ^scrapy..*_max$
xFilesFactor = 0.1
aggregationMethod = max

[scrapy_sum]
pattern = ^scrapy..*_count$
xFilesFactor = 0.1
aggregationMethod = sum
#+END_SRC
The ~storage-aggregation.conf~ file is usually located in ~/opt/graphite/conf~.
** Run
Once everything is ready, you can run the crawler. Change into the jd directory:
#+BEGIN_SRC sh
scrapy crawl jindong
#+END_SRC
** Notes
Note that this project contains two spiders. Product information must be crawled before product comments, because comments can only be crawled once the product information is available.
** Proxy IPs
Product information can be crawled without proxy IPs, but crawling may stop working after a while, so proxy IPs should be added. Save them to a text file in the form http://ip:port, one per line, then set the path in ~settings.py~:
#+BEGIN_SRC python
PROXY_LIST = 'path/to/proxy_ip.txt'
#+END_SRC
Then uncomment the following settings:
#+BEGIN_SRC python
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_MODE = 0
#+END_SRC
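Before starting a long crawl it can help to sanity-check the proxy file. The helper below is a sketch with made-up names, not part of scrapy-proxies: it assumes the simple scheme://host:port form described above (no credentials) and keeps only lines that match it.

```python
import random
import re

# Assumption: entries look like http://host:port or https://host:port.
PROXY_LINE = re.compile(r"^https?://[^\s:/]+:\d+$")


def parse_proxy_lines(lines):
    """Return well-formed proxies, skipping blanks, comments and malformed lines."""
    proxies = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if PROXY_LINE.match(line):
            proxies.append(line)
    return proxies


def load_proxies(path):
    """Read a proxy list file with one proxy per line."""
    with open(path, encoding="utf-8") as handle:
        return parse_proxy_lines(handle)


def pick_proxy(proxies, rng=random):
    """Pick a random proxy, or None when the list is empty."""
    return rng.choice(proxies) if proxies else None
```

Running load_proxies on the file named in PROXY_LIST and checking that the result is not empty catches typos before the spider starts receiving connection errors.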
* Screenshots
** Graphite monitoring

[[./images/jd_comment_graphite1.png]]

[[./images/jd_comment_graphite2.png]]
** Comments
[[./images/jd_comment.png]]
** Comment summary
[[./images/jd_comment_summary.png]]
** Product information
[[./images/jd_parameters.png]]
** Todo
** Done Optimize the product deduplication strategy
CLOSED: [2018-03-09 Fri 21:16]
Issue: fixes [[https://github.com/samrayleung/jd_spider/issues/6][crawling duplicate products]]
** Todo Optimize the crawl strategy
** Todo Add a new parsing strategy
Issue: fixes [[https://github.com/samrayleung/jd_spider/issues/10][parse book item error]]
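* ChangeLog
** 2018-9-30
- Added Pipenv support
- Added py2 support
- Graphite is disabled by default
- Changed the spiders back to inheriting from ~RedisSpider~
- Fixed packages that Github flagged as potentially vulnerable
- JD's anti-crawling measures seem noticeably stronger: after a short test crawl, the IP was banned quickly
- This should be the last update; I will not put any more effort into this crawler project
** 2018-4-4
- Made Graphite optional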
* References and acknowledgements
- [[https://github.com/noplay/scrapy-graphite]]
- [[https://github.com/gnemoug/distribute_crawler]]
- [[https://github.com/hopsoft/docker-graphite-statsd]]
- [[https://github.com/aivarsk/scrapy-proxies]]


Stars: ✭ 738
