
michael-yin / Scrapy_guru

Licence: gpl-3.0
Everybody can be a Scrapy guru

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives to or similar to Scrapy_guru

Machine learning lectures
Collection of lectures and lab lectures on machine learning and deep learning. Lab practices in Python and TensorFlow.
Stars: ✭ 118 (-18.62%)
Mutual labels:  tutorials
Git Cheats
Git Cheats - Interactive Cheatsheet For Git Commands
Stars: ✭ 124 (-14.48%)
Mutual labels:  tutorials
Norrisbot
a Slack bot that kicks asses (roundhouse-kicks to be accurate...)
Stars: ✭ 134 (-7.59%)
Mutual labels:  tutorials
Spider Platform
A visual platform for automated spider data collection
Stars: ✭ 119 (-17.93%)
Mutual labels:  scrapy
Python Tutorial
🏃 Some Python tutorials - "Python Study Notes"
Stars: ✭ 122 (-15.86%)
Mutual labels:  scrapy
Dialogue.moe
Stars: ✭ 127 (-12.41%)
Mutual labels:  scrapy
Seleniumcrawler
An example using Selenium webdrivers for python and Scrapy framework to create a web scraper to crawl an ASP site
Stars: ✭ 117 (-19.31%)
Mutual labels:  scrapy
Best Android Tutorials
Best Free Android Tutorials By MindOrks
Stars: ✭ 144 (-0.69%)
Mutual labels:  tutorials
Crawlab Lite
Lite version of Crawlab, a lightweight spider management platform
Stars: ✭ 122 (-15.86%)
Mutual labels:  scrapy
Scrapy demo
All kinds of Scrapy demos
Stars: ✭ 128 (-11.72%)
Mutual labels:  scrapy
2018 19 Classes
https://cc-mnnit.github.io/2018-19-Classes/ - 🎒 💻 Material for Computer Club Classes
Stars: ✭ 119 (-17.93%)
Mutual labels:  tutorials
Qqmusicspider
A Scrapy-based QQ Music spider that crawls song info, lyrics, top comments, etc., and shares a corpus of 490,000+ songs by the top 6,400 mainland, Hong Kong, and Taiwan artists on QQ Music
Stars: ✭ 120 (-17.24%)
Mutual labels:  scrapy
Kubernetes Ops
Running Kubernetes in production
Stars: ✭ 127 (-12.41%)
Mutual labels:  tutorials
Awesome Gsoc Roadmap
A comprehensive curated list of available GSOC 2020 Guides, Write-ups and Tutorials 🤠 🏆
Stars: ✭ 119 (-17.93%)
Mutual labels:  tutorials
Pigat
pigat (Passive Intelligence Gathering Aggregation Tool), a tool for aggregating passively gathered intelligence
Stars: ✭ 140 (-3.45%)
Mutual labels:  scrapy
Docs
Source code for "Data Scraping from Beginner to Giving Up". Contents: an introduction to spiders, the job market, and spider engineer interview questions; the HTTP protocol; using Requests; the XPath parser; MongoDB and MySQL; multithreaded spiders; an introduction to Scrapy; Scrapy-redis; deploying with Docker; managing a Docker cluster with Nomad; querying Docker logs with EFK
Stars: ✭ 118 (-18.62%)
Mutual labels:  scrapy
Soul Manga
A single-page-application manga website built with React, Flask, and Scrapy
Stars: ✭ 126 (-13.1%)
Mutual labels:  scrapy
Jobspiders
Uses the Scrapy framework to crawl 51job (scrapy.Spider), Zhilian Zhaopin (via its API), and Lagou (CrawlSpider)
Stars: ✭ 144 (-0.69%)
Mutual labels:  scrapy
Imageprocessing
MicaSense RedEdge and Altum image processing tutorials
Stars: ✭ 139 (-4.14%)
Mutual labels:  tutorials
Feapder
feapder is a Python spider framework with support for distributed crawling, batch collection, task-loss prevention, and rich alerting
Stars: ✭ 110 (-24.14%)
Mutual labels:  scrapy

This project is deprecated; it has been merged into `Scrapy Tutorial Series: Web Scraping Using Python <https://blog.michaelyin.info/scrapy-tutorial-series-web-scraping-using-python/?utm_source=github&utm_medium=website&utm_campaign=scrapy_guru>`_.

Intro
=====

`Click here for the Chinese version <https://github.com/michael-yin/scrapy_guru/blob/master/readme.zh.rst>`_

This project contains:

  1. A list of tasks covering many basic points of spider development; each task is a short exercise. Solving the simple tasks step by step prepares you to solve real, complex problems. The idea derives from the `code kata <https://en.wikipedia.org/wiki/Kata_(programming)>`_.

  2. Advanced tips and notes that help you improve your development productivity and introduce some great tools.


A supplement to the Scrapy doc, not an alternative
==================================================

The Scrapy doc is a good starting point for anyone who wants to learn to write spiders with Scrapy. But because it mainly focuses on Scrapy's components and concepts, some points that matter in day-to-day spider development are missing from it. That is why I created this project.

This doc does not say much about Scrapy's components. It is strongly recommended to read the `Scrapy official doc <https://doc.scrapy.org/en/latest/index.html>`_ first to gain a basic understanding of things such as how to create and run a spider. You may not be able to follow some points here if you have no idea how spiders work in Scrapy, and if you have a question about Scrapy itself, please check the official doc first.
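
If you are completely new to Scrapy, the official tutorial boils down to spiders like the minimal sketch below (the names are hypothetical and example.com is a placeholder)::

import scrapy

class ExampleSpider(scrapy.Spider):
    # "name" is how you refer to the spider in "scrapy crawl <name>";
    # start_urls seeds the crawl.
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Extract the page title with an XPath query.
        yield {"title": response.xpath("//title/text()").extract_first()}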


Doc
===

http://scrapy-guru.readthedocs.io/en/latest/index.html


Support Platform
================

OS X, Linux; Python 2.7+ and Python 3.4+


Get started
===========

First, take a look at the project's workflow figure to understand how the project works, and read the `basic concepts <http://scrapy-guru.readthedocs.io/en/latest/#basic-concepts>`_ in the doc.

Second, choose a task in the project's online doc and get started; it is recommended to solve the tasks in doc order, considering the learning curve. Create the spider as the doc asks and run it to get the expected data. There is a sample spider called basic_extract in the project; follow it to create new ones and to troubleshoot, as sketched below. If you cannot make your spider work, you can also check the working spider code in the solution repo, which I will push later.
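
For example, working from the spider_project directory (the one containing scrapy.cfg), Scrapy's stock scrapy genspider command can generate a spider skeleton, which you then adapt to the taskid/entry convention shown in the First glance section below (my_task is a hypothetical name)::

cd spider_project
scrapy genspider my_task localhost   # writes a skeleton into the project's spiders/ package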

Third, you can find advanced advice and tips under the `advanced topic <http://scrapy-guru.readthedocs.io/en/latest/#advanced-topic>`_ section; for example, you can learn how to enhance your browser to make it more helpful in spider development.
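
One such tip worth knowing up front: Scrapy's built-in shell lets you try selectors interactively before putting them into a spider. A sketch, assuming the webapp described below is running on Django's default http://127.0.0.1:8000::

scrapy shell "http://127.0.0.1:8000/content/detail_basic"
>>> response.xpath("//div[@class='product-title']/text()").extract()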


Workflow
========

Please click the image for better resolution.

.. image:: http://scrapy-guru.readthedocs.io/en/latest/_images/scrapy_tuto.png
   :height: 600px
   :width: 800px


Project structure
=================

Here is the directory structure::

.
├── docs
│   ├── Makefile
│   ├── build
│   └── source
├── requirements.txt
├── spider_project
│   ├── release
│   ├── scrapy.cfg
│   └── spider_project
└── webapp
    ├── content
    ├── db.sqlite3
    ├── manage.py
    ├── staticfiles
    ├── templates
    └── webapp

• docs contains the HTML documentation of this project.
• webapp is a web application developed with Django; think of it as a website that shows product info and product links, from which we need to write spiders to extract data.
• spider_project is a Scrapy-based project in which we write those spiders to extract data from webapp.
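
To browse webapp locally, the standard Django workflow should apply (a sketch, assuming the default development settings and port)::

pip install -r requirements.txt
cd webapp
python manage.py runserver   # serves the site at http://127.0.0.1:8000/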

First glance
============

Here is an example product detail page; it is rendered by the webapp mentioned above.

.. image:: http://scrapy-guru.readthedocs.io/en/latest/_images/first_glance.png

Now, according to the `task <http://scrapy-guru.readthedocs.io/en/latest/tasks/basic_extract.html>`_ in the doc, we need to extract the product title and description from the product detail page.

Here is part of the spider code::

import scrapy

# SpiderProjectItem lives in the project's items module
# (assumed path: spider_project/spider_project/items.py).
from spider_project.items import SpiderProjectItem


class Basic_extractSpider(scrapy.Spider):
    taskid = "basic_extract"
    name = taskid                       # the spider name doubles as the task id
    entry = "content/detail_basic"      # path on the webapp to start crawling from

    def parse_entry_page(self, response):
        item = SpiderProjectItem()
        item["taskid"] = self.taskid
        data = {}
        # Pull the product title and the description list items via XPath.
        title = response.xpath("//div[@class='product-title']/text()").extract()
        desc = response.xpath("//section[@class='container product-info']//li/text()").extract()
        data["title"] = title
        data["desc"] = desc

        item["data"] = data
        yield item

We can run the spider now. It will start crawling from self.entry, and the scraped data will be checked automatically; if the scraped data has a mistake, the check reports the details of the error and helps you get the spider working as expected.
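
Since name is set to the task id, running the sample from the spider_project directory looks like this::

scrapy crawl basic_extract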


Keep going
==========

Read the doc of this project for more details and instructions:

http://scrapy-guru.readthedocs.io/en/latest/index.html

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].