
alltheplaces / Alltheplaces

Licence: other
A set of spiders and scrapers to extract location information from places that post their location on the internet.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Alltheplaces

python-fxxk-spider
A collection of various free Python crawler projects
Stars: ✭ 184 (-33.57%)
Mutual labels:  spider, scrapy
scrapy-admin
A django admin site for scrapy
Stars: ✭ 44 (-84.12%)
Mutual labels:  spider, scrapy
Scrapy-Spiders
A Scrapy-based data-collection crawler code library
Stars: ✭ 34 (-87.73%)
Mutual labels:  spider, scrapy
OpenScraper
An open source web app for scraping: towards a public service for web scraping
Stars: ✭ 80 (-71.12%)
Mutual labels:  spider, scrapy
ip proxy pool
Dynamically generates spiders to crawl and check free proxy IPs on the internet with Scrapy.
Stars: ✭ 39 (-85.92%)
Mutual labels:  spider, scrapy
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (-86.28%)
Mutual labels:  spider, scrapy
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-92.06%)
Mutual labels:  spider, scrapy
163Music
A 163 Music (NetEase Cloud Music) spider built with Scrapy.
Stars: ✭ 60 (-78.34%)
Mutual labels:  spider, scrapy
PttImageSpider
PTT image downloader (scrapes all images from a board and uses the post title as the folder name) (built with Scrapy)
Stars: ✭ 16 (-94.22%)
Mutual labels:  spider, scrapy
toutiao
A crawler for the Toutiao (Jinri Toutiao) tech news API
Stars: ✭ 17 (-93.86%)
Mutual labels:  spider, scrapy
photo-spider-scrapy
10 photo website spiders: Scrapy spider code for 10 international stock-photo sites
Stars: ✭ 17 (-93.86%)
Mutual labels:  spider, scrapy
Tieba spider
Baidu Tieba crawler (based on Scrapy and MySQL)
Stars: ✭ 257 (-7.22%)
Mutual labels:  spider, scrapy
Scrapy IPProxyPool
Free IP proxy pool. A plugin for the Scrapy crawling framework
Stars: ✭ 100 (-63.9%)
Mutual labels:  spider, scrapy
python-spider
Small Python crawler projects [continuously updated]: Biquge novel downloading, Tweet data scraping, weather lookup, NetEase Cloud Music reverse engineering, Tiantian Fund queries, Weibo data scraping (cookie generation), Youdao Translate reverse engineering, Qichacha login-free crawler, Dianping SVG encryption cracking, Bilibili user crawler, Lagou login-free crawler, Ziroom rental font encryption, Zhihu Q&A
Stars: ✭ 45 (-83.75%)
Mutual labels:  spider, scrapy
elves
🎊 Design and implementation of a lightweight crawler framework.
Stars: ✭ 322 (+16.25%)
Mutual labels:  spider, scrapy
V2EX Spider
A V2EX crawler
Stars: ✭ 21 (-92.42%)
Mutual labels:  spider, scrapy
NScrapy
NScrapy is a .NET Core cross-platform distributed spider framework which provides an easy way to write your own spider
Stars: ✭ 88 (-68.23%)
Mutual labels:  spider, scrapy
devsearch
A web search engine built with Python which uses TF-IDF and PageRank to sort search results.
Stars: ✭ 52 (-81.23%)
Mutual labels:  spider, scrapy
douban-spider
A Douban movie crawler based on the Scrapy framework
Stars: ✭ 25 (-90.97%)
Mutual labels:  spider, scrapy
Douban Crawler
A crawler for https://douban.com
Stars: ✭ 13 (-95.31%)
Mutual labels:  spider, scrapy

All the Places

Website

A project to extract GeoJSON from the web, focusing on websites that have 'store locator' pages, such as restaurants, gas stations, and retailers. Each chain has its own bit of software (a "spider") to extract useful information from its site. Each spider can be individually configured to throttle its request rate so that it acts as a good citizen on the Internet. The default User-Agent for the spiders can be found here, so websites wishing to prevent our spiders from accessing their data can block that User-Agent.

The project is built using scrapy, a Python-based web scraping framework. Each target website gets its own spider, which does the work of extracting interesting details about locations and outputting results in a useful format.

Adding a spider

To scrape a new website for locations, you'll want to create a new spider. You can copy an existing spider or start from scratch, but the result is always a Python class with a parse() method that yields GeojsonPointItems. The Scrapy framework does the work of outputting the GeoJSON based on the objects that the spider generates.

Development setup

To get started, you'll want to install the dependencies for this project.

  1. This project uses pipenv to handle dependencies and virtual environments. To get started, make sure you have pipenv installed (see the note just after this list if you need to install it).

  2. With pipenv installed, make sure you have the all-the-places repository checked out

    git clone git@github.com:alltheplaces/alltheplaces.git
    
  3. Then you can install the dependencies for the project

    cd alltheplaces
    pipenv install
    
  4. After dependencies are installed, make sure you can run the scrapy command without error

    pipenv run scrapy
    
  5. If pipenv run scrapy ran without complaining, then you have a functional scrapy setup and are ready to write a scraper.
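If pipenv isn't installed yet, it can usually be installed with pip (a minimal sketch; see the pipenv documentation for platform-specific instructions):

    pip install --user pipenv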

Create a new spider

  1. Create a new file in locations/spiders/ with this content:

    # -*- coding: utf-8 -*-
    import scrapy
    from locations.items import GeojsonPointItem
    
    class TemplateSpider(scrapy.Spider):
        name = "template"
        allowed_domains = ["www.sample.com"]
        start_urls = (
            'https://www.sample.com/locations/',
        )
    
        def parse(self, response):
            pass
    

    This blank/template spider will start at the given start_urls, only touch the domains listed in allowed_domains, and hand every web response to the parse() method via its response argument. Once you have the response content, you can perform various operations on it; the most useful is probably running XPath selections against the page's HTML to extract data (see the sketch just after this list). Check out the "Tips for writing a spider" section below for more information about how to use these tools to get data out of a page efficiently.

  2. Once you have your spider written, you can give it a test run to make sure it's finding the expected results.

    pipenv run scrapy crawl template
    

    The scrapy crawl template command runs a spider named template. If you changed the name of your spider, you should use the name you chose. By default, scrapy crawl does not save the output anywhere, but it does log the results of the spider operation fairly verbosely.

    To generate GeoJSON locally, you can enable a couple options during the crawl process to use the GeoJSON exporter and to specify the file to write it to:

    pipenv run scrapy crawl template \
      --output-format=geojson \
      --output=output.geojson
    
  3. Finally, make sure your parse() function is yielding GeojsonPointItems that contain the location and property data that you extract from the page:

    def parse(self, response):
        # latitude and longitude stand in for values extracted from the page
        yield GeojsonPointItem(
            lat=latitude,
            lon=longitude,
            addr_full="1234 Fifth Street",
            city="San Francisco",
            state="CA"
        )
    
  4. Once you have a spider that yields useful results, you can create a new branch and push it up to your fork to open a pull request. The build system will run your spider and post information about the results as a comment on your pull request.
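As a concrete illustration of the XPath approach mentioned in step 1, here is a rough sketch of a parse() method. The div[@class="location"] markup and the data-lat/data-lng attributes are hypothetical; a real site will need its own selectors:

    def parse(self, response):
        # Hypothetical markup: each location is a <div class="location">
        # carrying its coordinates in data-lat/data-lng attributes.
        for store in response.xpath('//div[@class="location"]'):
            yield GeojsonPointItem(
                lat=store.xpath('./@data-lat').get(),
                lon=store.xpath('./@data-lng').get(),
                addr_full=store.xpath('.//span[@class="address"]/text()').get(),
            )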

Tips for writing a spider

Prefer a directory of all locations

Most listings of locations come in two flavors: a "store finder" that lets the user search by location and a "store directory" that is a hierarchical listing of all locations. These listings are sometimes hidden in the footer or on the site map page. Keep an eye out for these, because it's a lot easier when the site enumerates all of its locations for you than it is to program a spider to search for them. Checking the domain's robots.txt file (http://<domain>/robots.txt) can also be useful for finding sitemaps.
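If the site publishes a sitemap (often listed in robots.txt), Scrapy's built-in SitemapSpider can do most of the crawling for you. A minimal sketch, assuming a made-up sitemap URL and a /store/ URL pattern:

    from scrapy.spiders import SitemapSpider

    class SampleSitemapSpider(SitemapSpider):
        name = "sample_sitemap"
        # Hypothetical sitemap URL, usually discoverable via /robots.txt.
        sitemap_urls = ["https://www.sample.com/sitemap.xml"]
        # Only send sitemap entries that look like store pages to parse_store().
        sitemap_rules = [(r"/store/", "parse_store")]

        def parse_store(self, response):
            # Extract location details from the store page here.
            pass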

If the only option is search by location, there is likely an AJAX query made to search by latitude/longitude. Keep an eye on your browser's developer tools "network" tab to see what the request is so you can replicate it in your spider.
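For example, if the network tab shows the store finder calling a JSON endpoint, the spider can request that endpoint directly and read the JSON instead of scraping HTML. A sketch with a made-up endpoint and assumed response shape (response.json() requires a reasonably recent Scrapy; json.loads(response.text) works otherwise):

    import scrapy
    from locations.items import GeojsonPointItem

    class AjaxSearchSpider(scrapy.Spider):
        name = "ajax_search_example"
        # Hypothetical endpoint copied from the browser's network tab.
        start_urls = [
            "https://www.sample.com/api/stores?lat=37.77&lng=-122.42&radius=100",
        ]

        def parse(self, response):
            # Field names here are assumptions about the JSON payload.
            for store in response.json().get("stores", []):
                yield GeojsonPointItem(
                    lat=store["latitude"],
                    lon=store["longitude"],
                    addr_full=store["address"],
                )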

Searchable Points

For store locators that do allow searches by latitude/longitude, a grid of searchable latlon points is available for the US, CA, and Europe here. Each point represents the centroid of a search where the radius distance is indicated in the file name. See the Dollar General scraper for an example of how you might utilize them for national searches.

For stores that do not have a national footprint (e.g. #1034), there are separate point files that include a state/territory attribute e.g. 'us_centroids_100mile_radius_state.csv'. This allows for points to be filtered down to specific states/territories when a national search is unnecessary.

Note: A search radius may overlap multiple states, especially when it's centered near a state boundary. This creates a one-to-many relationship between the search radius point and the states covered in that search zone, which means the state files contain records that share the same lat/lon but are associated with different states. The same is true for the European and Canadian territory files.
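A sketch of how a spider might loop over one of these point files; the file path, CSV column names, and search URL below are assumptions, so check the Dollar General spider in this repository for a working example:

    import csv
    import scrapy

    class PointSearchSpider(scrapy.Spider):
        name = "point_search_example"

        def start_requests(self):
            # Assumed path and column names for the packaged point file.
            with open("./locations/searchable_points/us_centroids_100mile_radius.csv") as points:
                for point in csv.DictReader(points):
                    url = (
                        "https://www.sample.com/api/search"
                        f"?lat={point['latitude']}&lng={point['longitude']}&radius=100"
                    )
                    yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            pass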

You can send the spider to other pages

The simplest thing a spider can do is to load the start_urls, process the page, and yield the data as GeojsonPointItem objects from the parse() method. Usually that's not enough to get at useful data, though. The parse() method can also yield a Request object, which scrapy will use to add another URL to the request queue.

By default, the parse() method on the spider will be called with the response for the new request. In many cases it's easier to create a new function to parse the new page's content and pass that function in via the Request object's callback parameter like so:

yield scrapy.Request(
  response.urljoin(store_url.extract()),
  callback=self.parse_store
)

Since the next URL you want to request is usually pulled from an href in the page and relative to the page you're on, you can use the response.urljoin() method as a shortcut to build the URL for the next request.
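The callback is just another method on the spider that receives the response for the store page. A minimal sketch of parse_store(), with placeholder XPath expressions:

    def parse_store(self, response):
        # Placeholder selectors; swap in expressions that match the real page.
        yield GeojsonPointItem(
            lat=response.xpath('//meta[@itemprop="latitude"]/@content').get(),
            lon=response.xpath('//meta[@itemprop="longitude"]/@content').get(),
            addr_full=response.xpath('//span[@itemprop="streetAddress"]/text()').get(),
        )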

Using the scrapy shell

Instead of running scrapy crawl every time you want to try your spider, you can use the Scrapy shell to load a page and experiment with XPath queries. Once you're happy with a query that extracts the data you want, you can use it in your spider. This is a whole lot easier than re-running the entire crawl every time you make a change to your spider.

To enter the shell, use scrapy shell http://example.com (replacing the URL with your own). It will drop you into a Python shell after requesting and parsing the page. Once in the shell, you can work with the response object just as you would in your spider. The shell also offers a shortcut function called fetch() that lets you pull up a different page.
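A short example session (the URL and selectors are illustrative only):

    $ pipenv run scrapy shell 'https://www.sample.com/locations/'
    >>> response.status
    >>> response.xpath('//a[contains(@href, "/locations/")]/@href').getall()
    >>> fetch('https://www.sample.com/locations/some-store/')
    >>> response.xpath('//h1/text()').get()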

License

The data generated by our spiders is provided on our website and released under Creative Commons’ CC-0 waiver.

The spider software that produces this data (this repository) is licensed under the MIT license.
