A Simple Douban Information Crawler

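I. Crawling Douban Local Events (豆瓣同城)

Target URL: http://www.douban.com/location/xian/

Crawl all event information listed under Douban's local events section. Specific requirements:
1. Cover cities such as Xi'an, Beijing, Shanghai, and Wuhan;
2. Cover the local event categories: music, theater, lectures, gatherings, film, exhibitions, sports, charity, travel, and other;
3. For each event, collect: event name, event id, time, location, fee, type, organizer, interested users, and participants (for interested users, only the uid corresponding to each user name is needed).
Storage format requirements:
1. Each event is stored in its own txt file, named according to the pattern "placename_eventtype_eventid.txt";
2. Inside the file, each of the fields above occupies one line; the uids of participants and interested users go on a line of their own, with uids separated by commas (ASCII commas).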
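
The storage format above can be sketched in Python as follows. This is a minimal illustration and not the project's actual code: the `Event` class, its field names, the order of the lines, and the choice to put the interested-user uids and the participant uids on two separate lines are all assumptions, since the requirements describe the fields only in prose.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Event:
    # Hypothetical record type; field names are illustrative only.
    city: str
    name: str
    event_id: str
    time: str
    location: str
    fee: str
    category: str
    organizer: str
    interested_uids: List[str] = field(default_factory=list)
    participant_uids: List[str] = field(default_factory=list)


def _one_line(value: str) -> str:
    # Collapse internal whitespace so each field stays on a single line.
    return " ".join(value.split())


def event_filename(event: Event) -> str:
    # "placename_eventtype_eventid.txt", as required by the naming rule.
    return f"{event.city}_{event.category}_{event.event_id}.txt"


def format_event(event: Event) -> str:
    # One field per line; each uid list is one comma-separated line
    # (assumed order: interested users first, then participants).
    lines = [
        _one_line(event.name),
        _one_line(event.event_id),
        _one_line(event.time),
        _one_line(event.location),
        _one_line(event.fee),
        _one_line(event.category),
        _one_line(event.organizer),
        ",".join(event.interested_uids),
        ",".join(event.participant_uids),
    ]
    return "\n".join(lines) + "\n"


def write_event(event: Event, out_dir: Path) -> Path:
    # Write the event to its own txt file and return the file path.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / event_filename(event)
    path.write_text(format_event(event), encoding="utf-8")
    return path
```

Writing UTF-8 explicitly matters here because event names and locations on Douban are usually Chinese text.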

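II. Crawling Douban Online Activities

Target URL: http://www.douban.com/online/

Crawl all online activities (tags can be used as an index).
Specific requirements:
1. Collect the title and activity id of every online activity;
2. Obtain the uid of each activity's initiator and of all its participants;
3. Storage format requirements:
  1. All activities are stored in a single txt file;
  2. Each line represents one activity;
  3. Each line has at least four fields: the first is the activity id, the second is the activity name (with all punctuation removed), the third is the initiator's uid, and the fourth and later fields are participant uids;
  4. Fields are separated by commas (ASCII commas).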
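
A possible way to build one line of that file is sketched below. The function names are hypothetical. It assumes "punctuation" means every character in a Unicode punctuation category, which covers both ASCII and full-width Chinese punctuation; stripping punctuation from the title also ensures the title cannot contain a stray comma that would break the field layout.

```python
import unicodedata
from typing import Iterable


def strip_punctuation(text: str) -> str:
    # Drop every character whose Unicode category starts with "P"
    # (punctuation), e.g. "," "!" "，" "。"; letters, digits and spaces stay.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def format_activity_line(
    activity_id: str,
    title: str,
    initiator_uid: str,
    participant_uids: Iterable[str],
) -> str:
    # Field order: activity id, cleaned title, initiator uid, participants.
    fields = [activity_id, strip_punctuation(title), initiator_uid]
    fields.extend(participant_uids)
    return ",".join(fields)


if __name__ == "__main__":
    line = format_activity_line(
        "1001", "Photo-a-day: week 1!", "alice", ["bob", "carol"]
    )
    print(line)  # 1001,Photoaday week 1,alice,bob,carol
```

An activity with no participants would yield only three fields with this sketch; how to satisfy the "at least four fields" rule in that case is left open by the requirements.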

III. MORE

Douban has opened up API endpoints, so the APIs it provides can be used directly: Book API V2
