
hanxweb / Scrapy-SearchEngines

Licence: other
A Bing, Google, and Baidu search-engine crawler, built with Python 3.6 and Scrapy.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Scrapy-SearchEngines

Image Downloader
Download images from Google, Bing, and Baidu.
Stars: ✭ 1,173 (+4089.29%)
Mutual labels:  bing, scrapy, baidu
Ecommercecrawlers
Gitee repository: AJay13/ECommerceCrawlers; GitHub repository: DropsDevopsOrg/ECommerceCrawlers; project showcase: http://wechat.doonsec.com
Stars: ✭ 3,073 (+10875%)
Mutual labels:  scrapy, baidu
Xinahn Socket
An open-source, privacy-focused, self-hosted meta search engine. https://xinahn.com
Stars: ✭ 77 (+175%)
Mutual labels:  bing, baidu
Jsearch
jSearch (聚搜) is a content-focused Chrome search extension that aggregates results from multiple platforms in a single search.
Stars: ✭ 193 (+589.29%)
Mutual labels:  bing, baidu
Translators
🌏🌍🌎Translators🌎🌍🌏 is a Python library that aims to bring free, diverse, and enjoyable translation to individuals and students.
Stars: ✭ 295 (+953.57%)
Mutual labels:  bing, baidu
Sitedorks
Search Google/Bing/Ecosia/DuckDuckGo/Yandex/Yahoo for a search term with a default set of websites, bug bounty programs or a custom collection.
Stars: ✭ 221 (+689.29%)
Mutual labels:  bing, baidu
ArticleSpider
Crawls Zhihu, Jobbole, and Lagou with Scrapy and uses Elasticsearch + Django to build a search-engine website; see README_zh.md (covering the implementation roadmap, distributed crawling, and anti-crawling countermeasures).
Stars: ✭ 34 (+21.43%)
Mutual labels:  scrapy
scrapy-kafka-redis
Distributed crawling/scraping: Kafka- and Redis-based components for Scrapy.
Stars: ✭ 45 (+60.71%)
Mutual labels:  scrapy
ty-baidu-textcensor
🗑 Adds Baidu text-content moderation to Typecho, filtering sensitive content out of comments.
Stars: ✭ 42 (+50%)
Mutual labels:  baidu
Scrape-Finance-Data
My code for scraping financial data in Vietnam
Stars: ✭ 13 (-53.57%)
Mutual labels:  scrapy
bing-daily-photo
A simple PHP class to fetch Bing's photo of the day.
Stars: ✭ 34 (+21.43%)
Mutual labels:  bing
Raspagem-de-dados-para-iniciantes
Data scraping for beginners using Scrapy and other basic libraries.
Stars: ✭ 113 (+303.57%)
Mutual labels:  scrapy
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+185.71%)
Mutual labels:  scrapy
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+228.57%)
Mutual labels:  scrapy
bing-wallpaper
Python script that sets the daily www.bing.com picture as the desktop wallpaper.
Stars: ✭ 21 (-25%)
Mutual labels:  bing
mpapi
🐤 A mini-program API compatibility plugin: write once, run on multiple platforms. Supports WeChat, Alipay, Baidu Smart Program, and ByteDance mini-programs.
Stars: ✭ 40 (+42.86%)
Mutual labels:  baidu
easypoi
A simple, free, and efficient tool for collecting and analyzing Baidu Maps POI data.
Stars: ✭ 87 (+210.71%)
Mutual labels:  scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+339.29%)
Mutual labels:  scrapy
scrapy-mysql-pipeline
A Scrapy MySQL pipeline.
Stars: ✭ 47 (+67.86%)
Mutual labels:  scrapy
hupu spider
A crawler for Hupu's 步行街 (Buxingjie) forum.
Stars: ✭ 22 (-21.43%)
Mutual labels:  scrapy

seCrawler (Search Engine Crawler)

A Scrapy project that crawls search results from Google, Bing, and Baidu.

Adapted from https://github.com/xtt129/seCrawler, with minor changes for Python 3.6 compatibility.

Thanks to the original author for sharing.

prerequisites

Python 3.6 and Scrapy are required.
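
If Scrapy isn't installed yet, it can be installed with pip (this assumes a working Python 3.6 environment; the Scrapy version the project was tested against isn't specified):

pip install scrapy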

commands

Run one of the commands below to fetch 50 pages of search results for a keyword; the result URLs are saved to "urls.txt" in the current directory. A sketch of how such a spider might be structured follows the commands.

Bing:

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=bing -a pages=50

Baidu:

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=baidu -a pages=50

Google:

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=google -a pages=50
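
For orientation, a spider accepting these -a arguments could look roughly like the sketch below. This is an assumption based on the commands above, not the project's actual code; in particular the URL templates, the 10-results-per-page offset, and the link selector are hypothetical.

import scrapy

class KeywordSpider(scrapy.Spider):
    # The spider name matches the "scrapy crawl keywordSpider" commands above.
    name = "keywordSpider"

    def __init__(self, keyword=None, se="bing", pages=50, *args, **kwargs):
        # Scrapy passes each "-a name=value" argument as a keyword argument here.
        super().__init__(*args, **kwargs)
        self.keyword = keyword
        self.pages = int(pages)
        # Hypothetical query templates; the real project's URLs and
        # pagination parameters may differ.
        templates = {
            "bing": "https://www.bing.com/search?q={kw}&first={offset}",
            "baidu": "https://www.baidu.com/s?wd={kw}&pn={offset}",
            "google": "https://www.google.com/search?q={kw}&start={offset}",
        }
        self.url_template = templates[se]

    def start_requests(self):
        # One request per result page, assuming 10 results per page.
        for page in range(self.pages):
            url = self.url_template.format(kw=self.keyword, offset=page * 10)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Append every absolute link found on the result page to urls.txt,
        # mirroring the output file described above.
        with open("urls.txt", "a") as f:
            for href in response.css("a::attr(href)").getall():
                if href.startswith("http"):
                    f.write(href + "\n")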

limitation

The project doesn't provide any workaround for anti-spider measures such as CAPTCHAs or IP ban lists.

To reduce the chance of triggering them, we recommend setting DOWNLOAD_DELAY=10 in the settings.py file, which adds a delay (in seconds) between the crawl of two pages; see the Scrapy settings documentation for details.
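
As a minimal sketch, the relevant lines in settings.py could look like the following. DOWNLOAD_DELAY is the setting recommended above; RANDOMIZE_DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED are standard Scrapy settings shown here for context, not something the project prescribes.

# settings.py -- throttling to reduce the chance of CAPTCHAs and IP bans
DOWNLOAD_DELAY = 10               # wait 10 seconds between two page downloads
RANDOMIZE_DOWNLOAD_DELAY = True   # scale each delay by 0.5x-1.5x so timing looks less mechanical
AUTOTHROTTLE_ENABLED = True       # let Scrapy adapt the delay to the server's response times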

