xbynet / crawler

License: MIT License
A simple and flexible web crawler framework for Java.

Projects that are alternatives to or similar to crawler

Spider Flow
A next-generation crawler platform that defines crawler workflows graphically; build crawlers without writing code.
Stars: ✭ 365 (+1725%)
Mutual labels:  crawler, spider, jsoup
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+77575%)
Mutual labels:  crawler, spider
Jssoup
JavaScript + BeautifulSoup = JSSoup
Stars: ✭ 203 (+915%)
Mutual labels:  crawler, spider
Chromium for spider
dynamic crawler for web vulnerability scanner
Stars: ✭ 220 (+1000%)
Mutual labels:  crawler, spider
Ok ip proxy pool
🍿 Crawler proxy IP pool (proxy pool) in Python 🍟, a decent IP proxy pool.
Stars: ✭ 196 (+880%)
Mutual labels:  crawler, spider
Zhihuspider
Multi-threaded Zhihu user crawler, based on Python 3.
Stars: ✭ 201 (+905%)
Mutual labels:  crawler, spider
Jd mask robot
JD.com mask stock monitoring crawler (no Selenium): QR-code login, price checking, adding to cart, ordering, and flash purchases.
Stars: ✭ 216 (+980%)
Mutual labels:  crawler, spider
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (+830%)
Mutual labels:  crawler, spider
Awesome Java Crawler
A curated collection of crawler-related resources, mainly in Java.
Stars: ✭ 228 (+1040%)
Mutual labels:  crawler, jsoup
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (+1055%)
Mutual labels:  crawler, jsoup
Ppspider
A web spider built on Puppeteer; supports task queues and task scheduling via decorators, nedb/mongodb storage, data visualization, and user interaction.
Stars: ✭ 237 (+1085%)
Mutual labels:  crawler, spider
Fooproxy
A robust, efficient, score-based, target-specific IP proxy pool with an API service. Custom collectors can be plugged in to crawl proxy IPs, and a database of valid proxies is generated for each target site of your crawler. Supports MongoDB 4.0; uses Python 3.7. (Scored IP proxy pool; custom proxy data crawlers can be added anytime.)
Stars: ✭ 195 (+875%)
Mutual labels:  crawler, spider
Gecco
Easy-to-use, lightweight web crawler
Stars: ✭ 2,310 (+11450%)
Mutual labels:  crawler, jsoup
Querylist
🕷️ The elegant, progressive PHP crawler framework!
Stars: ✭ 2,392 (+11860%)
Mutual labels:  crawler, spider
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (+850%)
Mutual labels:  crawler, spider
Webvideobot
Web crawler.
Stars: ✭ 214 (+970%)
Mutual labels:  crawler, spider
Magic google
Google search results crawler; fetches the Google search results you need
Stars: ✭ 247 (+1135%)
Mutual labels:  crawler, spider
Zhihu Crawler People
A simple distributed crawler for Zhihu and data analysis
Stars: ✭ 182 (+810%)
Mutual labels:  crawler, spider
Lianjia Beike Spider
House-price crawler for Lianjia and Beike, collecting housing data (communities, second-hand homes, rentals, new homes) for 21 major Chinese cities including Beijing, Shanghai, Guangzhou, and Shenzhen. Stable, reliable, and fast; supports CSV, MySQL, MongoDB, Excel, and JSON storage; works with Python 2 and 3; includes charts and rich comments. For learning and reference only; not for commercial use.
Stars: ✭ 2,257 (+11185%)
Mutual labels:  crawler, spider
Laravel Crawler Detect
A Laravel wrapper for CrawlerDetect - the web crawler detection library
Stars: ✭ 227 (+1035%)
Mutual labels:  crawler, spider

crawler

A simple and flexible web crawler framework for Java.

Features:

1. The code is simple to understand and highly customizable
2. The API is simple and easy to use (see the quick-start sketch after this list)
3. Supports file download and partial (chunked) content fetching
4. Requests and responses offer rich options; each request is highly customizable
5. Supports custom actions before and after each network request in the downloader
6. Selenium + PhantomJS support
7. Redis support
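
For a first taste of the API, here is a minimal quick-start sketch. It uses only classes that appear in the full demo below (Processor, Response, Site, Spider, and the JsoupParser returned by resp.html()); the package of the core classes is an assumption inferred from the demo's subpackage imports.

import com.github.xbynet.crawler.Processor; // package assumed, see note above
import com.github.xbynet.crawler.Response;
import com.github.xbynet.crawler.Site;
import com.github.xbynet.crawler.Spider;

public class QuickStart extends Processor {
	@Override
	public void process(Response resp) {
		// resp.html() returns a JsoupParser; single(selector, "text")
		// extracts the text of the first match, as in the demo below
		System.out.println("title:" + resp.html().single("title", "text"));
	}

	public static void main(String[] args) {
		Spider.builder(new QuickStart()).threadNum(1).site(new Site())
				.urls("https://github.com/xbynet").build().run();
	}
}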

Future:

1. Complete code comments and tests

Install:

The only required module is crawler-core:

<dependency>
    <groupId>com.github.xbynet</groupId>
    <artifactId>crawler-core</artifactId>
    <version>0.3.0</version>
</dependency>

If you also want Selenium support, add:

<dependency>
    <groupId>com.github.xbynet</groupId>
    <artifactId>crawler-selenium</artifactId>
    <version>0.3.0</version>
</dependency>

The crawler-server module is currently experimental and still needs more work.

Demo:

import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;

// Core classes (Processor, Request, ...) are assumed to live in the root
// package, inferred from the subpackage imports below.
import com.github.xbynet.crawler.Const;
import com.github.xbynet.crawler.Processor;
import com.github.xbynet.crawler.Request;
import com.github.xbynet.crawler.RequestAction;
import com.github.xbynet.crawler.Response;
import com.github.xbynet.crawler.Site;
import com.github.xbynet.crawler.Spider;
import com.github.xbynet.crawler.http.DefaultDownloader;
import com.github.xbynet.crawler.http.FileDownloader;
import com.github.xbynet.crawler.http.HttpClientFactory;
import com.github.xbynet.crawler.parser.JsoupParser;
import com.github.xbynet.crawler.scheduler.DefaultScheduler;

public class GithubCrawler extends Processor {
	@Override
	public void process(Response resp) {
		String currentUrl = resp.getRequest().getUrl();
		System.out.println("CurrentUrl:" + currentUrl);
		int respCode = resp.getCode();
		System.out.println("ResponseCode:" + respCode);
		System.out.println("type:" + resp.getRespType().name());
		String contentType = resp.getContentType();
		System.out.println("ContentType:" + contentType);
		Map<String, List<String>> headers = resp.getHeaders();
		System.out.println("ResonseHeaders:");
		for (String key : headers.keySet()) {
			List<String> values=headers.get(key);
			for(String str:values){
				System.out.println(key + ":" +str);
			}
		}
		JsoupParser parser = resp.html();
		// Parted (chunked) fetching is supported; a parent response ties all
		// part responses together:
		// System.out.println("isParted:"+resp.isPartResponse());
		// Response parent=resp.getParentResponse();
		// resp.addPartRequest(null);
		//Map<String,Object> extras=resp.getRequest().getExtras();

		if (currentUrl.equals("https://github.com/xbynet")) {
			String avatar = parser.single("img.avatar", "src");
			String dir = System.getProperty("java.io.tmpdir");
			String savePath = Paths.get(dir, UUID.randomUUID().toString())
					.toString();
			boolean avatarDownloaded = download(avatar, savePath);
			System.out.println("avatar:" + avatar + ", saved:" + savePath);
			// System.out.println("avtar downloaded status:"+avatarDownloaded);
			String name = parser.single(".vcard-names > .vcard-fullname",
					"text");
			System.out.println("name:" + name);
			List<String> reponames = parser.list(
					".pinned-repos-list .repo.js-repo", "text");
			List<String> repoUrls = parser.list(
					".pinned-repo-item .d-block >a", "href");
			System.out.println("reponame:url");
			if (reponames != null) {
				for (int i = 0; i < reponames.size(); i++) {
					String tmpUrl="https://github.com"+repoUrls.get(i);
					System.out.println(reponames.get(i) + ":"+tmpUrl);
					Request req=new Request(tmpUrl).putExtra("name", reponames.get(i));
					resp.addRequest(req);
				}
			}
		}else{
			Map<String,Object> extras=resp.getRequest().getExtras();
			String name=extras.get("name").toString();
			System.out.println("repoName:"+name);
			String shortDesc=parser.single(".repository-meta-content","allText");
			System.out.println("shortDesc:"+shortDesc);
		}
	}

	public void start() {
		Site site = new Site();
		Spider spider = Spider.builder(this).threadNum(5).site(site)
				.urls("https://github.com/xbynet").build();
		spider.run();
	}
  
	public static void main(String[] args) {
		new GithubCrawler().start();
	}
  
  
	public void startCompleteConfig() {
		String pcUA = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
		String androidUA = "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36";

		Site site = new Site();
		site.setEncoding("UTF-8").setHeader("Referer", "https://github.com/")
				.setRetry(3).setRetrySleep(3000).setSleep(50).setTimeout(30000)
				.setUa(pcUA);

		Request request = new Request("https://github.com/xbynet");
		HttpClientContext ctx = new HttpClientContext();
		BasicCookieStore cookieStore = new BasicCookieStore();
		ctx.setCookieStore(cookieStore);
		request.setAction(new RequestAction() {
			@Override
			public void before(CloseableHttpClient client, HttpUriRequest req) {
				System.out.println("before-haha");
			}

			@Override
			public void after(CloseableHttpClient client,
					CloseableHttpResponse resp) {
				System.out.println("after-haha");
			}
		}).setCtx(ctx).setEncoding("UTF-8")
				.putExtra("somekey", "I can use in the response by your own")
				.setHeader("User-Agent", pcUA).setMethod(Const.HttpMethod.GET)
				.setPartRequest(null).setEntity(null)
				.setParams("appkeyqqqqqq", "1213131232141").setRetryCount(5)
				.setRetrySleepTime(10000);

		Spider spider = Spider.builder(this).threadNum(5)
				.name("Spider-github-xbynet")
				.defaultDownloader(new DefaultDownloader())
				.fileDownloader(new FileDownloader())
				.httpClientFactory(new HttpClientFactory()).ipProvider(null)
				.listener(null).pool(null).scheduler(new DefaultScheduler())
				.shutdownOnComplete(true).site(site).build();
		spider.run();
	}


}
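
The RequestAction hook shown in startCompleteConfig is the natural place for per-request side effects such as authentication. Below is a minimal sketch that attaches an Authorization header before each request and logs the status code afterwards; it assumes only the fluent Request API visible above (the withAuth helper name and the token are hypothetical).

public static Request withAuth(Request request, final String token) {
	return request.setAction(new RequestAction() {
		@Override
		public void before(CloseableHttpClient client, HttpUriRequest req) {
			// Runs just before the request executes; HttpUriRequest is a plain
			// Apache HttpClient type, so headers can be set directly.
			req.setHeader("Authorization", "Bearer " + token); // hypothetical token
		}

		@Override
		public void after(CloseableHttpClient client, CloseableHttpResponse resp) {
			// Runs after the response arrives; here we just log the status code.
			System.out.println("status:" + resp.getStatusLine().getStatusCode());
		}
	});
}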

Examples:

  • Github (GitHub profile and repository info)
  • OSChinaTweets (OSChina tweets)
  • Qiushibaike (Qiushibaike jokes)
  • Neihanshequ (Neihan Duanzi jokes)
  • ZihuRecommend (Zhihu recommendations)

More Examples: Please see here

Thanks:

webmagic: this project borrows code from webmagic in many places, and its design owes much to webmagic. Many thanks.
xsoup: used as the underlying XPath processor.
JsonPath: used as the underlying JSONPath processor.
Jsoup: used as the underlying HTML/XML processor.
HttpClient: used as the underlying HTTP client.
