abola / Crawlerpack

Licence: apache-2.0
A Java web data crawler pack

Programming Languages

java

Projects that are alternatives of or similar to Crawlerpack

Parsrs
CSV, JSON, XML text parsers and generators written in pure POSIX shellscript
Stars: ✭ 56 (-43.43%)
Mutual labels:  json, xml
Countries States Cities Database
🌍 World countries, states, regions, provinces, cities, towns in JSON, SQL, XML, PLIST, YAML, and CSV. All Countries, States, Cities with ISO2, ISO3, Country Code, Phone Code, Capital, Native Language, Timezones, Latitude, Longitude, Region, Subregion, Flag Emoji, and Currency. #countries #states #cities
Stars: ✭ 1,130 (+1041.41%)
Mutual labels:  json, xml
Feedr
Use feedr to fetch the data from a remote url, respect its caching, and parse its data. Despite its name, it's not just for feed data but also for all data that you can feed into it (including binary data).
Stars: ✭ 56 (-43.43%)
Mutual labels:  json, xml
Fast Xml Parser
Validate XML, Parse XML to JS/JSON and vice versa, or parse XML to Nimn rapidly without C/C++ based libraries and no callback
Stars: ✭ 1,021 (+931.31%)
Mutual labels:  json, xml
Internettools
XPath/XQuery 3.1 interpreter for Pascal with compatibility modes for XPath 2.0/XQuery 1.0/3.0, custom and JSONiq extensions, XML/HTML parsers and classes for HTTP/S requests
Stars: ✭ 82 (-17.17%)
Mutual labels:  json, xml
Ansible Config encoder filters
Ansible role used to deliver the Config Encoder Filters.
Stars: ✭ 48 (-51.52%)
Mutual labels:  json, xml
Ediengine
Simple .NET EDI X12 Reader, Writer and Validator. EDI JSON Serialization and Deserialization. Written in C#
Stars: ✭ 61 (-38.38%)
Mutual labels:  json, xml
Xml Js
Converter utility between XML text and Javascript object / JSON text.
Stars: ✭ 874 (+782.83%)
Mutual labels:  json, xml
Jsoup
jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
Stars: ✭ 9,184 (+9176.77%)
Mutual labels:  jsoup, xml
Jokeapi
A REST API that serves uniformly and well formatted jokes in JSON, XML, YAML or plain text format that also offers a great variety of filtering methods
Stars: ✭ 71 (-28.28%)
Mutual labels:  json, xml
News Please
news-please - an integrated web crawler and information extractor for news that just works.
Stars: ✭ 969 (+878.79%)
Mutual labels:  json, crawler
Filecontextcore
FileContextCore is a "Database"-Provider for Entity Framework Core and adds the ability to store information in files instead of being limited to databases.
Stars: ✭ 91 (-8.08%)
Mutual labels:  json, xml
Evreflection
Reflection based (Dictionary, CKRecord, NSManagedObject, Realm, JSON and XML) object mapping with extensions for Alamofire and Moya with RxSwift or ReactiveSwift
Stars: ✭ 954 (+863.64%)
Mutual labels:  json, xml
Java Client Api
Java client for the MarkLogic enterprise NoSQL database
Stars: ✭ 52 (-47.47%)
Mutual labels:  json, xml
Treefrog Framework
TreeFrog Framework : High-speed C++ MVC Framework for Web Application
Stars: ✭ 885 (+793.94%)
Mutual labels:  json, xml
Fhir.js
Node.JS library for serializing/deserializing FHIR resources between JS/JSON and XML using various node.js XML libraries
Stars: ✭ 61 (-38.38%)
Mutual labels:  json, xml
Rss Parser
A lightweight RSS parser, for Node and the browser
Stars: ✭ 793 (+701.01%)
Mutual labels:  json, xml
Ps Webapi
(Migrated from CodePlex) Let PowerShell Script serve or command-line process as WebAPI. PSWebApi is a simple library for building ASP.NET Web APIs (RESTful Services) by PowerShell Scripts or batch/executable files out of the box.
Stars: ✭ 24 (-75.76%)
Mutual labels:  json, xml
Magento2 Import Export Sample Files
Default Magento 2 CE import / export CSV files & sample files for Firebear Improved Import / Export extension
Stars: ✭ 68 (-31.31%)
Mutual labels:  json, xml
Dbwebapi
(Migrated from CodePlex) DbWebApi is a .Net library that implement an entirely generic Web API (RESTful) for HTTP clients to call database (Oracle & SQL Server) stored procedures or functions in a managed way out-of-the-box without any configuration or coding.
Stars: ✭ 84 (-15.15%)
Mutual labels:  json, xml

CrawlerPack: a Java web data crawler pack

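This package provides a simple, easy-to-use interface for the data protocols and formats commonly found on the web. It is built mainly as an extension around Jsoup and integrates Apache Commons-VFS, which adds support for many more protocols as well as compressed formats.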

Requires JDK 1.7 or higher

To add a dependency on CrawlerPack using Maven, use the following:

<dependency>
    <groupId>com.github.abola</groupId>
    <artifactId>crawler</artifactId>
    <version>1.1.1</version>
</dependency>

To add a dependency using Gradle:

dependencies {
    compile 'com.github.abola:crawler:1.1.1'
}

An easy-to-use example

// URI format source
String uri = "https://raw.githubusercontent.com/abola/CrawlerPack/master/test.json";
    
CrawlerPack.start()
    .getFromJson(uri)
    .select("results name").text() ;

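Features

Support for common network protocols

All protocols supported by Apache Commons-VFS are available, including common ones such as http/https, samba (cifs), ftp, and sftp. For the full list, see https://commons.apache.org/proper/commons-vfs/filesystems.html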

Support for common data formats

  • JSON
  • XML
  • HTML

Support for Chinese XML tags / Chinese JSON fields

CrawlerPack correctly handles XML or JSON whose tags or fields are named in Chinese.

XML

<集合>
    <元素>元素名稱1</元素>
    <元素>元素名稱2</元素>
</集合>

JSON

{"集合":[
    {"元素":"元素名稱1"}
    , {"元素":"元素名稱2"}
]}
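Non-ASCII element names are valid XML, so this is not specific to CrawlerPack. As a minimal sketch using only the JDK's built-in DOM parser (not CrawlerPack or Jsoup), the class and method names below are hypothetical and the element text is simplified. The string literals use Unicode escapes so the file compiles under any source encoding: the outer tag is 集合 and the inner tag is 元素, as in the XML sample above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not part of CrawlerPack.
class ChineseTagSketch {

    /** Returns the text of every element with the given tag name, in document order. */
    static List<String> textsByTag(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the sample above: one outer element holding two inner elements.
        String outer = "\u96c6\u5408";
        String inner = "\u5143\u7d20";
        String xml = "<" + outer + ">"
                + "<" + inner + ">item 1</" + inner + ">"
                + "<" + inner + ">item 2</" + inner + ">"
                + "</" + outer + ">";
        System.out.println(textsByTag(xml, inner));
    }
}
```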

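Automatic detection of remote data encoding

CrawlerPack recommends working with data in UTF-8. For remote data that is not UTF-8 encoded, CrawlerPack enables automatic encoding detection by default and converts the data to UTF-8.

Note that the automatic detection enabled by default is noticeably slower than specifying the encoding directly: in tests it took on average over 300 ms longer per target. If the remote data is not UTF-8 and you are fetching a large volume of it, specifying the remote encoding directly can effectively reduce your overall run time.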

// TWSE 2015 trading value summary for the three major institutional investors
String uri = "http://www.twse.com.tw/ch/trading/fund/BFI82U/BFI82U_print.php"
            +"?begin_date=20150101&end_date=20151231&report_type=month";

CrawlerPack.start()
    .setRemoteEncoding("big5")  // specify the remote encoding directly
    .getFromHtml(uri)
    .select("table.board_trad > tbody > tr:nth-child(7) > td:nth-child(4)").text();
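What converting Big5 data to UTF-8 amounts to, once the source encoding is known, can be sketched with the JDK alone. This is an illustration under stated assumptions, not CrawlerPack's internal code: the class and method names are hypothetical, and the example assumes the JDK ships the Big5 charset (standard JDK distributions do). The sample text is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of CrawlerPack.
class EncodingSketch {

    /** Decodes raw bytes using the named source charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] raw, String sourceCharset) {
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A two-character Chinese sample string.
        String sample = "\u6e2c\u8a66";

        // Simulate remote data that arrives encoded as Big5.
        byte[] big5 = sample.getBytes(Charset.forName("Big5"));

        // Knowing the source encoding up front makes the conversion a plain decode and re-encode.
        byte[] utf8 = toUtf8(big5, "Big5");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(sample));
    }
}
```

Automatic detection has to inspect the bytes to guess the charset before this step can run, which is the extra cost that specifying the encoding avoids.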

General usage examples

JSON format example

// Real-time PM2.5 data
String uri = "http://opendata2.epa.gov.tw/AQX.json";

CrawlerPack.start()
    .getFromJson(uri)
    .getElementsByTag("pm2.5").text();

XML format example

// Jobs on 104 Job Bank with a monthly salary of 100,000 or more
String uri = "http://www.104.com.tw/i/apis/jobsearch.cfm?order=2&fmt=4&cols=JOB,NAME&slmin=100000&sltp=S&pgsz=20";
    
CrawlerPack.start()
    .getFromXml(uri)
    .select("item").get(0).attr("job") ;

HTML format example

// Latest article list on the PTT StupidClown board
String uri = "https://www.ptt.cc/bbs/StupidClown/index.html";

CrawlerPack.start()
    .getFromHtml(uri)
    .select("div.title > a").text();

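Set userAgent example (CrawlerPack >= 1.1)

// any page works here; this reuses the URI from the HTML format example
String uri = "https://www.ptt.cc/bbs/StupidClown/index.html";

System.out.println(
  CrawlerPack.start()
    .setUserAgent("")
    .getFromHtml(uri)
    .select("*")
);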

Cookie example

// Title of the very first post on the PTT Gossiping board
String uri = "https://www.ptt.cc/bbs/Gossiping/M.1119222611.A.7A9.html";

CrawlerPack.start()
    .addCookie("over18","1")  // cookies must be set before calling getFromXXX
    .getFromHtml(uri)
    .select("span:containsOwn(標題) + span:eq(1)").text();
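At the HTTP level, a cookie set this way is sent to the server in a Cookie request header. The sketch below shows how name/value pairs are typically joined into that header value. It is a hypothetical helper written against the JDK only and does not claim to mirror how CrawlerPack builds the header internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; not part of CrawlerPack.
class CookieHeaderSketch {

    /** Joins cookies into a single Cookie header value, e.g. "a=1; b=2". */
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the header is deterministic.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("over18", "1");
        System.out.println("Cookie: " + cookieHeader(cookies)); // prints "Cookie: over18=1"
    }
}
```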

Compressed data example (gzip/gz)

// Taipei City YouBike information
String uri = "gz:https://tcgbusfs.blob.core.windows.net/blobyoubike/YouBikeTP.gz";

// List all rental stations in Da'an District (大安區)
CrawlerPack.start()
    .getFromJson(uri)
    .select("retVal > *:contains(大安區)");
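The gz: prefix asks for the remote content to be decompressed before it is parsed. What that decompression step involves can be sketched with the JDK's java.util.zip classes. This is an in-memory round trip for illustration only: the class and method names are hypothetical, and CrawlerPack itself delegates this work to Commons-VFS rather than to code like this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only; not part of CrawlerPack.
class GzipSketch {

    /** Compresses UTF-8 text with gzip (stands in for the remote .gz payload). */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses gzip bytes back into UTF-8 text. */
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String json = "{\"retVal\":{}}";
        System.out.println(gunzip(gzip(json)).equals(json));
    }
}
```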

Compressed data example (zip)

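// Ministry of the Interior actual price registration (real estate transactions)
String uri = "zip:http://plvr.land.moi.gov.tw"
             + "/Download?type=zip&fileName=lvr_landxml.zip"
             + "!/A_LVR_LAND_A.XML";  // path and name of the file to extract from the archive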

// org.jsoup.select.Elements
Elements elems = CrawlerPack.start()
                    .getFromXml(uri)
                    .select("買賣");  // one element per sale record

for(Element elem: elems){
    System.out.println(
        elem.select("鄉鎮市區").text() +    // township / district
        "," + elem.select("總價元").text()  // total price
    );
}
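The compressed-data URIs above follow a layered pattern: a scheme prefix (gz: or zip:), the URL of the remote file, and, for archives, "!" followed by the path of the entry inside the archive. A small sketch of composing such strings is given below. The helper names are hypothetical and the code only builds strings in the shape used by the examples above; it does not validate them against Commons-VFS.

```java
// Hypothetical helper, for illustration only; not part of CrawlerPack.
class LayeredUriSketch {

    /** Builds "zip:<archiveUrl>!/<entryPath>", adding the leading slash if it is missing. */
    static String zipEntryUri(String archiveUrl, String entryPath) {
        String path = entryPath.startsWith("/") ? entryPath : "/" + entryPath;
        return "zip:" + archiveUrl + "!" + path;
    }

    /** Builds "gz:<url>" for a single gzip-compressed remote file. */
    static String gzUri(String url) {
        return "gz:" + url;
    }

    public static void main(String[] args) {
        // example.com is a placeholder host, not a real data source.
        System.out.println(zipEntryUri("http://example.com/data.zip", "A.XML"));
        // prints "zip:http://example.com/data.zip!/A.XML"
        System.out.println(gzUri("https://example.com/data.gz"));
        // prints "gz:https://example.com/data.gz"
    }
}
```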

Tips

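Specifying the document encoding

CrawlerPack's main goal is to offer a simple, beginner-friendly way of working. Its performance, however, is not ideal, mainly because of encoding detection: to keep the default usage easy, it uses juniversalchardet to detect the encoding of remote content automatically. Specifying the remote encoding directly skips the automatic detection and improves performance a little. If the remote content is UTF-8, there is no need to specify it.

Taking the Taiwan Stock Exchange website as an example, a request without a specified encoding completes in about 600 ms on average.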

// TWSE 2015 trading value summary for the three major institutional investors
String uri = "http://www.twse.com.tw/ch/trading/fund/BFI82U/BFI82U_print.php"
            +"?begin_date=20150101&end_date=20151231&report_type=month";

// Guava Stopwatch
Stopwatch timer = Stopwatch.createStarted();
CrawlerPack.start()
    .getFromHtml(uri);
System.out.println( timer.stop().toString() );
// avg 600ms 

After setting the remote encoding to big5, the time drops a little. How much it drops depends on your processor's performance.

Stopwatch timer = Stopwatch.createStarted();
CrawlerPack.start()
    .setRemoteEncoding("big5")
    .getFromHtml(uri);
System.out.println( timer.stop().toString() );
// avg 480ms 
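The figures above are averages, so a single run is not a reliable comparison. A minimal sketch of an averaging harness built on System.nanoTime (instead of Guava's Stopwatch) is shown below. The class and method names are hypothetical, and the measured task is a placeholder where a CrawlerPack call would go.

```java
// Hypothetical helper, for illustration only; not part of CrawlerPack.
class TimingSketch {

    /** Runs the task the given number of times and returns the mean duration in milliseconds. */
    static double averageMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace the lambda body with the fetch being measured.
        double avg = averageMillis(() -> Math.sqrt(12345.678), 10);
        System.out.println("avg " + avg + " ms");
    }
}
```

Running the same harness once with and once without setRemoteEncoding would give a like-for-like comparison, though network latency will still dominate and vary between runs.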

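Setting the User Agent

Some websites use the User-Agent header to block GoogleBot or other crawlers. CrawlerPack (>= 1.1) disguises itself as a regular browser by default.

Package              Default User-Agent
Jsoup                Java/1.8.0_20
Apache Commons VFS   Jakarta-Commons-VFS
CrawlerPack          Mozilla/5.0 (CrawlerPack; )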
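For reference, overriding the User-Agent with the JDK's own HTTP client looks like the sketch below. It only prepares the request and never connects. The class and method names are hypothetical, and this is not how CrawlerPack sets the header internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, for illustration only; not part of CrawlerPack.
class UserAgentSketch {

    /** Prepares (but does not open) an HTTP connection carrying a custom User-Agent header. */
    static HttpURLConnection openWithUserAgent(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder; no request is sent until the connection is used.
        HttpURLConnection conn =
                openWithUserAgent("http://example.com/", "Mozilla/5.0 (CrawlerPack; )");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```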

Debugging (CrawlerPack >= 1.1)

CrawlerPack's default log level is "Warn". To change the log level, follow the example below.

// set to debug level
CrawlerPack.setLoggerLevel(SimpleLog.LOG_LEVEL_DEBUG);
 
// turn off logging
CrawlerPack.setLoggerLevel(SimpleLog.LOG_LEVEL_OFF);

Change log

1.1.1

  • Fixed: tar files could not be opened correctly

1.1

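  • Major changes
    • Updated: Jsoup to version 1.9.2
    • Updated: JAVA-Json to version 20160212
    • Updated: removed the Slf4j dependency
    • Changed: the XML parser now uses the native Jsoup XML parser (newer Jsoup versions support parsing XML with non-ASCII characters)
  • New: static CrawlerPack.setLoggingLevel(int level) adjusts CrawlerPack's log level
  • New: userAgent(String agent) sets the userAgent content
  • Changed: CrawlerPack's default log level is now Warn
  • Changed: http/https requests now include userAgent information by default
  • Changed: added support for gzip-compressed text streams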

1.0.3-1

  • Updated Apache Commons-VFS to version 2.1

1.0.3

  • Fixed (worked around) a bug where the character length of compressed formats could not be obtained

1.0.2

  • Fixed a NullPointerException when fetching an archive whose URI includes an internal path
  • Fixed a bug where automatic encoding detection emptied the data

1.0.1

  • Changed getFromHtml to use Jsoup's built-in HTML parser
  • Added automatic encoding detection (add library juniversalchardet)
  • Added setRemoteEncoding(String encoding) to set the encoding of the remote content

1.0.0

  • Reworked the API interface
  • Added cookie support

0.9.2

  • Fixed errors when parsing comments and special symbols inside JS
  • Fixed dynamic page data being cached

0.9.1

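  • Added a license: released under the Apache 2.0 license
  • The project has been published to a public Maven repository, so CrawlerPack can now be used directly via pom.xml
  • Fixed https PKIX validation failures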
