jhy / Jsoup

Licence: MIT
jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Projects that are alternatives to or similar to Jsoup

Jquery Xpath
jQuery XPath plugin (with full XPath 2.0 language support)
Stars: ✭ 173 (-98.12%)
Mutual labels:  dom, xml, xpath
Fluentdom
A fluent api for working with XML in PHP
Stars: ✭ 327 (-96.44%)
Mutual labels:  dom, xml, xpath
Preact Markup
⚡️ Render HTML5 as VDOM, with Components as Custom Elements!
Stars: ✭ 167 (-98.18%)
Mutual labels:  parse, dom, xml
Didom
Simple and fast HTML and XML parser
Stars: ✭ 1,939 (-78.89%)
Mutual labels:  dom, xml, xpath
Xml
XML without worries
Stars: ✭ 35 (-99.62%)
Mutual labels:  dom, xml, xpath
Pugixml
Light-weight, simple and fast XML parser for C++ with XPath support
Stars: ✭ 2,809 (-69.41%)
Mutual labels:  dom, xml, xpath
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (-97.48%)
Mutual labels:  parse, jsoup, dom
Spider Flow
A next-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.
Stars: ✭ 365 (-96.03%)
Mutual labels:  jsoup, xpath
Xpath
XPath package for Golang, supports HTML, XML, JSON document query.
Stars: ✭ 376 (-95.91%)
Mutual labels:  xml, xpath
Basex
BaseX Main Repository.
Stars: ✭ 515 (-94.39%)
Mutual labels:  xml, xpath
Dom4j
flexible XML framework for Java
Stars: ✭ 689 (-92.5%)
Mutual labels:  dom, xml
Xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Stars: ✭ 335 (-96.35%)
Mutual labels:  xml, xpath
Exist
eXist Native XML Database and Application Platform
Stars: ✭ 294 (-96.8%)
Mutual labels:  xml, xpath
Parsel
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
Stars: ✭ 628 (-93.16%)
Mutual labels:  xml, xpath
Etree
parse and generate XML easily in go
Stars: ✭ 763 (-91.69%)
Mutual labels:  dom, xml
Camaro
camaro is a utility to transform XML to JSON, using a Node.js binding to the native XML parser pugixml, one of the fastest XML parsers around.
Stars: ✭ 438 (-95.23%)
Mutual labels:  xml, xpath
Crawlerforreader
A local web-novel crawler for Android, based on jsoup and XPath
Stars: ✭ 312 (-96.6%)
Mutual labels:  jsoup, xpath
Sirix
SirixDB is a temporal, evolutionary database system, which uses an accumulate only approach. It keeps the full history of each resource. Every commit stores a space-efficient snapshot through structural sharing. It is log-structured and never overwrites data. SirixDB uses a novel page-level versioning approach called sliding snapshot.
Stars: ✭ 638 (-93.05%)
Mutual labels:  xml, xpath
Amazon Mobile Sentiment Analysis
Opinion mining of Mobile reviews on Amazon platform
Stars: ✭ 19 (-99.79%)
Mutual labels:  xml, xpath
Html React Parser
📝 HTML to React parser.
Stars: ✭ 846 (-90.79%)
Mutual labels:  parse, dom

jsoup: Java HTML Parser

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe-list, to prevent XSS attacks
  • output tidy HTML
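
As a rough illustration of the capabilities listed above, the following sketch parses HTML from a string, selects with a CSS selector, edits attributes, and cleans untrusted input. The class name JsoupFeatures and its method names are made up for this example, and it assumes jsoup 1.14.2 or later, where the safe-list class is named Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, not part of jsoup itself.
public class JsoupFeatures {

    // Parse an HTML string and return the text of the first link, or "" if none.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.text();
    }

    // Set rel="nofollow" on every link in a fragment and return the edited HTML.
    static String addNofollow(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Keep only the tags and attributes allowed by the basic safe-list.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(firstLinkText("<p>See <a href='/docs'>the docs</a></p>"));
        System.out.println(addNofollow("<a href=\"https://example.com\">x</a>"));
        System.out.println(sanitize("<p>Hi <script>alert(1)</script><b>there</b></p>"));
    }
}
```

Safelist.basic() is only one of the predefined safe-lists; which one fits depends on how much markup the application wants to accept from users.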

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup, and it creates a sensible parse tree in each case.
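
To make the tag-soup claim concrete, here is a minimal sketch that feeds jsoup markup with missing end tags and missing structural elements, then inspects the resulting tree. The class TagSoupDemo and its methods are illustrative names only. The repairs it relies on (an open p being closed by the next p, and a tbody being inserted under table) follow the HTML5 parsing rules that jsoup implements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not part of jsoup itself.
public class TagSoupDemo {

    // Count the <p> elements jsoup produces for the given markup.
    static int countParagraphs(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Read a cell through the full table > tbody > tr > td path,
    // which only matches if the parser inserted the missing <tbody>.
    static String cellText(String html) {
        return Jsoup.parse(html).select("table > tbody > tr > td").text();
    }

    public static void main(String[] args) {
        // Unclosed paragraphs become two sibling <p> elements.
        System.out.println(countParagraphs("<p>One<p>Two"));

        // No <html>, <head>, <body>, or <tbody> in the input.
        System.out.println(cellText("<table><tr><td>cell"));

        // The serialized document shows the structure jsoup filled in.
        Document doc = Jsoup.parse("<title>T</title>Hello");
        System.out.println(doc.outerHtml());
    }
}
```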

See jsoup.org for downloads and the full API documentation.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:

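import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch and parse the page; get() throws IOException on failure
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
System.out.println(doc.title());
// Links in bold text inside the element with id "mp-itn"
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  System.out.printf("%s\n\t%s%n",
    headline.attr("title"), headline.absUrl("href"));
}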
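
The example reads links with absUrl("href") rather than attr("href"). The offline sketch below shows the difference, assuming the page is parsed with a base URI (Jsoup.connect sets this from the fetched URL; here it is passed explicitly). The class AbsUrlDemo and its method names are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup itself.
public class AbsUrlDemo {

    // Raw attribute value of the first link, exactly as written in the markup.
    static String rawHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.attr("href");
    }

    // The same link resolved against the document's base URI.
    static String absoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/wiki/Java'>Java</a>";
        String base = "https://en.wikipedia.org/";
        System.out.println(rawHref(html, base));      // /wiki/Java
        System.out.println(absoluteHref(html, base)); // https://en.wikipedia.org/wiki/Java
    }
}
```

When a document is parsed without a base URI, absUrl cannot resolve a relative link and returns an empty string, so passing the base URI matters when parsing from a string or file.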

Online sample, full source.

Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Getting started

  1. Download the latest jsoup jar (or add it to your Maven/Gradle build)
  2. Read the cookbook
  3. Enjoy!

Development and support

If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via the mailing list.

If you find any issues, please file a bug after checking for duplicates.

The colophon talks about the history of and tools used to build jsoup.

Status

jsoup is in general release and is stable.