License: Apache-2.0
📔 Extract plain or structured text from HTML content in R

jericho : Break Down the Walls of ‘HTML’ Tags into Usable Text

Structured ‘HTML’ content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain “just the text” from a document, free from the walls of tags that surround it. Tools are provided that wrap methods in the ‘Jericho HTML Parser’ Java library by Martin Jericho (http://jericho.htmlparser.net/docs/index.html). Martin’s library is used in many at-scale projects, including the Internet Archive.

As a result of using a Java library, this package requires rJava.

The following functions are implemented:

  • html_to_text: Convert HTML to Text
  • render_html_to_text: Render HTML to Text

Installation

If you use devtools, it should pick up the Remotes: section in DESCRIPTION. Until the package is on CRAN, you might also want to install jerichojars explicitly, as shown below:

install.packages(c("jerichojars", "jericho"), repos = "https://cinc.rud.is/")
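
For the devtools route mentioned above, a minimal sketch might look like the following. It assumes both packages are hosted on GitHub under the hrbrmstr account with these repository names; that is an assumption, so adjust the paths if the sources live elsewhere.

```r
# Assumption: both repositories live at github.com/hrbrmstr/<name>.
# install.packages("devtools")  # uncomment if devtools is not installed yet

# Install the Java archive dependency first, then the package itself.
devtools::install_github("hrbrmstr/jerichojars")
devtools::install_github("hrbrmstr/jericho")
```

Installing jerichojars first avoids relying on the Remotes: field being resolved automatically.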

Usage

Let’s use this NASA blog post as an example.

library(jericho)

# current version
packageVersion("jericho")
## [1] '0.2.0'
URL <- "https://blogs.nasa.gov/spacestation/2017/09/02/touchdown-expedition-52-back-on-earth/"
  
doc <- paste0(readr::read_lines(URL), collapse = "\n")

This is pure text extraction:

html_to_text(doc)

This renders a human-readable version of the content, modelled on the way Mozilla Thunderbird and other email clients automatically convert HTML content to text for the plain-text alternative MIME part of an email.

render_html_to_text(doc)

You should run each to see and compare the output (GitHub markdown documents aren’t the best viewing medium).
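
To compare the two functions without fetching a remote page, here is a minimal sketch on a small hand-written document. The sample HTML and the variable names are hypothetical, and it assumes both functions accept a single character string of HTML, as in the calls above. No expected output is shown because the exact formatting depends on the underlying Java library.

```r
library(jericho)

# Hypothetical sample document (not taken from the NASA post)
html <- paste0(
  "<html><body>",
  "<h1>Title</h1>",
  "<p>First <b>bold</b> paragraph.</p>",
  "<ul><li>one</li><li>two</li></ul>",
  "</body></html>"
)

# Pure text extraction: tags are dropped and only the text is kept
plain <- html_to_text(html)

# Rendered text: aims to keep a readable layout (headings, lists, breaks)
rendered <- render_html_to_text(html)

# Print both results one after the other for a visual comparison
cat("--- html_to_text ---\n", plain, "\n", sep = "")
cat("--- render_html_to_text ---\n", rendered, "\n", sep = "")
```

Printing with cat() rather than relying on auto-printing shows real line breaks, which makes the layout differences easier to see.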

jericho Metrics

Lang   # Files   (%)  LoC   (%)  Blank lines   (%)  # Lines   (%)
Java         2  0.18   49  0.38            9  0.19       14  0.13
R            6  0.55   40  0.31           10  0.21       62  0.56
Maven        1  0.09   23  0.18            1  0.02        1  0.01
Rmd          1  0.09    9  0.07           24  0.50       33  0.30
make         1  0.09    8  0.06            4  0.08        0  0.00