ropensci / Pdftools

Licence: other

Text Extraction, Rendering and Converting of PDF Documents

Labels

rstats r-package text-extraction pdf-files

pdftools

Introduction

Scientific articles are typically locked away in PDF, a format designed primarily for printing and not so great for searching or indexing. The pdftools package extracts text and metadata from PDF files in R. From the extracted plain text one can find articles that discuss a particular drug or species name without relying on publisher-provided metadata or paywalled search engines.

The pdftools package overlaps slightly with the Rpoppler package by Kurt Hornik. The main motivation for developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

Installation

On Windows and Mac the binary packages can be installed directly from CRAN:

install.packages("pdftools")

Installation on Linux requires the poppler development library. On Ubuntu 16.04 (Xenial) and Ubuntu 18.04 (Bionic) we have backports that support the latest pdf_data() functionality:

sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev

On other versions of Debian or Ubuntu simply use:

sudo apt-get install libpoppler-cpp-dev

If you want to install the package from source on macOS, you need Homebrew to install poppler:

brew install poppler

On Fedora:

sudo yum install poppler-cpp-devel

Building from source

On Ubuntu

Update: It is now recommended to use the backport PPA mentioned above. If you really want to build from source, follow the instructions in this askubuntu.com answer.

On CentOS

On CentOS the libpoppler-cpp library is not included with the system, so we need to build it from source. Note that recent versions of poppler require C++11, which is not available on CentOS, so we build a slightly older version of libpoppler.

# Build dependencies
yum install wget xz libjpeg-devel openjpeg2-devel

# Download and extract
wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
tar -Jxvf poppler-0.47.0.tar.xz
cd poppler-0.47.0

# Build and install
./configure
make
sudo make install

By default libraries get installed in /usr/local/lib and /usr/local/include. On CentOS this is not a default search path so we need to set PKG_CONFIG_PATH and LD_LIBRARY_PATH to point R to the right directory:

export LD_LIBRARY_PATH="/usr/local/lib"
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"

We can then start R and install pdftools.

Getting started

The ?pdftools manual page gives a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector with one element per page of the PDF. Each string in the vector contains a plain-text version of the text on that page.

library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
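Because the result is an ordinary character vector, base R string functions are enough to search it. The following is a minimal sketch, not part of the package: the `pages` vector is made-up text standing in for the output of pdf_text, so the example runs without downloading a PDF.

```r
# Stand-in for the result of pdf_text(): one string per page (made-up text).
pages <- c(
  "Title page\nA study of Panthera leo in East Africa",
  "Methods\nSamples were collected in 2014",
  "Results\nPanthera leo density was highest in the south"
)

# Which pages mention the species name? (case-insensitive match)
hits <- which(grepl("panthera leo", pages, ignore.case = TRUE))

# Split the matching pages into lines and keep only the lines that match.
lines_by_page <- strsplit(pages, "\n", fixed = TRUE)
matches <- lapply(lines_by_page[hits], function(x) {
  grep("panthera leo", x, ignore.case = TRUE, value = TRUE)
})
names(matches) <- paste0("page_", hits)

print(hits)
print(matches)
```

With a real document, replace `pages` with `txt` from the example above.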

In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function returns the table of contents, i.e. the section headers that PDF readers usually display in a menu on the left. It looks pretty in JSON:

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
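For a plain-text outline instead of JSON, the nested list can be flattened recursively. This is only a sketch: the `toc` object below is a hypothetical stand-in, and its title/children layout is an assumption about what pdf_toc returns, so inspect `str(toc)` on a real file before relying on it. The helper `flatten_toc` is likewise illustrative and not part of pdftools.

```r
# Hypothetical stand-in for pdf_toc() output. The nested title/children
# layout is an assumption; verify it with str(toc) on a real document.
toc <- list(
  title = "",
  children = list(
    list(title = "1 Introduction", children = list()),
    list(title = "2 Methods", children = list(
      list(title = "2.1 Data", children = list())
    ))
  )
)

# Recursively collect titles, indenting two spaces per nesting level.
# Nodes without a title (such as the root) add no indentation level.
flatten_toc <- function(node, depth = 0) {
  has_title <- !is.null(node$title) && nzchar(node$title)
  own <- if (has_title) paste0(strrep("  ", depth), node$title) else character(0)
  next_depth <- if (has_title) depth + 1 else depth
  kids <- unlist(lapply(node$children, flatten_toc, depth = next_depth))
  c(own, kids)
}

outline <- flatten_toc(toc)
cat(outline, sep = "\n")
```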

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc
info <- pdf_info("1403.2805.pdf")

# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")

Bonus feature: rendering pdf

A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image (each writer comes from a separate package:
# png, jpeg and webp must be installed)
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")

This feature is still experimental and currently does not work on Windows.

Limitations and related packages

Tables

Data scientists are often interested in data from tables. Unfortunately the PDF format is pretty dumb and has no notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with pdftools.

txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")

# some tables
cat(txt[18])
cat(txt[19])

The tabulizer package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. However, tabulizer depends on rJava and therefore requires additional setup steps or may be impossible to use on systems where Java cannot be installed.

With some creativity, pdftools on its own can be used to parse tables from PDF documents, which avoids the need to install Java.
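One possible approach, sketched here under stated assumptions, is to treat the page text as whitespace-aligned columns and split each row on runs of two or more spaces. The `page` string is made-up data standing in for one element of the pdf_text result, and the heuristic assumes that every cell is filled and that columns are separated by at least two spaces.

```r
# Stand-in for one page returned by pdf_text(); the table content is made up.
page <- paste(
  "Table 1: Example measurements",
  "Species          Count   Mean",
  "Panthera leo        12   3.50",
  "Loxodonta africana   7  10.25",
  sep = "\n"
)

page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
rows <- page_lines[-1]  # drop the caption line

# Columns are separated by runs of two or more spaces; single spaces
# inside a cell (e.g. "Panthera leo") are preserved.
cells <- strsplit(trimws(rows), " {2,}")
stopifnot(all(lengths(cells) == 3))  # every row must have three cells

# The first row is the header; the remaining rows are the table body.
body <- do.call(rbind, cells[-1])
tab <- data.frame(
  Species = body[, 1],
  Count   = as.integer(body[, 2]),
  Mean    = as.numeric(body[, 3]),
  stringsAsFactors = FALSE
)
print(tab)
```

This breaks down when cells are empty or columns are separated by a single space. In those cases the word positions reported by pdf_data(), mentioned in the installation notes, are probably a better starting point.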

Scanned text

If you want to extract text from scanned pages in a PDF, you'll need to use OCR (optical character recognition). Please refer to the rOpenSci tesseract package, which provides bindings to the Tesseract OCR engine. In particular, read the section of its vignette about reading from PDF files using pdftools and tesseract.
