hrbrmstr / pdfbox

License: Apache-2.0
📄◻️ Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)

Programming Languages

java
68154 projects - #9 most used programming language
r
7636 projects
Makefile
30231 projects

Projects that are alternatives of or similar to pdfbox

Pdfpig
Read and extract text and other content from PDFs in C# (port of PdfBox)
Stars: ✭ 391 (+750%)
Mutual labels:  pdf-files, pdf-document
Docnet
DocNET is a fast PDF editing and reading library for modern .NET applications
Stars: ✭ 128 (+178.26%)
Mutual labels:  pdf-files, pdf-document
Pdfio.jl
PDF Reader Library for Native Julia.
Stars: ✭ 56 (+21.74%)
Mutual labels:  pdf-files, pdf-document
Boxable
Boxable is a library that can be used to easily create tables in pdf documents.
Stars: ✭ 253 (+450%)
Mutual labels:  pdf-files, pdf-document
Pdfcpu
A PDF processor written in Go.
Stars: ✭ 2,852 (+6100%)
Mutual labels:  pdf-files
Combine pdf
A pure Ruby library to merge PDF files, number pages and maybe more...
Stars: ✭ 552 (+1100%)
Mutual labels:  pdf-files
Pdftools
Text Extraction, Rendering and Converting of PDF Documents
Stars: ✭ 349 (+658.7%)
Mutual labels:  pdf-files
pdf2html
pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate thumbnail images for PDF files using Apache PDFBox.
Stars: ✭ 55 (+19.57%)
Mutual labels:  pdfbox
Pdfcompare
A simple Java library to compare two PDF files
Stars: ✭ 128 (+178.26%)
Mutual labels:  pdf-files
Pdf
Rust library to read, manipulate and write PDF files.
Stars: ✭ 265 (+476.09%)
Mutual labels:  pdf-files
Htmldoc
HTML Conversion Software
Stars: ✭ 99 (+115.22%)
Mutual labels:  pdf-files
Images To Pdf
An app to convert images to PDF file!
Stars: ✭ 602 (+1208.7%)
Mutual labels:  pdf-files
Traprange
(Java) A method to extract tabular content from PDF files
Stars: ✭ 236 (+413.04%)
Mutual labels:  pdf-files
testarea-pdfbox2
Test area for public PDFBox v2 issues on stackoverflow etc
Stars: ✭ 58 (+26.09%)
Mutual labels:  pdfbox
Pybooks
python books
Stars: ✭ 87 (+89.13%)
Mutual labels:  pdf-files
pdfbox-docs
Mirror of Apache PDFBox Docs
Stars: ✭ 18 (-60.87%)
Mutual labels:  pdfbox
Hummusrecipe
A powerful PDF tool for NodeJS based on HummusJS.
Stars: ✭ 274 (+495.65%)
Mutual labels:  pdf-files
wayback
⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs
Stars: ✭ 52 (+13.04%)
Mutual labels:  r-cyber
PdfComponentComparison
A repository for comparing PDF-handling components such as Aspose.Pdf, Spire.Pdf and iText. It also serves as a demo of how to use those components.
Stars: ✭ 39 (-15.22%)
Mutual labels:  pdf-document
MarkdownDoc
A Java tool/Maven plugin/library to generate HTML and PDF from markdown text intended for project documentation. Supports JSON based "stylesheet" for PDFs.
Stars: ✭ 21 (-54.35%)
Mutual labels:  pdfbox

pdfbox

Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)

Description

I came across this thread (https://twitter.com/derekwillis/status/922138080043241473) and it looks like some misguided folks are going to help promote the use of PDF documents as a legit way to disseminate data, which means that we’re likely to see more evil orgs and government agencies try to use PDFs to hide data.

PDFs are barely useful as publication holders these days, let alone as data sources.

Apache PDFBox is a project that provides a comprehensive suite of tools to do things with and to PDF documents.

The aim here is to fill in any gaps in pdftools, since poppler may not try to accommodate all the stupidity that we’re now likely to see.

What’s Inside The Tin

  • The ability to extract URI annotations

The following functions are implemented:

  • extract_uris: Extract URI annotations from a PDF document
  • extract_text: Extract text from a PDF document
  • pdf_info: Retrieve PDF Metadata

Installation

devtools::install_github("hrbrmstr/pdfboxjars")
devtools::install_github("hrbrmstr/pdfbox")

Usage

library(pdfbox)

# current version
packageVersion("pdfbox")
## [1] '0.3.0'

PDF Info

pdf_info(
 system.file(
   "extdata", "imperfect-forward-secrecy-ccs15.pdf", package="pdfbox"
 )
) -> info

dplyr::glimpse(info)
## Observations: 1
## Variables: 7
## $ title             <chr> "Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice"
## $ subject           <chr> ""
## $ author            <chr> ""
## $ creation_date     <chr> "2015-08-21T11:06:23-04:00[GMT-04:00]"
## $ modification_date <chr> "2015-08-21T11:08:05-04:00[GMT-04:00]"
## $ producer          <chr> "pdfTeX-1.40.14"
## $ keywords          <chr> ""
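
The date columns above come back as character strings with a bracketed zone suffix. If real date-times are needed, something like the following base R sketch could convert them. The helper name `parse_pdf_date` is hypothetical (it is not part of pdfbox), and the sketch assumes every value follows the same layout as the one shown in the output above.

```r
# Timestamp string copied from the pdf_info() output above.
raw_date <- "2015-08-21T11:06:23-04:00[GMT-04:00]"

# Hypothetical helper (not a pdfbox function): turn the string into POSIXct.
parse_pdf_date <- function(x) {
  # Drop the trailing bracketed zone id, e.g. "[GMT-04:00]".
  x <- sub("\\[.*\\]$", "", x)
  # Rewrite the offset "-04:00" as "-0400" so that %z can read it.
  x <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")
}

created <- parse_pdf_date(raw_date)

# Show the parsed value in UTC.
format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

With a real result this would be applied to `info$creation_date` and `info$modification_date`.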

Extract URI Annotations

extract_uris(
  system.file("extdata","imperfect-forward-secrecy-ccs15.pdf", package="pdfbox")
)
## # A tibble: 33 x 3
##     page uri                                                                    text                                    
##    <int> <chr>                                                                  <chr>                                   
##  1     1 https://weakdh.org                                                     WeakDH.org.                             
##  2     6 www.fbi.gov                                                            www.fbi.gov.                            
##  3    12 http://cr.yp.to/factorization/smoothparts-20040510.pdf                 http://cr.yp.to/factorization/smoothpar…
##  4    12 http://caramel.loria.fr/p180.txt                                       http://caramel.loria.fr/p180.txt.       
##  5    12 http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf         http://www.hyperelliptic.org/tanja/     
##  6    12 http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf         SHARCS/talks06/thorsten.pdf.            
##  7    13 https://www.olcf.ornl.gov/titan                                        https://www.olcf.ornl.gov/titan.        
##  8    13 http://www.spiegel.de/international/germany/inside-the-nsa-s-war-on-i… http://www.spiegel.de/international/ger…
##  9    13 http://www.spiegel.de/international/germany/inside-the-nsa-s-war-on-i… inside-the-nsa-s-war-on-internet-securi…
## 10    13 http://www.sagemath.org                                                http://www.sagemath.org.                
## # … with 23 more rows
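
In the output above, rows 5 and 6 share one URI because the link text wraps across two lines, and `www.fbi.gov` has no scheme. A rough base R sketch for tidying that up follows. It runs on a small hand-built data frame whose rows are copied from the untruncated entries above, so it is a stand-in for the real `extract_uris()` result rather than pdfbox output. The `has_scheme` column is an illustrative addition.

```r
# Stand-in for the extract_uris() result: rows copied from the output above.
uris <- data.frame(
  page = c(1L, 6L, 12L, 12L, 12L),
  uri = c(
    "https://weakdh.org",
    "www.fbi.gov",
    "http://caramel.loria.fr/p180.txt",
    "http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf",
    "http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf"
  ),
  stringsAsFactors = FALSE
)

# Keep one row per distinct link per page (collapses wrapped annotations).
links <- unique(uris[, c("page", "uri")])

# Flag entries that lack a scheme such as "http://" (illustrative column).
links$has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", links$uri)

links
```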

Extract text

extract_text(
  system.file(
    "extdata", "imperfect-forward-secrecy-ccs15.pdf", package="pdfbox"
  )
) -> pg_df

dplyr::glimpse(pg_df)
## Observations: 13
## Variables: 2
## $ page <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
## $ text <chr> "Imperfect Forward Secrecy:\nHow Diffie-Hellman Fails in Practice\nDavid Adrian¶ Karthikeyan Bhargavan∗ …
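
Each page's text arrives as a single string with embedded newlines, so a common next step is one row per line. The base R sketch below shows one way to do that. It uses a hand-built stand-in for `pg_df`: the first string is the start of page 1 from the output above, and the second is made-up placeholder text.

```r
# Stand-in for pg_df; page 2 text is a made-up placeholder.
pg_df <- data.frame(
  page = 1:2,
  text = c(
    "Imperfect Forward Secrecy:\nHow Diffie-Hellman Fails in Practice",
    "placeholder line one\nplaceholder line two\nplaceholder line three"
  ),
  stringsAsFactors = FALSE
)

# Split each page on newlines, giving one character vector per page.
line_list <- strsplit(pg_df$text, "\n", fixed = TRUE)

# Build a long data frame: page number, line number within page, line text.
lines_df <- data.frame(
  page = rep(pg_df$page, lengths(line_list)),
  line = sequence(lengths(line_list)),
  text = unlist(line_list),
  stringsAsFactors = FALSE
)

lines_df
```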

pdfbox Metrics

Lang   # Files   (%)  LoC   (%)  Blank lines   (%)  # Lines   (%)
Java         3  0.18  352  0.57           89  0.51       23  0.15
R           10  0.59  132  0.21           47  0.27       77  0.50
XML          1  0.06   69  0.11            0  0.00        0  0.00
Rmd          1  0.06   27  0.04           31  0.18       52  0.34
Maven        1  0.06   27  0.04            3  0.02        1  0.01
make         1  0.06   10  0.02            5  0.03        1  0.01

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
