pdf-association / pdf-corpora

License: CC-BY-4.0
An index of PDF-centric corpora

PDF Corpora

This index references a number of the more significant public corpora (data sets) that may contain both valid and invalid, real and synthetic PDF files, reflecting the realities of processing PDF files 'from the wild'. In addition, targeted test suites for specific PDF features or ISO subsets of PDF are also listed. It is not intended to be a list of every website where PDFs may be obtained.

This informative resource is freely provided to all interested PDF developers, end-users and researchers. It does not reflect endorsement of any organization, website or corpus. If you curate, maintain or identify other corpora that you believe might be useful to the PDF industry and are freely available, please contact us. Some corpora may require registration in order to access.

CAUTION: as with any file downloaded from the internet, good computer security and hygiene practices should always be employed, as some of these corpora contain malicious files! Use at your own risk!

For groups interested in creating their own PDF-centric corpora using GitHub, please consider using Git LFS so that large files can be easily supported.
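
As a minimal sketch of what that involves: Git LFS is driven by patterns in a repository's .gitattributes file, and the rule below is the line that `git lfs track "*.pdf"` writes. The helper name `ensure_lfs_tracking` is hypothetical, the function only edits the text of the file, and it assumes Git LFS itself is already installed and initialised for the repository.

```python
from pathlib import Path

# The attribute line that `git lfs track "*.pdf"` adds to .gitattributes.
LFS_PDF_RULE = "*.pdf filter=lfs diff=lfs merge=lfs -text"


def ensure_lfs_tracking(repo_root: str) -> bool:
    """Append the PDF rule to .gitattributes if missing (hypothetical helper).

    Returns True if the file was changed, False if the rule was already there.
    """
    attributes = Path(repo_root) / ".gitattributes"
    existing = attributes.read_text(encoding="utf-8") if attributes.exists() else ""
    if LFS_PDF_RULE in (line.strip() for line in existing.splitlines()):
        return False
    # Keep any existing rules and make sure the new rule starts on its own line.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    attributes.write_text(existing + LFS_PDF_RULE + "\n", encoding="utf-8")
    return True
```

Calling `ensure_lfs_tracking(".")` from the root of a working copy and then committing .gitattributes has the same effect on tracking as running the `git lfs track` command by hand.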

GovDocs1

This is a very well-studied and large corpus created in 2010, comprising almost 1 million documents from the USA .gov TLD, of which more than 231,000 are PDF documents. The above URL provides lots of detailed information and references. For reference, the 1000 ZIP files total about 308GB. As of November 2020, GovDocs1 has joined the AWS Open Data Sponsorship Program and data is now available directly in AWS S3 (see https://digitalcorpora.org/s3_browser.html#corpora/files/govdocs1/zipfiles/)
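
Because the corpus is distributed as ZIP archives of mixed file types, a common first step is to see how many PDFs a given archive holds. The sketch below uses only the Python standard library. The function name `count_pdfs` and the member names in the usage example are made up for illustration, and the check assumes PDFs can be recognised by a `.pdf` extension, which will miss mislabelled files.

```python
import io
import zipfile
from typing import BinaryIO, Union


def count_pdfs(archive: Union[str, BinaryIO]) -> int:
    """Count ZIP members whose name ends in .pdf (case-insensitive)."""
    with zipfile.ZipFile(archive) as zf:
        return sum(
            1
            for info in zf.infolist()
            if not info.is_dir() and info.filename.lower().endswith(".pdf")
        )


# Usage with a tiny in-memory archive standing in for a real download.
# The member names are invented for this example.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("000/000001.pdf", b"%PDF-1.4\n")
    zf.writestr("000/000002.doc", b"not a pdf")
    zf.writestr("000/000003.PDF", b"%PDF-1.7\n")
buffer.seek(0)
print(count_pdfs(buffer))  # prints 2
```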

CommonCrawl.org

The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions and, of course, millions and millions of PDF files gathered from across the web.

NOTE: Based on published SafeDocs research, many PDFs in the CommonCrawl database are known to be truncated. See T. Allison et al., “Building a Wide Reach Corpus,” in LangSec 2020, May 2020 (http://spw20.langsec.org/papers/corpus_LangSec2020.pdf), which states:

Common Crawl truncates files at 1 MB. If researchers require intact files, they must re-pull truncated files from the original websites. In the December, 2019 crawl, nearly 430,000 PDFs (22%) were truncated.
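
A rough way to flag such files is sketched below. It assumes that "1 MB" means exactly 1,048,576 bytes, which the quoted text does not pin down, and it relies on the fact that a complete PDF normally ends with an `%%EOF` marker near the end of the file. The function name and the 1,024-byte search window are illustrative choices, not part of the cited research.

```python
ASSUMED_LIMIT = 1024 * 1024  # assumption: "1 MB" is 1,048,576 bytes
TAIL_WINDOW = 1024           # illustrative: how far back to look for %%EOF


def looks_truncated(data: bytes, limit: int = ASSUMED_LIMIT) -> bool:
    """Heuristic: the file reaches the size limit and has no %%EOF near its end."""
    reached_limit = len(data) >= limit
    has_eof_marker = b"%%EOF" in data[-TAIL_WINDOW:]
    return reached_limit and not has_eof_marker


# Usage with synthetic data rather than real crawl output.
complete = b"%PDF-1.4\n" + b"x" * 100 + b"\n%%EOF\n"
cut_off = (b"%PDF-1.4\n" + b"x" * ASSUMED_LIMIT)[:ASSUMED_LIMIT]
print(looks_truncated(complete))  # prints False
print(looks_truncated(cut_off))   # prints True
```

This is only a screening heuristic: a file flagged here still has to be re-pulled from the original website to obtain an intact copy.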

SafeDocs "Issue Tracker" Corpus

An outcome of the DARPA-funded SafeDocs research program, a large and growing corpus (>32K files, >31GB) collated by targeted deep-crawls of various issue trackers (e.g. Bugzilla, JIRA, GitHub) to extract PDF attachments on public bug reports for various well-known open-source PDF-aware implementations. These PDFs are not directly discoverable via standard internet search engines. By its very nature, this corpus has a higher than normal quantity of unusual and malformed PDFs. Further technical details of this corpus can be found at https://www.pdfa.org/a-new-stressful-pdf-corpus/ and https://www.pdfa.org/stressful-pdf-corpus-grows/ as well as README files on the Apache Tika regression server.

NOTE: this unsanitized collated corpus contains a few PDFs that are known to trigger certain anti-malware/anti-virus programs.

FoxHex0ne Mutations

A set of mutated PDFs (and other document formats) created via mutation-based fuzzing. See also http://foxhex0ne.blogspot.com/2020/01/lets-continue-with-corpus-distillation.html (archived at https://web.archive.org/web/20200325183758/http://foxhex0ne.blogspot.com/2020/01/lets-continue-with-corpus-distillation.html).

OpenPreserve Foundation Format Corpus

This is a digital preservation focused corpus that is openly licensed and covers a wide range of formats and creation tools.

The Archivist's PDF Cabinet of Horrors

A smaller sub-corpus of PDF test files, created for detecting PDF features that are generally undesirable in an archival setting.

Cal Poly Graphic Communications PDF/VT Test File Suite 1.0.1

This suite provides a collection of four sets of graphically-rich, robust and valid ISO 16612-2 PDF/VT-1 files targeting high-volume variable data printing. Each set comprises PDFs with 10, 100, 500, 1000, 5000, 10000, and 15000 records and can be useful for examining how PDF technology scales as PDF file size and page counts increase.

VeraPDF Test Suite for PDF/A

The veraPDF test corpus targets the ISO 19005 family of PDF/A specifications (Versions 1B, 1A, 2B, 2U, 2A, 3B, 3U, 3A) as well as a number of additional PDF test files for ISO 32000-1:2008 (PDF 1.7). This test suite complements the Isartor and Bavaria PDF/A-1b test suites and follows their test file pattern:

  • all test files are atomic;
  • they are self-documented via the document outlines; and
  • the naming pattern and the directory structure indicate relevant parts of ISO 19005 specifications

Isartor Test Suite for PDF/A-1b

This test suite comprises a set of files that are used to check the conformance with PDF/A-1. More precisely, the Isartor test suite can be used to "validate the validators": It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.

BFO PDF/A Test Suite

A collection of about 40 PDF documents that should either pass or fail a conformance test against the specified ISO 19005 PDF/A profile. The description.txt file lists the reason they should pass or fail. This collection was inspired by the Isartor test suite and follows a similar layout with respect to test case names, including the section of the PDF/A specification to which each test refers. Unlike Isartor, it also includes valid documents which test a particular area of the specification.
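
The "validate the validators" idea behind Isartor and this suite can be sketched as a small harness that compares a validator's verdicts with the expected outcomes. Everything named here is hypothetical: `expected` stands in for outcomes taken from a suite's documentation, the file names are invented, and `validate` is a placeholder for whatever PDF/A validator is under test.

```python
from typing import Callable, Dict, List


def check_validator(
    expected: Dict[str, bool],
    validate: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Compare validator verdicts (True = conforms) with expected outcomes."""
    report: Dict[str, List[str]] = {"missed_violations": [], "false_alarms": []}
    for name, should_pass in sorted(expected.items()):
        verdict = validate(name)
        if verdict and not should_pass:
            # The file violates PDF/A but the validator accepted it.
            report["missed_violations"].append(name)
        elif should_pass and not verdict:
            # The file is valid but the validator rejected it.
            report["false_alarms"].append(name)
    return report


# Usage with made-up file names and a deliberately naive validator
# that only rejects one of the two broken files.
expected = {"valid-a.pdf": True, "broken-b.pdf": False, "broken-c.pdf": False}
naive = lambda name: name != "broken-c.pdf"
print(check_validator(expected, naive))
# prints {'missed_violations': ['broken-b.pdf'], 'false_alarms': []}
```

A suite that contains only failing files can populate `missed_violations` but never `false_alarms`, which is why the valid documents in this collection are a useful complement.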

Ghent Working Group (GWG) Output Test Suite

The Ghent Output Suite (currently v5.0) has been created for testing PDF processing in the graphic arts industry and to determine whether workflows are behaving as expected according to the ISO 15930 family of PDF/X standards. The PDF files provide a series of test patches that can be used by end users of graphic arts equipment as well as developers of applications that handle PDF files. This test suite is highly technical and a good understanding of ISO 15930 is essential.

PDF/UA Reference Suite

To serve as a reference for software developers and practitioners interested in best-practices for creating tagged and accessible PDF files, the PDF Association's PDF/UA Competence Center has posted a set of 10 PDF documents conforming to ISO 14289-1 PDF/UA-1.

Matterhorn Protocol 1.02

The PDF Association's PDF/UA Competence Center developed the Matterhorn Protocol as a list of all the possible ways to fail PDF/UA. Following the requirements of PDF/UA, the document consists of 31 checkpoints comprising 136 Failure Conditions. The Matterhorn Protocol 1.02 (PDF, 339kB, 2014-06-26) is delivered as a PDF file conforming to PDF/UA-1 (ISO 14289-1) and to PDF/A-2a (ISO 19005-2) and is a reference-quality PDF/UA file.

Altona Test Suite

The Altona Test Suite is a set of highly technical PDF files and patches specifically designed for testing ISO 15930 PDF/X compliance and color accuracy including transparency blending, font handling, smooth shades, gray balance, overprinting, etc.

3D PDF Showcase

The 3D PDF showcase corpus provides about 20 PDFs containing different kinds of 3D content from various creators.

Google pdfium Regression Test Suite

A set of PDFs (both real and synthetic) used in regression testing Google's pdfium implementation used in Chrome and elsewhere.

Mozilla pdf.js Regression Test Suite

PDF.js is a PDF viewer supported by Mozilla that is built with HTML5. This entry refers to the set of PDFs used in its regression testing.

Ghent Working Group (GWG) Processing Steps

Three sample PDF files containing ISO 19593-1 compliant processing step data (i.e. PDF optional content layers describing cut contours (die lines), varnish, braille, legends, etc.). These sample files are fully compliant with the ISO standard and serve to illustrate the concepts discussed in the standard.

Ground-truthed data sets for PDF table recognition

Two ground-truthed datasets of natively-digital PDF documents containing tables. These documents have been collected systematically from the European Union and US Government websites.

PDF-TREX Table Recognition and Extraction data set

The freely available PDF-TREX dataset is a standard dataset in the TREX (Table Recognition and EXtraction) field. The dataset contains 100 PDF documents and 164 tables having different layouts.

US National Library of Medicine - National Institutes of Health

A collection of scientific PDF publications included in the PubMed Central Open Access Subset (commercial use collection).

OpenLibrary.org

Scanned books and other publications can be downloaded in PDF format. Note that many such PDFs are quite large.

International Labour Organization

This is a corpus of approximately 730 legacy PDF documents (from 2008 onwards). However, it has somewhat limited variability in PDF technical and syntactic constructs.

Mikal's "pdfdb"

A database referenced from StackOverflow (https://stackoverflow.com/questions/14386393/pdf-specification-compliance-testing-sample-files) that is no longer directly available.

Artifex MuPDF Public Test Corpus

Public test corpus for Ghostscript/MuPDF, maintained by Artifex (https://mupdf.com/).

Qiqqa's "Evil Base" Test Corpus

Test corpus used by the Qiqqa PDF document management software for testing various PDF-centric processes (metadata extraction, text extraction and OCR for meta-search & ~-research, page rendering/viewing, ...).

WARNING: Be aware that this corpus includes malformed, invalid and malicious PDFs, which serve as an acid test for robustness testing production-level PDF processors. Cave canem.

Other formats

Internally PDF supports many so-called "nested formats", such as JPEG, JPEG 2000, JBIG2, ICC and font programs, as well as conversion from other formats. Thus sources of corpora in other formats may also be of interest to the broader PDF community. Note that PDF can and does technically limit the scope of what certain nested formats can contain, so do not assume that all files in these corpora are valid for nesting inside PDF! Always refer to the latest PDF specification (ISO 32000-2, PDF 2.0) for all technical requirements.
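
When pulling test data from such corpora, it helps to confirm what a file actually is before treating it as a candidate for nesting. The sketch below checks a few well-known file signatures. The function name `sniff_format` is hypothetical, and a matching signature says nothing about whether the data meets PDF's additional restrictions. For instance, JBIG2 data embedded in a PDF does not carry the standalone file header detected here.

```python
# Well-known leading signatures for standalone files of each format.
JPEG_SIG = bytes.fromhex("ffd8ff")
JP2_SIG = bytes.fromhex("0000000c6a5020200d0a870a")  # JP2 signature box
J2K_CODESTREAM_SIG = bytes.fromhex("ff4fff51")       # SOC + SIZ markers
JBIG2_SIG = bytes.fromhex("974a42320d0a1a0a")


def sniff_format(data: bytes) -> str:
    """Guess a format from magic bytes; returns 'unknown' if nothing matches."""
    if data.startswith(JPEG_SIG):
        return "JPEG"
    if data.startswith(JP2_SIG):
        return "JPEG 2000 (JP2 file)"
    if data.startswith(J2K_CODESTREAM_SIG):
        return "JPEG 2000 (raw codestream)"
    if data.startswith(JBIG2_SIG):
        return "JBIG2 (standalone file)"
    # ICC profiles carry the 'acsp' signature at byte offset 36.
    if data[36:40] == b"acsp":
        return "ICC profile"
    return "unknown"


print(sniff_format(bytes.fromhex("ffd8ffe0")))  # prints JPEG
print(sniff_format(b"%PDF-1.7\n"))              # prints unknown
```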

PRImA Labs

The University of Salford Pattern Recognition & Image Analysis Research Lab (PRImA) provide many image-based data sets "ranging from historical books and newspapers to contemporary documents", that have been "collected, ground-truthed and organised [as] a number of datasets which are available for research and/or personal use".

IEEE DataPort

The IEEE DataPort contains some open access data sets. Although mainly focused on machine learning, many data sets are image-based.

AWS Open Data Program

Both GovDocs1 and CommonCrawl are part of the AWS Open Data Program (see above), but there are also many other data sets (mostly image and video related). Data is stored in AWS S3.

Microsoft Research Open Data

The MSR Open Data datasets provide a convenient UI for selecting datasets based on format (file type) which includes PDF, docx and png. Datasets are stored in Azure.

ICC Profiles

PDF files can contain ICC profiles to provide device-independent definitions for color. As a result, all PDF viewers, PDF renderers and many other PDF processors need to have robust handling of ICC profiles. The following links provide ICC-based corpora that PDF developers may therefore find useful (both valid and invalid ICC profiles):
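
As an illustration of the kind of robustness check involved, the sketch below looks at two fields of the fixed 128-byte ICC profile header: the big-endian profile size in bytes 0-3 and the `acsp` signature at offset 36. The function name `icc_header_problems` is hypothetical and this is far from a full validation; real profiles also need their tag table and tag contents checked.

```python
import struct
from typing import List

ICC_HEADER_SIZE = 128


def icc_header_problems(data: bytes) -> List[str]:
    """Return basic header problems found; an empty list means none found."""
    if len(data) < ICC_HEADER_SIZE:
        return ["shorter than the 128-byte ICC header"]
    problems: List[str] = []
    # Bytes 0-3 hold the total profile size as a big-endian unsigned integer.
    (declared_size,) = struct.unpack(">I", data[0:4])
    if declared_size != len(data):
        problems.append(
            f"declared size {declared_size} does not match actual size {len(data)}"
        )
    if data[36:40] != b"acsp":
        problems.append("missing 'acsp' signature at offset 36")
    return problems


# Usage with a synthetic, header-only profile rather than a real one.
header = bytearray(ICC_HEADER_SIZE)
header[0:4] = struct.pack(">I", ICC_HEADER_SIZE)
header[36:40] = b"acsp"
print(icc_header_problems(bytes(header)))  # prints []
```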

Legal

In accordance with Title 17 U.S.C. Section 107, the material in this document is distributed without profit to those who have an interest in understanding interoperability of PDF files, including for research and educational purposes. If you wish to use the copyrighted material of others that is referenced in this document for purposes of your own that go beyond 'fair use', it is your responsibility to obtain permission from the relevant copyright owner.

The PDF Association does not warrant the accuracy, timeliness or completeness of the information contained in this document. All copyright and trademarks remain with their respective owners. If you have a particular complaint about something you’ve read here, please contact us.
