
gendx / pdf-corpus

Licence: MIT license
Python script to quickly create hand-crafted PDF files

Programming languages: Python, Makefile

Projects that are alternatives of or similar to pdf-corpus

nytwit
New York Times Word Innovation Types dataset
Stars: ✭ 21 (+23.53%)
Mutual labels:  corpus
thaigov-corpus
Project to collect news from Thai government websites
Stars: ✭ 19 (+11.76%)
Mutual labels:  corpus
TV4Dialog
No description or website provided.
Stars: ✭ 33 (+94.12%)
Mutual labels:  corpus
open2ch-dialogue-corpus
Dialogue corpus created by crawling Open 2channel
Stars: ✭ 65 (+282.35%)
Mutual labels:  corpus
BSD
The Business Scene Dialogue corpus
Stars: ✭ 51 (+200%)
Mutual labels:  corpus
kanji-frequency
Kanji usage frequency data collected from various sources
Stars: ✭ 92 (+441.18%)
Mutual labels:  corpus
json-path-comparison
Comparison of the different implementations of JSONPath and language agnostic test suite.
Stars: ✭ 64 (+276.47%)
Mutual labels:  test-suite
egret-wenda-corpus
A Public Corpus for Machine Learning
Stars: ✭ 41 (+141.18%)
Mutual labels:  corpus
mev-corpus
MEV Data Corpus
Stars: ✭ 77 (+352.94%)
Mutual labels:  corpus
LanguageCodes
We present a list of languages with their codes, families, regions and etc. We also present a list of multi-lingual corpora (with urls).
Stars: ✭ 70 (+311.76%)
Mutual labels:  corpus
malay-dataset
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (+1011.76%)
Mutual labels:  corpus
When-in-Rome
A meta-corpus of functional harmonic analysis.
Stars: ✭ 35 (+105.88%)
Mutual labels:  corpus
text-classification-cn
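Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pretrained models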
Stars: ✭ 81 (+376.47%)
Mutual labels:  corpus
gum
Repository for the Georgetown University Multilayer Corpus (GUM)
Stars: ✭ 71 (+317.65%)
Mutual labels:  corpus
jrte-corpus
Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
Stars: ✭ 66 (+288.24%)
Mutual labels:  corpus
ocr2text
Convert a PDF via OCR to a TXT file in UTF-8 encoding
Stars: ✭ 90 (+429.41%)
Mutual labels:  corpus
fortran-compiler-tests
A collection of Fortran compiler bug examples and tests
Stars: ✭ 31 (+82.35%)
Mutual labels:  test-suite
CBLUE
CBLUE, a Chinese medical information processing benchmark: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+2129.41%)
Mutual labels:  corpus
crystal-koans
The Crystal Programming Language Koans
Stars: ✭ 31 (+82.35%)
Mutual labels:  test-suite
TorXakis
A tool for Model Based Testing
Stars: ✭ 40 (+135.29%)
Mutual labels:  test-suite

PDF corpus

This project lets you quickly create hand-crafted PDF files. The main Python script, pdf-corpus.py, is an ad-hoc template engine for easily prototyping new PDFs.

Installation

To compile the corpus, run make (you need a Python interpreter). All .txt files in the corpus/ folder are then converted into PDFs.

Description

Each PDF in the corpus is described by a .txt file that indicates the template to use and the content to insert in the template. The following templates are defined, but you can easily create your own by tweaking the Python code.

  • contentstream: A simple document containing one page in A4 format. You define the graphic commands to put in the page's content stream (see my cheat sheet). For convenience, a font resource is declared as /F1.
  • objects: A lower-level template for declaring objects directly. Simple streams can be defined, and the template computes their /Length field.
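To give a feel for what a template like contentstream has to produce, here is a minimal sketch in Python. It is not the code from pdf-corpus.py, and the function name build_pdf is hypothetical. It assumes a Helvetica Type 1 font behind /F1 and an A4 MediaBox of 595 x 842 points, wraps the given content stream in a one-page document, and computes the /Length field and the cross-reference offsets.

```python
def build_pdf(content: bytes) -> bytes:
    """Wrap raw content-stream bytes in a minimal one-page A4 PDF.

    Illustrative sketch only; the real template in pdf-corpus.py may
    lay out its objects differently.
    """
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # A4 is roughly 595 x 842 points; the font is exposed as /F1.
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        # Assumption: a standard Helvetica font behind /F1.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        # /Length is the byte count of the stream data itself.
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: each entry is exactly 20 bytes long.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Calling build_pdf(b"BT /F1 24 Tf 72 770 Td (Hello) Tj ET") and writing the result to a file in binary mode should give a page that shows the word Hello near the top left corner in most readers.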

Available corpus

The corpus already contains some files. These examples are classified into the following categories.

  • corpus/contentstream/: Playing with graphics instructions.
  • corpus/name/: Escape sequences in names.
  • corpus/number/: How numbers are parsed.
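As a rough illustration of what the name and number categories exercise, the sketch below follows the rules of the PDF specification: in a name, # followed by two hexadecimal digits stands for one byte, and a number is an optionally signed run of digits with at most one period and no exponent. The helper names decode_pdf_name and is_pdf_number are hypothetical and not part of this project. How a malformed escape is treated here is an assumption, and real PDF readers may behave differently.

```python
import re

# "#" followed by exactly two hex digits encodes a single byte in a name.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")

# Integers and reals: optional sign, digits, at most one period, no exponent.
_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def decode_pdf_name(token: bytes) -> bytes:
    """Return the bytes of a PDF name token such as b"/Lime#20Green"."""
    if not token.startswith(b"/"):
        raise ValueError("a PDF name token starts with '/'")
    # Assumption: a '#' not followed by two hex digits is left untouched.
    return _HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), token[1:]
    )


def is_pdf_number(token: bytes) -> bool:
    """Tell whether a token is a well-formed PDF integer or real."""
    return _NUMBER.fullmatch(token) is not None
```

For example, decode_pdf_name(b"/Lime#20Green") gives b"Lime Green", and is_pdf_number accepts b"4." and b"-.002" but rejects b"6.02E23".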

If you want to learn more about how these examples work, have a look at my blog posts (introduction to PDF syntax). I also make one-page cheat sheets about PDF. For further details, you can dive into the PDF specification.

Disclaimer

Once compiled, these example files may not be fully compliant with the specification. In particular, they may be interpreted differently by different PDF readers.

License

MIT
