jfilter / pdf-scripts

Licence: GPL-3.0 License
📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs

PDF Scripts

Scripts (mostly Bash) to repair, verify, OCR, compress (etc.) PDFs.

Currently in beta status, so expect backward-incompatible changes.

Install

You need to have Bash installed.

The scripts use several software libraries. setup.sh installs them for macOS (via brew) or Ubuntu/Debian.

Usage

  1. Go to the root of this repository: cd pdf-scripts
  2. Execute the script: ./pipeline.sh -l deu /path/to/document-in-german.pdf

Please refer to the scripts for the command-line arguments and options. NB: Options cannot be combined, e.g., use -x -y instead of -xy.
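
The restriction on combined options is typical of hand-written argument loops. The following is a minimal sketch, assuming a while/case parser; the function name parse_args and the flags -x and -y are placeholders for illustration and are not taken from the actual scripts.

```bash
#!/usr/bin/env bash
# Hypothetical parser: every option must be its own word, so "-xy" is rejected.
parse_args() {
  local opt_x=0 opt_y=0 lang=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -x) opt_x=1 ;;
      -y) opt_y=1 ;;
      -l)
        # -l takes a value (e.g. a language code such as "deu")
        [ "$#" -ge 2 ] || { echo "-l needs a value" >&2; return 1; }
        lang="$2"
        shift
        ;;
      -*)
        echo "unknown option: $1" >&2
        return 1
        ;;
      *) break ;; # first non-option argument: stop parsing
    esac
    shift
  done
  echo "x=$opt_x y=$opt_y lang=$lang rest=$*"
}
```

With such a loop, parse_args -x -y in.pdf is accepted, while parse_args -xy in.pdf fails because "-xy" matches none of the known option words.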

Most scripts work on individual PDFs as well as on folders full of PDFs.
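
One way to support both a single PDF and a folder is sketched below. The helper name for_each_pdf is hypothetical, and the real scripts may handle folders differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command on one PDF, or on every PDF below a folder.
# Usage: for_each_pdf <file-or-folder> <command> [args...]
for_each_pdf() {
  local target="$1"
  shift
  if [ -d "$target" ]; then
    # -print0 with read -d '' keeps file names containing spaces intact
    find "$target" -type f -iname '*.pdf' -print0 |
      while IFS= read -r -d '' pdf; do
        # </dev/null stops the command from consuming the file list on stdin
        "$@" "$pdf" </dev/null
      done
  elif [ -f "$target" ]; then
    "$@" "$target"
  else
    echo "not a file or directory: $target" >&2
    return 1
  fi
}
```

For example, for_each_pdf ./scans echo prints the path of every PDF found below ./scans.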

Overview

ocr_pdf.sh

OCR PDFs with OCRmyPDF.

repair_pdf.sh

Uses pdftocairo from Poppler, mutool clean from MuPDF, and qpdf.

Caveat: May remove text in OCRd PDFs. Use --check to check for OCRd text in order to preserve it.
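
How the three tools could be chained is sketched below. This is an assumption, not the script's confirmed logic: the function name repair_pdf_sketch, the "stop at the first success" strategy, and the order of the tools are all illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical fallback chain: try each tool until one of them succeeds.
# The order is an assumption (structural rewrites first, re-rendering last).
repair_pdf_sketch() {
  local input="$1" output="$2"
  # qpdf rewrites the file structure (cross-reference table, object streams)
  qpdf "$input" "$output" 2>/dev/null && return 0
  # mutool clean rewrites the file using MuPDF's parser
  mutool clean "$input" "$output" 2>/dev/null && return 0
  # pdftocairo -pdf re-renders every page through cairo
  pdftocairo -pdf "$input" "$output" 2>/dev/null && return 0
  echo "could not repair: $input" >&2
  return 1
}
```

Note that qpdf exits with a non-zero status when it only emits warnings, so a stricter version might inspect the exit code instead of treating every non-zero status as failure.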

verify_pdf.sh

Checks whether text can be extracted from the PDF, i.e., whether it already contains a text layer.
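
A check of this kind can be sketched with pdftotext from Poppler. The function name has_extractable_text is hypothetical, and the actual script may rely on a different tool or threshold.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed if pdftotext finds at least one letter or digit.
has_extractable_text() {
  local pdf="$1"
  # "-" sends the extracted text to stdout instead of writing a .txt file
  pdftotext "$pdf" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

The function returns a plain exit status, so it can be used directly in a condition such as: if has_extractable_text scan.pdf; then echo "has text"; fi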

compress_pdf.sh

Uses Ghostscript to compress images in PDFs.
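
A common Ghostscript invocation for this kind of compression looks roughly like the sketch below. The wrapper name compress_pdf_sketch and the default preset are illustrative assumptions; the flags the real script passes may differ.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around Ghostscript's pdfwrite device.
# Usage: compress_pdf_sketch <input.pdf> <output.pdf> [preset]
compress_pdf_sketch() {
  local input="$1" output="$2" preset="${3:-ebook}"
  # screen, ebook, printer and prepress are built-in PDFSETTINGS presets;
  # screen downsamples images the most, prepress the least.
  gs -sDEVICE=pdfwrite \
    -dPDFSETTINGS="/$preset" \
    -dNOPAUSE -dBATCH -dQUIET \
    -sOutputFile="$output" \
    "$input"
}
```

For example, compress_pdf_sketch in.pdf out.pdf screen trades image quality for the smallest file.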

reduce_size_pdf.sh

Uses compress_pdf.sh and, in addition, pdfsizeopt to reduce the file size of PDFs.

clean_metadata_pdf.sh

Remove metadata with exiftool.
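
A minimal sketch of metadata removal follows. The function name strip_metadata is hypothetical, and the extra qpdf step is an assumption that is not confirmed for the actual script: ExifTool edits PDFs through an incremental update, so the removed tags can be recovered until the file is rewritten.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy the PDF, drop its metadata, then rewrite the file.
# Usage: strip_metadata <input.pdf> <output.pdf>
strip_metadata() {
  local input="$1" output="$2"
  cp "$input" "$output" &&
    # -all= deletes all writable tags; -overwrite_original skips the backup copy
    exiftool -all= -overwrite_original "$output" &&
    # rewriting with qpdf discards the old metadata objects for good
    qpdf --linearize --replace-input "$output"
}
```

The original file is left untouched; only the copy at the output path is modified.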

is_ocrd_pdf.sh

Detect OCRd PDFs. See also sort_ocrd_pdfs.sh to sort PDFs.

pipeline.sh

Combines several of the above scripts.

FAQ

Why Bash?

Bash is still the most-used shell, and the scripts consist mostly of simple conditionals and sequences of CLI commands. This could also be done with Python's psutil, but that would add yet another layer. However, at some point I will most probably port the scripts to plain POSIX shell.

Related Work

Development

  • focus on Bash v4+
  • write Python 3.6+ scripts if Bash gets too complicated
  • use Docker images if available
  • should run on the major Unix-like OSs (Linux (e.g. Ubuntu), macOS)
  • format code with shfmt, e.g., extension for VS Code
  • lint scripts with shellcheck, e.g., extension for VS Code

Common Commands

Concat PDFs into one PDF

qpdf --empty --pages *.pdf -- out.pdf

Images to PDF

convert *.jpg pictures.pdf

Rotate PDFs

qpdf in.pdf out.pdf --rotate=+90
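
qpdf's --rotate option also accepts an optional page range after a colon, so single pages can be rotated. The wrapper below is a sketch; the function name rotate_pages and its argument order are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around qpdf --rotate=[+|-]angle[:page-range].
# Usage: rotate_pages <input.pdf> <output.pdf> <90|180|270> [page-range]
rotate_pages() {
  local input="$1" output="$2" angle="$3" range="${4:-}"
  case "$angle" in
    90 | 180 | 270) ;;
    *)
      echo "angle must be 90, 180 or 270" >&2
      return 1
      ;;
  esac
  # "+" rotates relative to the current orientation; the optional
  # ":range" suffix limits the rotation to the listed pages.
  qpdf "$input" "$output" --rotate="+${angle}${range:+:$range}"
}
```

For example, rotate_pages in.pdf out.pdf 90 2,4-6 rotates only pages 2, 4, 5 and 6 clockwise by 90 degrees.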

License

GPLv3.
