Datashare - Better analyze information, in all its forms
Sumy - Module for automatic summarization of text documents and HTML pages.
Srt - A simple library for parsing, modifying, and composing SRT files.
Breadability - Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Php Apache Tika - Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Unipdf - Golang PDF library for creating and processing PDF files (pure Go)
Pdfio.jl - PDF Reader Library for Native Julia.
Tika Python - Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Articleparse - Heuristic text extraction from news sites in Python 3
Unidoc - This repository has moved! https://github.com/unidoc/unipdf
Justext - Heuristic-based boilerplate removal tool
Nlp - [UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Pdftools - Text Extraction, Rendering and Converting of PDF Documents
ocr - Simple app to extract text from pictures using Tesseract
trafilatura - Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
mobi - Python-based software to unpack kindlegen-generated ebooks
pd3f - 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based