pd3f / pd3f

License: AGPL-3.0
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Programming Languages

HTML
75241 projects
python
139335 projects - #7 most used programming language
shell
77523 projects
Dockerfile
14818 projects

Projects that are alternatives of or similar to pd3f

Bio embeddings
Get protein embeddings from protein sequences
Stars: ✭ 86 (-34.85%)
Mutual labels:  pipeline, language-model
Docotic.Pdf.Samples
C# and VB.NET samples for Docotic.Pdf library
Stars: ✭ 52 (-60.61%)
Mutual labels:  pdf-to-text, extract-text
Flowcraft
FlowCraft: a component-based pipeline composer for omics analysis using Nextflow. 🐳📦
Stars: ✭ 208 (+57.58%)
Mutual labels:  pipeline
assume-role-arn
🤖🎩 assume-role-arn allows you to easily assume an AWS IAM role in your CI/CD pipelines, without worrying about external dependencies.
Stars: ✭ 54 (-59.09%)
Mutual labels:  pipeline
Mipt Mips
Cycle-accurate pre-silicon simulator of RISC-V and MIPS CPUs
Stars: ✭ 250 (+89.39%)
Mutual labels:  pipeline
Hkube
🐟 High Performance Computing over Kubernetes - Core Repo 🎣
Stars: ✭ 214 (+62.12%)
Mutual labels:  pipeline
Al usdmaya
This repo is no longer updated. Please see https://github.com/Autodesk/maya-usd
Stars: ✭ 253 (+91.67%)
Mutual labels:  pipeline
Whispers
Identify hardcoded secrets and dangerous behaviours
Stars: ✭ 66 (-50%)
Mutual labels:  pipeline
targets-tutorial
Short course on the targets R package
Stars: ✭ 87 (-34.09%)
Mutual labels:  pipeline
Docker Android Build Box
An optimized docker image includes Android, Kotlin, Flutter sdk.
Stars: ✭ 245 (+85.61%)
Mutual labels:  pipeline
gofast
High performance transport protocol for distributed applications.
Stars: ✭ 19 (-85.61%)
Mutual labels:  pipeline
Cli
A CLI for interacting with Tekton!
Stars: ✭ 229 (+73.48%)
Mutual labels:  pipeline
Redispipe
High-throughput Redis client for Go with implicit pipelining
Stars: ✭ 215 (+62.88%)
Mutual labels:  pipeline
frizzle
The magic message bus
Stars: ✭ 14 (-89.39%)
Mutual labels:  pipeline
Bulk Writer
Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entites to table columns.
Stars: ✭ 210 (+59.09%)
Mutual labels:  pipeline
TF-NNLM-TK
A toolkit for neural language modeling using Tensorflow including basic models like RNNs and LSTMs as well as more advanced models.
Stars: ✭ 20 (-84.85%)
Mutual labels:  language-model
Shifu
An end-to-end machine learning and data mining framework on Hadoop
Stars: ✭ 207 (+56.82%)
Mutual labels:  pipeline
Automlpipeline.jl
A package that makes it trivial to create and evaluate machine learning pipeline architectures.
Stars: ✭ 223 (+68.94%)
Mutual labels:  pipeline
Morphl Community Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (+91.67%)
Mutual labels:  pipeline
nextNEOpi
nextNEOpi: a comprehensive pipeline for computational neoantigen prediction
Stars: ✭ 42 (-68.18%)
Mutual labels:  pipeline

pd3f

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphenation at line breaks. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and / or spaces. It uses language models to guess what the original text looked like.
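To make the idea concrete, here is a minimal sketch of joining lines across a hyphenated line break. It is not pd3f-core's implementation: the functions `join_line_break` and `dehyphenate` are hypothetical, and the small `KNOWN_WORDS` set is only a toy stand-in for the language models that pd3f-core uses to choose between candidates.

```python
# Toy stand-in for a language model: a set of known words (an assumption
# for illustration; pd3f-core uses real language models instead).
KNOWN_WORDS = {"extraction", "well-known", "pipeline"}


def _normalize(word: str) -> str:
    """Lowercase a word and strip surrounding punctuation before lookup."""
    return word.strip(".,;:!?").lower()


def join_line_break(left: str, right: str) -> str:
    """Join two lines, deciding what to do with a trailing hyphen on `left`."""
    if not left.endswith("-"):
        # No hyphen: the line break simply becomes a space.
        return f"{left} {right}"

    # Split off the word fragments on both sides of the line break.
    head, _, last = left[:-1].rpartition(" ")
    first, _, tail = right.partition(" ")

    merged = last + first            # e.g. "extrac" + "tion"
    hyphenated = f"{last}-{first}"   # e.g. "well" + "-" + "known"

    # Keep the hyphen only if the "model" knows the hyphenated form
    # and does not know the merged form; otherwise remove the hyphen.
    keep_hyphen = (
        _normalize(hyphenated) in KNOWN_WORDS
        and _normalize(merged) not in KNOWN_WORDS
    )
    word = hyphenated if keep_hyphen else merged

    return " ".join(part for part in (head, word, tail) if part)


def dehyphenate(lines: list) -> str:
    """Fold a list of lines into one continuous string."""
    text = lines[0]
    for line in lines[1:]:
        text = join_line_break(text, line)
    return text


print(dehyphenate(["PDF text extrac-", "tion is hard."]))
print(dehyphenate(["A well-", "known pipeline"]))
```

A real system would replace the vocabulary lookup with a score from a language model, which is what lets the approach generalize to long compound words such as those common in German.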

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process, and it's hard to extract information from them, so the results of this tool may not satisfy you. There will be more work to improve this software, but altogether it's unlikely that it will successfully extract all the information anytime soon.

Here are some things that will be improved:

statistics about how long processing (per page) took in the past

  • calculate runtime based on job.started_at and job.ended_at
  • get average runtime of jobs and store data in a Redis list
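
The runtime statistics planned above could be sketched roughly as follows. This is an assumption-laden illustration, not existing pd3f code: the `Job` dataclass is a hypothetical stand-in for a queue job that exposes `started_at` and `ended_at`, the `pages` field is invented for the example, and a plain Python list takes the place of the Redis list (with redis-py, `rpush`, `ltrim` and `lrange` could play the same role).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical stand-in for a finished queue job."""
    started_at: datetime
    ended_at: datetime
    pages: int  # assumed field: number of pages in the processed PDF


def seconds_per_page(job: Job) -> float:
    """Runtime of a job in seconds, divided by its page count."""
    runtime = (job.ended_at - job.started_at).total_seconds()
    return runtime / max(job.pages, 1)


def record(history: List[float], job: Job, keep: int = 100) -> None:
    """Append the job's per-page runtime, keeping only the newest `keep` entries."""
    history.append(seconds_per_page(job))
    del history[:-keep]


def estimate(history: List[float], pages: int) -> Optional[float]:
    """Estimate the runtime for a new PDF from the average of past jobs."""
    if not history:
        return None
    return mean(history) * pages


start = datetime(2024, 1, 1, 12, 0, 0)
history: List[float] = []
record(history, Job(start, start + timedelta(seconds=10), pages=5))
record(history, Job(start, start + timedelta(seconds=12), pages=3))
print(estimate(history, pages=10))
```

Capping the history at a fixed length keeps the estimate responsive to recent changes in hardware or load; whether that is the right trade-off for pd3f is an open design choice.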

more information about PDF

  • NER
  • entity linking
  • extract keywords
  • use textacy

add more languages

  • check if flair has a model
  • what to do if there is no fast model?

Python client

  • simple client based on requests
  • send whole folders

Markdown / HTML export

  • go beyond text

use pdf-scripts / allow more processing

  • reduce size
  • repair PDF
  • detect if scanned
  • force to OCR again

improve logs / get better feedback

  • show uncertainty of ML model
  • allow different log levels

Related Work

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be built. Right now Docker + poetry is not able to cache the installs, so building the image every time is uncool.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0
