invoice-x / Invoice2data

Licence: MIT
Extract structured data from PDF invoices

Data extractor for PDF invoices - invoice2data

A command line tool and Python library to support your accounting process.

  1. extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR -- tesseract, tesseract4 or gvision (Google Cloud Vision).
  2. searches for regex in the result using a YAML-based template system
  3. saves results as CSV, JSON or XML or renames PDF files to match the content.
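
The three steps above can be sketched with the standard library alone. This is only an illustration of the idea, not invoice2data's own code: the sample text, the `search_fields` and `to_csv` helpers and the field names are assumptions made for this example, and the text extraction step is replaced by a hard-coded string.

```python
import csv
import io
import re

# Step 1 (stand-in): text as an extractor such as pdftotext might return it.
# This sample is made up for the sketch.
invoice_text = """Amazon Web Services, Inc.
Invoice Number: 42183017
TOTAL AMOUNT DUE ON August 3 , 2014   $4.11
"""

# Step 2: one regex per field, each with a single capture group.
field_patterns = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
}


def search_fields(text, patterns):
    """Return the first captured group for every pattern that matches."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


def to_csv(rows, fieldnames):
    """Step 3: serialise the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


result = search_fields(invoice_text, field_patterns)
csv_text = to_csv([result], ["invoice_number", "amount"])
print(csv_text)
```

Values stay strings in this sketch; converting amounts and dates to typed values is left out to keep it short.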

With the flexible template system you can:

  • precisely match content in PDF files
  • use plugins to match line items and tables
  • define static fields that are the same for every invoice
  • define custom fields needed in your organisation or process
  • have multiple regex per field (if layout or wording changes)
  • define currency
  • extract invoice-items using the lines-plugin developed by Holger Brunn

Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

Installation

  1. Install pdftotext

If possible get the latest xpdf/poppler-utils version. It's included with macOS Homebrew, Debian and Ubuntu. Without it, pdftotext won't parse tables in PDF correctly.

  2. Install invoice2data using pip

    pip install invoice2data

Usage

Basic usage. Process PDF files and write result to CSV.

  • invoice2data invoice.pdf
  • invoice2data *.pdf

Choose any of the following input readers:

  • pdftotext: invoice2data --input-reader pdftotext invoice.pdf
  • tesseract: invoice2data --input-reader tesseract invoice.pdf
  • pdfminer: invoice2data --input-reader pdfminer invoice.pdf
  • tesseract4: invoice2data --input-reader tesseract4 invoice.pdf
  • gvision: invoice2data --input-reader gvision invoice.pdf (needs the GOOGLE_APPLICATION_CREDENTIALS env var)

Choose any of the following output formats:

  • csv: invoice2data --output-format csv invoice.pdf
  • json: invoice2data --output-format json invoice.pdf
  • xml: invoice2data --output-format xml invoice.pdf

Save the output file under a custom name or in a specific folder

invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf

Note: You must specify --output-format for --output-name to take effect.

Specify folder with yml templates. (e.g. your suppliers)

invoice2data --template-folder ACME-templates invoice.pdf

Only use your own templates and exclude built-ins

invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf

Processes a folder of invoices and copies renamed invoices to new folder.

invoice2data --copy new_folder folder_with_invoices/*.pdf

Processes a single file and dumps whole file for debugging (useful when adding new templates in templates.py)

invoice2data --debug my_invoice.pdf

Recognize test invoices: invoice2data invoice2data/test/pdfs/* --debug

Use as Python Library

You can easily add invoice2data to your own Python scripts as a library.

from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')

Using in-house templates

from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

templates = read_templates('/path/to/your/templates/')
result = extract_data('path/to/my/file.pdf', templates=templates)

Template system

See invoice2data/extract/templates for existing templates. Just extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers. 80-20 rule. For a short tutorial on how to add new templates, see TUTORIAL.md.

Templates are written in YAML. Each template defines one or more keywords used to find the right template, one or more exclude_keywords to narrow the choice further, and regular expressions for the fields to be extracted. A field can also be a static value, like the full company name.

Template files are tried in alphabetical order.

We may extend them to feature options to be used during invoice processing.
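
A rough sketch of how such a selection could work is shown here. It is an assumption-laden illustration, not the library's implementation: the in-memory `templates` dict, the file names, and the `template_matches` and `pick_template` helpers are hypothetical, and keywords are compared as plain substrings for simplicity.

```python
# Hypothetical in-memory templates keyed by file name; real templates are
# YAML files on disk.
templates = {
    "com.amazon.aws.yml": {
        "issuer": "Amazon Web Services, Inc.",
        "keywords": ["Amazon Web Services"],
        "exclude_keywords": ["San Jose"],
    },
    "com.example.hosting.yml": {
        "issuer": "Example Hosting",
        "keywords": ["Example Hosting"],
    },
}


def template_matches(template, text):
    """Assumed rule: every keyword is present and no exclude_keyword is."""
    has_keywords = all(k in text for k in template["keywords"])
    excluded = any(k in text for k in template.get("exclude_keywords", []))
    return has_keywords and not excluded


def pick_template(templates, text):
    """Try templates in alphabetical order of file name; first match wins."""
    for name in sorted(templates):
        if template_matches(templates[name], text):
            return name
    return None


print(pick_template(templates, "Invoice from Amazon Web Services, Seattle"))
```

Because the first match wins, the alphabetical order only matters when more than one template would accept the same invoice.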

Example:

issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
exclude_keywords:
- San Jose
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
    start: Detail
    end: \* May include estimated US sales tax
    first_line: ^    (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
    line: (.*)\$(\d+\.\d+)
    last_line: VAT \*\*
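
To make the start, end and line settings more concrete, here is a minimal sketch of how line items could be pulled out of the text between two markers. It is not the lines plugin itself: the sample text and the `extract_lines` helper are made up for illustration, named groups are used in the line regex as an assumption, and first_line and last_line handling is left out.

```python
import re

# Made-up invoice text for the sketch.
text = """Summary
Detail
    EC2 compute $3.10
    S3 storage $1.01
* May include estimated US sales tax
Footer $9.99
"""

lines_config = {
    "start": r"Detail",
    "end": r"\* May include estimated US sales tax",
    "line": r"(?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)",
}


def extract_lines(text, config):
    """Apply the line regex to every row between the start and end markers."""
    start = re.search(config["start"], text)
    if not start:
        return []
    # Look for the end marker only after the start marker.
    end = re.compile(config["end"]).search(text, start.end())
    if not end:
        return []
    body = text[start.end():end.start()]
    items = []
    for raw in body.splitlines():
        match = re.search(config["line"], raw)
        if match:
            items.append({
                "description": match.group("description").strip(),
                "price_unit": float(match.group("price_unit")),
            })
    return items


items = extract_lines(text, lines_config)
print(items)
```

Rows outside the two markers, such as the footer amount in the sample, are ignored even though they would match the line regex.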

Development

If you are interested in improving this project, have a look at our developer guide to get you started quickly.

Roadmap and open tasks

  • integrate with online OCR?
  • try to 'guess' parameters for new invoice formats.
  • apply machine learning to guess new parameters?
