atlanhq / Camelot

Licence: other
Camelot: PDF Table Extraction for Humans

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to Camelot

Excalibur
A web interface to extract tabular data from PDFs
Stars: ✭ 916 (-70.92%)
Mutual labels:  extract, pdf, table
Easytable
Small table drawing library built upon Apache PDFBox
Stars: ✭ 136 (-95.68%)
Mutual labels:  pdf, table
Pdfsam
PDFsam, a desktop application to extract pages, split, merge, mix and rotate PDF files
Stars: ✭ 1,829 (-41.94%)
Mutual labels:  extract, pdf
Open Semantic Etl
Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
Stars: ✭ 165 (-94.76%)
Mutual labels:  extract, pdf
Pdflayouttextstripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
Stars: ✭ 1,369 (-56.54%)
Mutual labels:  extract, pdf
tabula-sharp
Extract tables from PDF files (port of tabula-java)
Stars: ✭ 38 (-98.79%)
Mutual labels:  table, extract
Deck
Slide Decks
Stars: ✭ 261 (-91.71%)
Mutual labels:  pdf
Starter Book
A book starter to kickstart your writing journey 🎉
Stars: ✭ 277 (-91.21%)
Mutual labels:  pdf
Uxmpdfkit
An iOS PDF viewer and annotator written in Swift that can be embedded into any application.
Stars: ✭ 260 (-91.75%)
Mutual labels:  pdf
Receipts
Easy receipts and invoices for your Rails applications
Stars: ✭ 259 (-91.78%)
Mutual labels:  pdf
Flextable
table farming
Stars: ✭ 288 (-90.86%)
Mutual labels:  table
Pdf Flipbook
Browse PDF document like a book turning its pages
Stars: ✭ 279 (-91.14%)
Mutual labels:  pdf
Hummusrecipe
A powerful PDF tool for NodeJS based on HummusJS.
Stars: ✭ 274 (-91.3%)
Mutual labels:  pdf
Gridjs
Advanced table plugin
Stars: ✭ 3,231 (+2.57%)
Mutual labels:  table
Quickbill
Create unlimited invoices for free.
Stars: ✭ 278 (-91.17%)
Mutual labels:  pdf
Tableexport
tableExport(table导出文件,支持json、csv、txt、xml、word、excel、image、pdf)
Stars: ✭ 261 (-91.71%)
Mutual labels:  pdf
Pdfocr
Adds text to PDF files using the cuneiform OCR software
Stars: ✭ 287 (-90.89%)
Mutual labels:  pdf
Shaark
Self-hosted platform to keep and share your content: web links, posts, passwords and pictures.
Stars: ✭ 258 (-91.81%)
Mutual labels:  pdf
Thinreports Generator
Report Generator for Ruby
Stars: ✭ 268 (-91.49%)
Mutual labels:  pdf
Reptile
爬取机械工业出版社所有的计算机方面的书
Stars: ✭ 282 (-91.05%)
Mutual labels:  pdf

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!


Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

  • You are in control: Unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML and SQLite.
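
As a rough illustration of discarding bad tables by metrics, the sketch below keeps only tables whose parsing report passes two thresholds. The helper name filter_tables and the threshold values are assumptions made for this example, not part of Camelot; the only thing it relies on is the parsing_report dictionary with 'accuracy' and 'whitespace' keys shown in the example above. Stand-in objects are used so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def filter_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds.

    `tables` is any iterable of objects exposing a `parsing_report` dict
    with 'accuracy' and 'whitespace' keys. The default thresholds are
    illustrative assumptions, not recommended values.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Hypothetical stand-ins for extracted tables, so the sketch runs on its own.
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]

good = filter_tables(sample)
print(len(good))
```

With Camelot installed, the same helper could be given the result of camelot.read_pdf('foo.pdf') instead of the stand-in list, and the surviving tables exported or converted to DataFrames as usual.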

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is with conda, a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install "camelot-py[cv]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[cv]"

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check out the latest sources with:

$ git clone https://www.github.com/camelot-dev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.