camelot-dev / Excalibur

License: MIT

A web interface to extract tabular data from PDFs

Labels

html pdf table extract

Projects that are alternatives to, or similar to, Excalibur

Camelot

Camelot: PDF Table Extraction for Humans

Stars: ✭ 3,150 (+243.89%)

Mutual labels: extract, pdf, table

Pdflayouttextstripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

Stars: ✭ 1,369 (+49.45%)

Mutual labels: extract, pdf

Easytable

Small table drawing library built upon Apache PDFBox

Stars: ✭ 136 (-85.15%)

Mutual labels: pdf, table

Open Semantic Etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database

Stars: ✭ 165 (-81.99%)

Mutual labels: extract, pdf

Pdfsam

PDFsam, a desktop application to extract pages, split, merge, mix and rotate PDF files

Stars: ✭ 1,829 (+99.67%)

Mutual labels: extract, pdf

tabula-sharp

Extract tables from PDF files (port of tabula-java)

Stars: ✭ 38 (-95.85%)

Mutual labels: table, extract

Url To Pdf Api

Web page PDF/PNG rendering done right. Self-hosted service for rendering receipts, invoices, or any content.

Stars: ✭ 6,544 (+614.41%)

Mutual labels: pdf

Buka

Buka is a modern software that helps you manage your ebook at ease.

Stars: ✭ 896 (-2.18%)

Mutual labels: pdf

Go Pretty

Pretty print tables and more in golang!

Stars: ✭ 777 (-15.17%)

Mutual labels: table

Sumatrapdf

SumatraPDF reader

Stars: ✭ 7,462 (+714.63%)

Mutual labels: pdf

Markdown2document

turn markdown files to a PDF or HTML document

Stars: ✭ 22 (-97.6%)

Mutual labels: pdf

Extrakt

Extract .tar(.gz) using the system binary (fast!), with a javascript fallback (portable!)

Stars: ✭ 19 (-97.93%)

Mutual labels: extract

Laravel Bootstrap Table List

Bootstrap table list generator for Laravel.

Stars: ✭ 16 (-98.25%)

Mutual labels: table

Booktype

Booktype is a free, open source platform that produces beautiful, engaging books formatted for print, Amazon, iBooks and almost any ereader within minutes.

Stars: ✭ 810 (-11.57%)

Mutual labels: pdf

Chr

🔤 Lightweight R package for manipulating [string] characters

Stars: ✭ 18 (-98.03%)

Mutual labels: extract

Jsreport

javascript based business reporting platform 🚀

Stars: ✭ 798 (-12.88%)

Mutual labels: pdf

Itksoftwareguide

Sources for the ITKSoftwareGuide.

Stars: ✭ 19 (-97.93%)

Mutual labels: pdf

Readabilitykit

Preview extractor for news, articles and full-texts in Swift

Stars: ✭ 756 (-17.47%)

Mutual labels: extract

Jquery Rslitegrid

Input tabular data with your keyboard

Stars: ✭ 5 (-99.45%)

Mutual labels: table

Reactabular

A framework for building the React table you need (MIT)

Stars: ✭ 903 (-1.42%)

Mutual labels: table

Excalibur: A web interface to extract tabular data from PDFs

Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It is powered by Camelot.

Note: Excalibur only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Using Excalibur

Note: You need to install ghostscript before moving forward.
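As a quick preflight, the sketch below looks for a Ghostscript executable on the PATH. The helper name `find_ghostscript` and the list of candidate names are assumptions made for illustration. Finding the executable is only a heuristic: Camelot may need the Ghostscript library rather than the binary, so a hit here does not guarantee that extraction will work.

```python
import shutil
from typing import Optional

# Common executable names: "gs" on Linux/macOS, "gswin64c"/"gswin32c" on Windows.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in GHOSTSCRIPT_NAMES:
        path = shutil.which(name)
        if path:
            return path
    return None
```

Calling `find_ghostscript()` returns a path string when one of the names resolves and `None` otherwise, which is a reasonable cue to revisit the Ghostscript install step.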

After installing Excalibur with pip, you need to initialize the metadata database using:

$ excalibur initdb

And then start the webserver using:

$ excalibur webserver

That's it! Now you can go to http://localhost:5000 and start extracting tabular data from your PDFs.
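If the page does not load, a small standard-library check can tell you whether anything is answering at that address. This is a sketch, not part of Excalibur; the function name `excalibur_is_up` is hypothetical, and the default URL is simply the one mentioned above.

```python
import urllib.error
import urllib.request


def excalibur_is_up(url: str = "http://localhost:5000", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False
```

`excalibur_is_up()` returning False usually means the webserver command is not running, or is listening on a different host or port.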

1. Upload a PDF and enter the page numbers you want to extract tables from.
2. Go to each page and select the table by drawing a box around it. (You can skip this step, since Excalibur can automatically detect tables on its own. Click on "Autodetect tables" to see what Excalibur sees.)
3. Choose a flavor (Lattice or Stream) from "Advanced".

a. Lattice: For tables formed with lines.

b. Stream: For tables formed with whitespace.
4. Click on "View and download data" to see the extracted tables.
5. Select your favorite format (CSV/Excel/JSON/HTML) and click on "Download"!
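Because Excalibur is powered by Camelot, the same choices (pages, flavor, optional table area) can be sketched directly against Camelot's Python API. The helper names `build_read_kwargs` and `extract_to_csv` are hypothetical, and this is not a description of what Excalibur does internally; only `camelot.read_pdf` and the table list's `export` method come from Camelot itself.

```python
from typing import Any, Dict, List, Optional

VALID_FLAVORS = ("lattice", "stream")


def build_read_kwargs(pages: str, flavor: str,
                      table_areas: Optional[List[str]] = None) -> Dict[str, Any]:
    """Collect keyword arguments for camelot.read_pdf, mirroring the UI choices."""
    flavor = flavor.lower()
    if flavor not in VALID_FLAVORS:
        raise ValueError(f"flavor must be one of {VALID_FLAVORS}, got {flavor!r}")
    kwargs: Dict[str, Any] = {"pages": pages, "flavor": flavor}
    if table_areas:
        # Each area is an "x1,y1,x2,y2" string in PDF coordinate space
        # (left-top corner, then right-bottom corner).
        kwargs["table_areas"] = table_areas
    return kwargs


def extract_to_csv(pdf_path: str, out_path: str, **read_kwargs: Any) -> int:
    """Run Camelot on pdf_path, export the tables as CSV, return how many were found."""
    import camelot  # imported lazily; assumes camelot-py is installed

    tables = camelot.read_pdf(pdf_path, **read_kwargs)
    tables.export(out_path, f="csv")
    return len(tables)
```

For example, `extract_to_csv("report.pdf", "report.csv", **build_read_kwargs("1,3-5", "lattice"))` would process pages 1 and 3 to 5 of a hypothetical `report.pdf` with the Lattice flavor.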

Note: You can also download executables for Windows and Linux from the releases page and run them directly!

Why Excalibur?

Extracting tables from PDFs is hard. A simple copy-and-paste from a PDF into an Excel sheet doesn't preserve table structure. Excalibur makes PDF table extraction very easy, by automatically detecting tables in PDFs and letting you save them into CSVs and Excel files.
Excalibur uses Camelot under the hood, which gives you additional settings to tweak table extraction and get the best results. You can see how it performs better than other open-source tools and libraries in this comparison.
You can save table extraction settings (like table areas) for a PDF once, and apply them on new PDFs to extract tables with similar structures.
You get complete control over your data. All file storage and processing happens on your own local or remote machine.
Excalibur can be configured with MySQL and Celery for parallel and distributed workloads. By default, sqlite and multiprocessing are used for sequential workloads.
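The idea of saving extraction settings once and reusing them on similar PDFs can be sketched with a JSON file and Camelot. The JSON layout and the helper names below are hypothetical and are not Excalibur's own rule format; the keys are simply keyword arguments that `camelot.read_pdf` accepts.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_rule(rule: Dict[str, Any], path: str) -> None:
    """Write extraction settings (pages, flavor, table_areas, ...) to a JSON file."""
    Path(path).write_text(json.dumps(rule, indent=2), encoding="utf-8")


def load_rule(path: str) -> Dict[str, Any]:
    """Read extraction settings back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def apply_rule(pdf_path: str, rule: Dict[str, Any]):
    """Extract tables from pdf_path using previously saved settings."""
    import camelot  # imported lazily; assumes camelot-py is installed

    return camelot.read_pdf(pdf_path, **rule)
```

A rule such as `{"pages": "1", "flavor": "stream", "table_areas": ["50,700,550,100"]}` could be saved once and then passed to `apply_rule` for each new PDF that shares the same layout.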

Installation

Using pip

After installing ghostscript, which is one of the requirements for Camelot (see the install instructions), you can simply use pip to install Excalibur:

$ pip install excalibur-py

From the source code

After installing ghostscript, clone the repo using:

$ git clone https://www.github.com/camelot-dev/excalibur

and install Excalibur using pip:

$ cd excalibur
$ pip install .

Documentation

Fantastic documentation is available at http://excalibur-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/excalibur

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install excalibur-py[dev]

Testing (soon)

After installation, you can run tests using:

$ python setup.py test

Versioning

Excalibur uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Support the development

You can support our work on Excalibur with a one-time or monthly donation on OpenCollective. Organizations who use Excalibur can also sponsor the project for an acknowledgement on our official site and this README.

Special thanks to all the users and organizations that support Excalibur!

Stars: ✭ 916