BobLd / tabula-sharp

Licence: MIT License
Extract tables from PDF files (port of tabula-java)

tabula-sharp

tabula-sharp is a library for extracting tables from PDF files. It is a port of tabula-java.

Windows Linux Mac OS

  • Supports .NET 5, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.5, 4.5.1, 4.5.2, 4.6, 4.6.1, 4.6.2, 4.7
  • No Java bindings

NuGet packages are available on the releases page and on www.nuget.org.

Differences with tabula-java

  • Uses PdfPig, and not PdfBox.
  • The coordinate system starts from the bottom-left point of the page (going up), and not from the top-left point (going down).
  • The NurminenDetectionAlgorithm is replaced by SimpleNurminenDetectionAlgorithm, because the original algorithm requires an image management library.
  • Table results might differ because of the way PdfPig builds the bounding box of each Letter.
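
The coordinate difference matters when reusing areas defined for tabula-java, which are given as top, left, bottom, right measured from the top-left corner. A minimal sketch of the conversion follows, written as C# top-level statements. `ToBottomLeftArea` is a hypothetical helper, not part of tabula-sharp, and it assumes an unrotated page whose origin is at (0, 0) and whose height is known.

```csharp
using System;

// Hypothetical helper (not part of tabula-sharp): converts a tabula-java style
// area (top-left origin, y growing downwards) into bottom-left based
// coordinates (y growing upwards).
// Assumption: the page is unrotated and its origin is at (0, 0).
static (double Left, double Bottom, double Right, double Top) ToBottomLeftArea(
    double pageHeight, double top, double left, double bottom, double right)
{
    // x values are unchanged; y values are mirrored around the page height
    return (left, pageHeight - bottom, right, pageHeight - top);
}

// Example: an area on a page that is 842 points high
var area = ToBottomLeftArea(pageHeight: 842, top: 100, left: 50, bottom: 300, right: 500);
Console.WriteLine($"left={area.Left}, bottom={area.Bottom}, right={area.Right}, top={area.Top}");
// prints "left=50, bottom=542, right=500, top=742"
```

The four resulting values can then be used to build whatever rectangle type the calling code expects.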

Usage

Stream mode - BasicExtractionAlgorithm

// Namespaces below are assumed from the tabula-sharp and PdfPig packages
using System.Collections.Generic;
using Tabula;
using Tabula.Detectors;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1); // first page
	
	// detect candidate table zones
	SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
	var regions = detector.Detect(page);
	
	// extract tables from the first candidate area only
	IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
	List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
	var table = tables[0];
	var rows = table.Rows;
}

Lattice mode - SpreadsheetExtractionAlgorithm

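// Namespaces below are assumed from the tabula-sharp and PdfPig packages
using System.Collections.Generic;
using Tabula;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1); // first page

	// lattice mode uses the ruling lines of the whole page to find cells
	IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
	List<Table> tables = ea.Extract(page);
	var table = tables[0];
	var rows = table.Rows;
}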

Results

Stream mode - BasicExtractionAlgorithm

example

Lattice mode - SpreadsheetExtractionAlgorithm

example

HELP WANTED

  • The original Java implementation uses STR trees in RectangleSpatialIndex. This port does not, so it might be a bit slower. Any help implementing a similar approach is welcome.
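
For anyone looking into this, the core of an STR (Sort-Tile-Recursive) tree is its packing step. The sketch below shows only the leaf-level packing, as C# top-level statements. It is not tabula-sharp or tabula-java code: `PackLeaves`, the tuple layout and the `capacity` parameter are all illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of STR leaf packing; names and types are hypothetical.
// Rectangles are plain tuples so that the example stays self-contained.
static List<List<(double MinX, double MinY, double MaxX, double MaxY)>> PackLeaves(
    IReadOnlyCollection<(double MinX, double MinY, double MaxX, double MaxY)> rects,
    int capacity)
{
    if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));

    var leaves = new List<List<(double MinX, double MinY, double MaxX, double MaxY)>>();
    if (rects.Count == 0) return leaves;

    // number of leaves needed, and number of vertical slices (about sqrt of that)
    int leafCount = (int)Math.Ceiling(rects.Count / (double)capacity);
    int sliceCount = (int)Math.Ceiling(Math.Sqrt(leafCount));
    int sliceSize = sliceCount * capacity;

    // 1. sort by centre x and cut into vertical slices
    var byX = rects.OrderBy(r => (r.MinX + r.MaxX) / 2).ToList();
    for (int i = 0; i < byX.Count; i += sliceSize)
    {
        // 2. inside each slice, sort by centre y and cut into leaves
        var slice = byX.Skip(i).Take(sliceSize)
                       .OrderBy(r => (r.MinY + r.MaxY) / 2)
                       .ToList();
        for (int j = 0; j < slice.Count; j += capacity)
        {
            leaves.Add(slice.Skip(j).Take(capacity).ToList());
        }
    }
    return leaves;
}

// Example: four unit squares packed into leaves of two rectangles each
var demoRects = new List<(double MinX, double MinY, double MaxX, double MaxY)>
{
    (0, 0, 1, 1), (0, 10, 1, 11), (10, 0, 11, 1), (10, 10, 11, 11)
};
var demoLeaves = PackLeaves(demoRects, capacity: 2);
Console.WriteLine(demoLeaves.Count); // prints 2
```

A full tree would repeat the same packing on the bounding boxes of the leaves until a single root remains, and queries would then descend only into nodes whose bounding box intersects the search rectangle.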