BobLd / Documentlayoutanalysis

Document Layout Analysis resources repos for development with PdfPig.

Document Layout Analysis repos for development with PdfPig.

From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.

In this repo, we do not consider scanned documents but classic PDF documents, and we leverage all the information available in them (e.g. letter bounding boxes, fonts).

Related projects

Resources

Text extraction

Word segmentation

Page segmentation

Recursive XY Cut code PdfPig

The X-Y cut segmentation algorithm, also referred to as the recursive X-Y cuts (RXYC) algorithm, is a tree-based top-down algorithm. The root of the tree represents the entire document page. All the leaf nodes together represent the final segmentation. The RXYC algorithm recursively splits the document into two or more smaller rectangular blocks which represent the nodes of the tree. At each step of the recursion, the horizontal and vertical projection profiles of each node are computed. Then, the valleys along the horizontal and vertical directions, VX and VY, are compared to corresponding predefined thresholds TX and TY. If the valley is larger than the threshold, the node is split at the mid-point of the wider of VX and VY into two children nodes. The process continues until no leaf node can be split further. Then, noise regions are removed using noise removal thresholds TnX and TnY.
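As an illustration, the recursion described above can be sketched in a few lines of Python. This is a minimal sketch, not the PdfPig implementation: it works directly on bounding boxes (as available in a PDF) instead of pixel projection profiles, it skips the noise-removal step, and the names `widest_gap`, `xy_cut`, `tx` and `ty` are hypothetical. When only one direction exceeds its threshold, the sketch splits along that direction, which is an assumption on top of the description above.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top)


def widest_gap(intervals: List[Tuple[float, float]]) -> Tuple[float, Optional[float]]:
    """Return (width, mid-point) of the widest empty gap between 1-D intervals."""
    best_width, best_mid = 0.0, None
    ordered = sorted(intervals)
    reach = ordered[0][1]  # right-most coordinate covered so far
    for start, end in ordered[1:]:
        if start - reach > best_width:
            best_width, best_mid = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_width, best_mid


def xy_cut(boxes: List[Box], tx: float, ty: float) -> List[List[Box]]:
    """Recursively split boxes into blocks; each returned list is one leaf node."""
    if len(boxes) < 2:
        return [boxes] if boxes else []
    vx, cut_x = widest_gap([(b[0], b[2]) for b in boxes])  # valley along x
    vy, cut_y = widest_gap([(b[1], b[3]) for b in boxes])  # valley along y
    can_x, can_y = vx > tx, vy > ty
    if not (can_x or can_y):
        return [boxes]  # no valley exceeds its threshold: this node is a leaf
    if can_x and (not can_y or vx >= vy):
        first = [b for b in boxes if b[2] <= cut_x]
        second = [b for b in boxes if b[0] >= cut_x]
    else:
        first = [b for b in boxes if b[3] <= cut_y]
        second = [b for b in boxes if b[1] >= cut_y]
    return xy_cut(first, tx, ty) + xy_cut(second, tx, ty)


# Two columns, each holding two boxes separated by a vertical gap of 10.
page = [(0, 0, 10, 10), (0, 20, 10, 30), (30, 0, 40, 10), (30, 20, 40, 30)]
columns = xy_cut(page, tx=5, ty=15)  # the vertical gap is below ty: 2 blocks
cells = xy_cut(page, tx=5, ty=5)     # the vertical gap exceeds ty: 4 blocks
```

Because each cut falls inside an empty gap, every box lands entirely on one side of it, so both children are non-empty and the recursion always terminates.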

Docstrum code PdfPig code PdfPig

The Docstrum algorithm by O'Gorman is a bottom-up approach based on nearest-neighbor clustering of connected components extracted from the document image. After noise removal, the connected components are separated into two groups, one with dominant characters and another one with characters in titles and section headings, using a character size ratio factor fd. Then, K nearest neighbors are found for each connected component. Then, text-lines are found by computing the transitive closure on within-line nearest neighbor pairings using a threshold ft. Finally, text-lines are merged to form text blocks using a parallel distance threshold fpa and a perpendicular distance threshold fpe.
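The text-line step can be illustrated with a small Python sketch. It is a simplified illustration under several assumptions, not the published algorithm or the PdfPig code: each component is reduced to its centroid, the text is assumed to be roughly horizontal (the real method estimates orientation from the neighbor angles), the typical within-line distance is taken as the median of the within-line neighbor distances, and the character-size grouping and the block-merging step (fpa, fpe) are left out. The names `docstrum_lines`, `max_angle_deg` and `ft` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def docstrum_lines(points: List[Point], k: int = 2,
                   max_angle_deg: float = 30.0, ft: float = 3.0) -> List[List[int]]:
    """Group component centroids into text-lines; returns lists of point indices."""
    n = len(points)
    pairs = []  # (i, j, distance, angle folded into [0, 90] degrees)
    for i, (xi, yi) in enumerate(points):
        # Brute-force K nearest neighbours of component i.
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            dx, dy = points[j][0] - xi, points[j][1] - yi
            angle = abs(math.degrees(math.atan2(dy, dx)))
            pairs.append((i, j, math.hypot(dx, dy), min(angle, 180.0 - angle)))

    # Within-line pairings: neighbours lying close to the horizontal direction.
    within = [p for p in pairs if p[3] <= max_angle_deg]
    if not within:
        return [[i] for i in range(n)]
    distances = sorted(p[2] for p in within)
    typical = distances[len(distances) // 2]  # median as the typical spacing

    # Transitive closure of the accepted pairings, using union-find.
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, dist, _ in within:
        if dist <= ft * typical:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two horizontal lines of three components each, 15 units apart.
centroids = [(0, 0), (10, 0), (20, 0), (0, 15), (10, 15), (20, 15)]
lines = docstrum_lines(centroids)  # [[0, 1, 2], [3, 4, 5]]
```

The union-find structure is one convenient way to compute the transitive closure: two components end up in the same line as soon as a chain of accepted pairings connects them.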

Voronoi

The Voronoi-diagram based segmentation algorithm by Kise et al. is also a bottom-up algorithm. In the first step, it extracts sample points from the boundaries of the connected components using a sampling rate sr. Then, noise removal is done using a maximum noise zone size threshold nm, in addition to width, height, and aspect ratio thresholds. After that, the Voronoi diagram is generated using sample points obtained from the borders of the connected components. Superfluous Voronoi edges are deleted using a criterion involving the area ratio threshold ta, and the inter-line spacing margin control factor fr. Since we evaluate all algorithms on document pages with Manhattan layouts, a modified version of the algorithm is used to generate rectangular zones.
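Only the first step, border sampling, is sketched below in Python, adapted to the PDF setting where each component is a letter bounding box and not a pixel region. The function name `sample_border` and the `step` parameter (a spacing in page units, standing in for the sampling rate sr) are assumptions made for the illustration. Building the Voronoi diagram and pruning its edges are not shown.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top)
Point = Tuple[float, float]


def sample_border(box: Box, step: float) -> List[Point]:
    """Return points spaced roughly `step` apart along the border of `box`."""
    left, bottom, right, top = box
    corners = [(left, bottom), (right, bottom), (right, top), (left, top)]
    points = []
    # Walk the four sides counter-clockwise, starting at the bottom-left corner.
    for (x0, y0), (x1, y1) in zip(corners, corners[1:] + corners[:1]):
        length = abs(x1 - x0) + abs(y1 - y0)  # sides are axis-aligned
        count = max(1, int(length // step))   # at least the starting corner
        for i in range(count):
            t = i / count
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


# A 4 x 2 box sampled every unit gives 12 border points.
samples = sample_border((0, 0, 4, 2), step=1)
```

Each side contributes its starting corner but not its end corner, so corners are not duplicated for boxes with a non-zero width and height.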

Constrained text-line detection code PdfPig

The layout analysis approach by Breuel finds text-lines in a two-step process:

  1. Find tall whitespace rectangles and evaluate them as candidates for gutters, column separators, etc. The algorithm for finding maximal empty whitespace is described in Breuel. The whitespace rectangles are returned in order of decreasing quality and are allowed a maximum overlap of Om.
  2. The whitespace rectangles representing the columns are used as obstacles in a robust least-squares, globally optimal text-line detection algorithm. Then, the bounding box of all the characters making the text-line is computed. The method was merely intended by its author as a demonstration of the application of two geometric algorithms, and not as a complete layout analysis system; nevertheless, we included it in the comparison because it has already proven useful in some applications. It is also nearly parameter free and resolution independent.
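The whitespace search of step 1 can be sketched as a best-first branch-and-bound in Python. This is a minimal sketch under stated assumptions: the quality of a rectangle is taken to be its area (the approach described above favors tall rectangles for column separators), only the single best rectangle is returned, and the overlap limit Om and the text-line fitting of step 2 are left out. The names `largest_whitespace`, `overlaps` and `area` are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top)


def overlaps(a: Box, b: Box) -> bool:
    """True when the interiors of a and b intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def area(r: Box) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def largest_whitespace(bound: Box, obstacles: List[Box]) -> Optional[Box]:
    """Return the largest-area rectangle inside `bound` that overlaps no obstacle."""
    # Max-heap on area, which is an upper bound on what a region can contain.
    heap = [(-area(bound), bound, [o for o in obstacles if overlaps(o, bound)])]
    while heap:
        _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # empty region with the best bound: optimal
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        # Pivot: the obstacle closest to the centre of the current region.
        pivot = min(obs, key=lambda o: ((o[0] + o[2]) / 2 - cx) ** 2
                    + ((o[1] + o[3]) / 2 - cy) ** 2)
        candidates = [
            (r[0], r[1], pivot[0], r[3]),  # left of the pivot
            (pivot[2], r[1], r[2], r[3]),  # right of the pivot
            (r[0], r[1], r[2], pivot[1]),  # below the pivot
            (r[0], pivot[3], r[2], r[3]),  # above the pivot
        ]
        for sub in candidates:
            if area(sub) > 0:
                remaining = [o for o in obs if overlaps(o, sub)]
                heapq.heappush(heap, (-area(sub), sub, remaining))
    return None


# Two text columns on a 100 x 100 page leave a gutter between x = 40 and x = 60.
gutter = largest_whitespace((0, 0, 100, 100), [(0, 0, 40, 100), (60, 0, 100, 100)])
```

Every sub-region excludes the pivot, so the obstacle list shrinks at each level and the search terminates. Swapping `area` for a quality function that rewards height would bias the result toward column separators.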

PDF/A standard

A PDF/A-1a compliant document makes the following information available:

  1. Language specification
  2. Hierarchical document structure
  3. Tagged text spans and descriptive text for images and symbols
  4. Character mappings to Unicode

Zone classification/extraction & Reading order

Reading order

Table

Systems

Sparse line

Chart and diagram

Mathematical expression

Margins recognition

NLP & ML

Pre-trained models

Workshops

Related topics

Bounding boxes

Images

Shape detection

Character Recognition

Layout Similarity

Dehyphenation

Data structure

Datasets

Output file format

Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

Pdf page to image converter

A PDF page to image converter is available to help in the research process. It relies on the MuPDF library, available in the SumatraPDF reader.

Pdf layout analysis viewer

A PDF layout analysis viewer is also available. It relies on the MuPDF library as well.
