axa-group / Parsr

Licence: apache-2.0

Transforms PDF, Documents and Images into Enriched Structured Data

Programming Languages

javascript

typescript

Projects that are alternatives of or similar to Parsr

Scanbot Sdk Example Android

Document scanning SDK example apps for the Scanbot SDK for Android.

Stars: ✭ 67 (-97.55%)

Mutual labels: document, pdf, ocr

Pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Stars: ✭ 1,969 (-28.03%)

Mutual labels: pdf, ocr

Openpdf

OpenPDF is a free Java library for creating and editing PDF files with a LGPL and MPL open source license. OpenPDF is based on a fork of iText. We welcome contributions from other developers. Please feel free to submit pull-requests and bugreports to this GitHub repository. ⛺

Stars: ✭ 2,174 (-20.54%)

Mutual labels: hacktoberfest, pdf

Free Ai Resources

🚀 FREE AI Resources - 🎓 Courses, 👷 Jobs, 📝 Blogs, 🔬 AI Research, and many more - for everyone!

Stars: ✭ 192 (-92.98%)

Mutual labels: hacktoberfest, data

Ambar

🔍 Ambar: Document Search Engine

Stars: ✭ 1,829 (-33.15%)

Mutual labels: pdf, ocr

Imgpush

Minimalist Self-hosted Image Service for user submitted images in your app

Stars: ✭ 144 (-94.74%)

Mutual labels: hacktoberfest, images

Hms Ml Demo

HMS ML Demo provides an example of integrating Huawei ML Kit service into applications. This example demonstrates how to integrate services provided by ML Kit, such as face detection, text recognition, image segmentation, asr, and tts.

Stars: ✭ 187 (-93.17%)

Mutual labels: document, ocr

Textrecognitiondatagenerator

A synthetic data generator for text recognition

Stars: ✭ 2,075 (-24.16%)

Mutual labels: data, ocr

React Native Pdfview

📚 PDF viewer for React Native

Stars: ✭ 198 (-92.76%)

Mutual labels: hacktoberfest, pdf

Paperwork

Personal document manager (Linux/Windows) -- Moved to Gnome's Gitlab

Stars: ✭ 2,392 (-12.57%)

Mutual labels: pdf, ocr

Pdf Lib

Create and modify PDF documents in any JavaScript environment

Stars: ✭ 3,426 (+25.22%)

Mutual labels: document, pdf

Educative.io Downloader

📖 This tool is to download course from educative.io for offline usage. It uses your login credentials and download the course.

Stars: ✭ 139 (-94.92%)

Mutual labels: hacktoberfest, pdf

Datasets

🎁 3,000,000+ Unsplash images made available for research and machine learning

Stars: ✭ 1,805 (-34.03%)

Mutual labels: data, images

Lambda Text Extractor

AWS Lambda functions to extract text from various binary formats.

Stars: ✭ 159 (-94.19%)

Mutual labels: pdf, ocr

Stegbrute

Fast Steganography bruteforce tool written in Rust useful for CTF's

Stars: ✭ 134 (-95.1%)

Mutual labels: hacktoberfest, images

Open Semantic Etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database

Stars: ✭ 165 (-93.97%)

Mutual labels: pdf, ocr

Open Paperless

Scan, index, and archive all of your paper documents (acquired by Mayan EDMS)

Stars: ✭ 2,538 (-7.24%)

Mutual labels: pdf, ocr

Geeksforgeeksscrapper

Scrapes g4g and creates PDF

Stars: ✭ 124 (-95.47%)

Mutual labels: hacktoberfest, pdf

Etherpad Lite

Etherpad: A modern really-real-time collaborative document editor.

Stars: ✭ 11,937 (+336.29%)

Mutual labels: document, pdf

Climate Change Data

🌍 A curated list of APIs, open data and ML/AI projects on climate change

Stars: ✭ 195 (-92.87%)

Mutual labels: hacktoberfest, data

Turn your documents into data!

Français | Portuguese | Spanish | 中文

Parsr is a minimal-footprint document (image, pdf, docx, eml) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.
It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival, among many others.
Currently, Parsr can perform: document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, table of contents, page numbers, headers/footers, links, and others. Check out all the features.

Table of Contents
Getting Started
- Installation
- Usage
Documentation
Contribute
Third Party Licenses
License

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
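
As a rough illustration of talking to the running API from Python, the sketch below uploads a document together with a configuration using only the standard library. The endpoint path `/api/v1/document`, the form field names `file` and `config`, and the shape of the response are assumptions and are not confirmed by this README, so check them against the API documentation before relying on them. The helper names (`endpoint`, `build_multipart`, `submit_document`) are hypothetical.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Tuple

BASE_URL = "http://localhost:3001"  # address used by the docker run command above
API_PREFIX = "/api/v1"              # assumption: verify in the API documentation


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL, the assumed API prefix and a resource path."""
    return f"{base_url.rstrip('/')}{API_PREFIX}/{path.lstrip('/')}"


def build_multipart(fields: Dict[str, Tuple[str, bytes, str]]) -> Tuple[bytes, str]:
    """Encode {field: (filename, content, mime_type)} as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, (filename, content, mime_type) in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {mime_type}\r\n\r\n"
        )
        chunks.append(header.encode() + content + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def submit_document(file_path: str, config: dict, base_url: str = BASE_URL) -> str:
    """POST a PDF and its configuration; return the raw response body.

    The field names "file" and "config" are assumptions (see lead-in).
    """
    source = Path(file_path)
    body, content_type = build_multipart({
        "file": (source.name, source.read_bytes(), "application/pdf"),
        "config": ("config.json", json.dumps(config).encode(), "application/json"),
    })
    request = urllib.request.Request(
        endpoint("document", base_url),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode().strip()
```

With the API container running, a call such as `submit_document("sample.pdf", config)` would send the file, where `config` is a dictionary loaded from a configuration file written as described in the Configuration documentation. How the returned value is then used to fetch results is left to the API documentation.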

To install the Python client for the Parsr API, issue:
```
pip install parsr-client
```
To try out the Jupyter Notebook that uses the Python client, head over to the Jupyter demo.

To use the GUI tool (the API needs to already be running), issue:
```
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
```
Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are both documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Parsr depends on the following third-party libraries, listed with their licenses:

QPDF: Apache http://qpdf.sourceforge.net
ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
Camelot: MIT https://github.com/camelot-dev/camelot
MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Parsr is licensed under the Apache 2.0 license.
