Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database

Stars: ✭ 165 (-90.98%)

Mutual labels: pdf, ocr

Pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Stars: ✭ 1,969 (+7.65%)

Mutual labels: pdf, ocr

Open Paperless

Scan, index, and archive all of your paper documents (acquired by Mayan EDMS)

Stars: ✭ 2,538 (+38.76%)

Mutual labels: pdf, ocr

Lambda Text Extractor

AWS Lambda functions to extract text from various binary formats.

Stars: ✭ 159 (-91.31%)

Mutual labels: pdf, ocr

Lolcate Rs

Lolcate -- A comically fast way of indexing and querying your filesystem. Replaces locate / mlocate / updatedb. Written in Rust.

Stars: ✭ 191 (-89.56%)

Mutual labels: search, search-engine

Rusticsearch

Lightweight Elasticsearch compatible search server.

Stars: ✭ 171 (-90.65%)

Mutual labels: search, search-engine

Tntsearch

A fully featured full text search engine written in PHP

Stars: ✭ 2,693 (+47.24%)

Mutual labels: search, search-engine

Blast

Blast is a full text search and indexing server, written in Go, built on top of Bleve.

Stars: ✭ 934 (-48.93%)

Mutual labels: search, search-engine

Whoogle Search

A self-hosted, ad-free, privacy-respecting metasearch engine

Stars: ✭ 4,645 (+153.96%)

Mutual labels: search, search-engine

Magnetissimo

Web application that indexes all popular torrent sites and saves the results to a local database.

Stars: ✭ 2,551 (+39.48%)

Mutual labels: self-hosted, search-engine

Pdfocr

Adds text to PDF files using the cuneiform OCR software

Stars: ✭ 287 (-84.31%)

Mutual labels: pdf, ocr

Search Ui

🔍 A set of UI components to build a fully customized search!

Stars: ✭ 24 (-98.69%)

Mutual labels: search, search-engine

Searx

Privacy-respecting metasearch engine

Stars: ✭ 10,074 (+450.79%)

Mutual labels: search, search-engine

Simpleaudioindexer

Search for the times (in seconds) at which words, phrases, or arbitrary regex patterns occur within audio files

Stars: ✭ 100 (-94.53%)

Mutual labels: search, search-engine

Search Online

🔍 A simple extension for VSCode to search online easily using a search engine.

Stars: ✭ 115 (-93.71%)

Mutual labels: search, search-engine

Cpp Image Analysis

DataCore bot image analysis component

Stars: ✭ 125 (-93.17%)

Mutual labels: ocr

Rapipdf

PDF generation from OpenAPI / Swagger Spec

Stars: ✭ 132 (-92.78%)

Mutual labels: pdf

Ffind

A sane replacement for find

Stars: ✭ 124 (-93.22%)

Mutual labels: search

Pinry

The open-source core of Pinry, a tiling image board system for people who want to save, tag, and share images, videos and webpages in an easy to skim through format.

Stars: ✭ 1,819 (-0.55%)

Mutual labels: self-hosted

Transformer str

PyTorch implementation of my new method for Scene Text Recognition (STR) based on Transformer. Equipped with Transformer, this method outperforms the best model of the aforementioned deep-text-recognition-benchmark by 7.6% on CUTE80.

Stars: ✭ 131 (-92.84%)

Mutual labels: ocr

Lucenenet

Apache Lucene.NET

Stars: ✭ 1,704 (-6.83%)

Mutual labels: search

Search Engine Optimization

🔍 A helpful checklist/collection of Search Engine Optimization (SEO) tips and techniques.

Stars: ✭ 1,798 (-1.69%)

Mutual labels: search

Geeksforgeeksscrapper

Scrapes g4g and creates PDF

Stars: ✭ 124 (-93.22%)

Mutual labels: pdf

Ptext Release

pText is a library for reading, creating and manipulating PDF files in Python.

Stars: ✭ 124 (-93.22%)

Mutual labels: pdf

Algoliasearch Client Python

⚡️ A fully-featured and blazing-fast Python API client to interact with Algolia.

Stars: ✭ 138 (-92.45%)

Mutual labels: search

Alfred Ocr

OCR & Translate using multiple interfaces for Alfred Workflow

Stars: ✭ 136 (-92.56%)

Mutual labels: ocr

Pdfview Android

Small Android library to show PDF files

Stars: ✭ 132 (-92.78%)

Mutual labels: pdf

The Economist Ebooks

Free download and subscription (Kindle delivery) for English-language magazines, including The Economist (with audio), The New Yorker, Nature, New Scientist, The Guardian, Scientific American, Wired, The Atlantic, Newsweek, and National Geographic. Supports epub, mobi, and pdf formats; updated weekly.

Stars: ✭ 3,471 (+89.78%)

Mutual labels: pdf

Querqy

Query preprocessor for Java-based search engines (Querqy Core and Solr implementation)

Stars: ✭ 122 (-93.33%)

Mutual labels: search-engine

Documents

A collection of books and documents related to software development, mostly in PDF format. Forks and stars are welcome.

Stars: ✭ 130 (-92.89%)

Mutual labels: pdf

Pdf2image

A utility for converting pdf to image and base64 format.

Stars: ✭ 122 (-93.33%)

Mutual labels: pdf

Endesive

en-crypt, de-crypt, si-gn, ve-rify - smime, pdf, xades and plain files in pure python

Stars: ✭ 122 (-93.33%)

Mutual labels: pdf

Vue Innersearch

🔎 UI components built with Vue.js for ElasticSearch

Stars: ✭ 135 (-92.62%)

Mutual labels: search

Algoliasearch Magento 2

Algolia Search integration for Magento 2 - compatible with versions from 2.3.x to 2.4.x

Stars: ✭ 131 (-92.84%)

Mutual labels: search

Typefont

The first open-source library that detects the font of a text in an image.

Stars: ✭ 1,575 (-13.89%)

Mutual labels: ocr

Pdfboxing

Nice wrapper of PDFBox in Clojure

Stars: ✭ 122 (-93.33%)

Mutual labels: pdf

Spacextract

Extraction and analysis of telemetry from rocket launch webcasts (from SpaceX and RocketLab)

Stars: ✭ 131 (-92.84%)

Mutual labels: ocr

Tesseract Ocr for windows

Visual Studio Projects for Tesseract and dependencies

Stars: ✭ 122 (-93.33%)

Mutual labels: ocr

Search Engine Google

🕷 Google client for SERPS

Stars: ✭ 138 (-92.45%)

Mutual labels: search-engine

Youtube Scrape

Scrape YouTube searches (API)

Stars: ✭ 122 (-93.33%)

Mutual labels: search

Robin

RObust document image BINarization

Stars: ✭ 131 (-92.84%)

Mutual labels: ocr

Idcardocr china

Camera-based scanning and recognition of second-generation Chinese ID cards, built on Tesseract

Stars: ✭ 122 (-93.33%)

Mutual labels: ocr

Chromehtmltopdf

Convert HTML to PDF with Chrome

Stars: ✭ 122 (-93.33%)

Mutual labels: pdf

Collector Http

Norconex HTTP Collector is a flexible web crawler for collecting, parsing, and manipulating data from the Internet (or Intranet) to various data repositories such as search engines.

Stars: ✭ 130 (-92.89%)

Mutual labels: search-engine

Trienet

.NET Implementations of Trie Data Structures for Substring Search, Auto-completion and Intelli-sense. Includes: patricia trie, suffix trie and a trie implementation using Ukkonen's algorithm.

Stars: ✭ 122 (-93.33%)

Mutual labels: search

Easyadapter

RecyclerView adapter library: create an adapter in just 3 lines of code

Stars: ✭ 122 (-93.33%)

Mutual labels: search

Easyocr

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

Stars: ✭ 13,379 (+631.49%)

Mutual labels: ocr

61-120 of 1787 similar projects