dbmdz / solr-ocrpayload-plugin

Licence: MIT License
Efficient indexing and retrieval of OCR bounding boxes in Solr

🚧 Deprecated in favor of solr-ocrhighlighting

Solr OCR Coordinate Payload Plugin

Efficient indexing and bounding-box "highlighting" for OCR text

tl;dr

  • Store OCR bounding box information and token position directly in the Solr index in a space-efficient manner
  • Retrieve bounding box and token position directly in your Solr query results, no additional parsing necessary

Indexing:

The OCR information is appended after each token as a concatenated list of <key>:<val> pairs, see further down for a detailed description of available keys.

POST /solr/mycore/update

[{ "id": "test_document",
   "ocr_text": "this|p:13,l:5,n:6,x:11.1,y:22.2,w:33.3,h:44.4 is|p:13,l:5,n:7,x:22.2,y:33.3,w:44.4,h:55.5 a|p:13,l:5,n:8,x:33.3,y:33.3,w:44.4,h:55.5 test|p:13,l:5,n:9,x:44.4,y:33.3,w:44.4,h:55.5" }]

Querying:

The plugin adds a new top-level key (ocr_highlight in this case) that contains the OCR information for each matching token as a structured object.

GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&wt=json&q=test

{
  "responseHeader": "...",
  "response": {
    "numFound": 1,
    "docs": [{"id": "test_document"}]
  },
  "ocr_highlight":{
    "test_document":{
      "ocr_text":[{
          "term":"test",
          "page":13,
          "line": 5,
          "word": 9,
          "x":0.444,
          "y":0.333,
          "width":0.444,
          "height":0.555}]
    }
  }
}
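
To draw the highlights on a page image, the relative values in the response have to be scaled to pixels. The following Python sketch assumes that the response was produced with absoluteCoordinates="false", so that x, y, width and height are fractions of the page size as in the example above. The function name highlight_boxes and the image dimensions are hypothetical and not part of the plugin.

```python
def highlight_boxes(response, field, image_width, image_height):
    """Yield (doc_id, term, page, (left, top, width, height)) in pixels.

    Assumes relative coordinates, i.e. fractions between 0 and 1.
    """
    for doc_id, fields in response.get("ocr_highlight", {}).items():
        for match in fields.get(field, []):
            box = (
                round(match["x"] * image_width),
                round(match["y"] * image_height),
                round(match["width"] * image_width),
                round(match["height"] * image_height),
            )
            yield doc_id, match["term"], match.get("page"), box


# The "ocr_highlight" part of the sample response shown above.
response = {
    "ocr_highlight": {
        "test_document": {
            "ocr_text": [{"term": "test", "page": 13, "line": 5, "word": 9,
                          "x": 0.444, "y": 0.333,
                          "width": 0.444, "height": 0.555}]
        }
    }
}

# Hypothetical page image of 1000 x 2000 pixels.
boxes = list(highlight_boxes(response, "ocr_text", 1000, 2000))
print(boxes)
```

The page value is read with match.get() because the page index is optional and only present if it was enabled at indexing time.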

Use Case

At the Bavarian State Library, we try to provide full-text search over all of our OCRed content. In addition to obtaining matching documents, the user should also get a small snippet of the corresponding part of the page image, with the matching words highlighted, similar to what e.g. Google Books provides.

Approaches

For this to work, we need some way of mapping matching tokens to their corresponding location in the underlying OCR text. A common approach used by a number of libraries is to use a secondary microservice for this that takes as input a document identifier and a text snippet and will return all coordinates of matching text snippets on the page. While this approach generally works okay, it has several drawbacks:

  • Performance: Every snippet requires a query to the OCR service, which itself has to do a linear scan through the OCR document. For e.g. a result set of 100 snippets, this will result in 101 queries (initial Solr query and 100 snippet queries). Of course this can be optimized by batching and having a good index structure for the coordinate lookup, but it's still less than ideal.
  • Storage: To reliably be able to map text matches to the base text, you have to store a copy of the full text in the index, alongside the regular index. This blows up the index size significantly. Forgoing storing the text and only using the normalized terms from the index for matching will break the mapping to OCR, since depending on the analyzer configuration, Lucene will perform stemming, etc.

Alternatively, you could also store the coordinates directly as strings in the index. This works by e.g. indexing each token as <token>|<coordinates> and telling Lucene to ignore everything after the pipe during analysis. As the full text of the document is stored, you will get back a series of these annotated tokens as query results and can then parse the coordinates from your highlighting information. This solves the Performance part of the above approach, but worsens the Storage problem: For every token, we now have to store not only the token itself, but an expensive coordinate string as well.

Our Approach

This plugin uses a similar approach to the above, but solves the Storage problem by using an efficient binary format to store the OCR coordinate information in the index: We use bit-packing to combine a number of OCR coordinate parameters into a byte payload, which is not stored in the field itself, but as an associated Lucene Payload:

  • x, y, w, h: Coordinates of the bounding box on the page as either:
    • absolute unsigned integer offsets between 0 and 2^coordinateBits (see below)
    • relative floating point percentages between 0 and 100 (e.g. x:42.3 for a horizontal offset of 42.3%)
  • pageIndex: Unsigned integer that stores the page index of a token (optional)
  • lineIndex: Unsigned integer that stores the line index of a token (optional)
  • wordIndex: Unsigned integer that stores the word index of a token (optional)

For each of these values, you can configure the number of bits the plugin should use to store them, or disable certain parameters entirely. This allows you to fine-tune the settings to your needs. In our case, for example, we use these values: 4 * 12 bits (coordinates) + 9 bits (word index) + 11 bits (line index) + 12 bits (page index), resulting in an 80-bit (10-byte) payload per token. A comparable string representation p0l0n0x000y000w000h000 would take at least 22 bytes, so we save more than 50% for every token.
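
The bit-packing idea can be illustrated with a short Python sketch. It uses the bit widths from the example above, but the order of the fields inside the payload and the big-endian byte layout are assumptions made for illustration; the plugin's actual binary format may differ. The names LAYOUT, pack and unpack are hypothetical.

```python
# Bit widths from the example configuration; the field ORDER is an assumption.
LAYOUT = [("page", 12), ("line", 11), ("word", 9),
          ("x", 12), ("y", 12), ("w", 12), ("h", 12)]
TOTAL_BITS = sum(bits for _, bits in LAYOUT)   # 80 bits
PAYLOAD_BYTES = (TOTAL_BITS + 7) // 8          # 10 bytes


def pack(values):
    """Pack unsigned integers into one byte string, first field in the top bits."""
    packed = 0
    for name, bits in LAYOUT:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        packed = (packed << bits) | value
    return packed.to_bytes(PAYLOAD_BYTES, "big")


def unpack(payload):
    """Reverse of pack(): peel the fields off again, last field first."""
    packed = int.from_bytes(payload, "big")
    values = {}
    for name, bits in reversed(LAYOUT):
        values[name] = packed & ((1 << bits) - 1)
        packed >>= bits
    return values


box = {"page": 13, "line": 12, "word": 30,
       "x": 512, "y": 1024, "w": 3192, "h": 256}
payload = pack(box)
print(len(payload), unpack(payload) == box)
```

Every field only occupies as many bits as configured, which is where the saving over the 22-byte string representation comes from.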

At query time, we then retrieve the payload for each matching token and put the decoded information into the ocr_highlight result key that can be directly used without having to do any additional parsing.

Usage

Installation

Download the latest release from GitHub and put the JAR into your $SOLR_HOME/$SOLR_CORE/lib/ directory.

Indexing configuration

To use it, first add the DelimitedOcrInfoPayloadTokenFilterFactory filter to your analyzer chain (e.g. for a text_ocr field type):

<fieldtype name="text_ocr" class="solr.TextField" omitTermFreqAndPositions="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="de.digitalcollections.lucene.analysis.util.DelimitedOcrInfoPayloadTokenFilterFactory"
            delimiter="☞" absoluteCoordinates="false" coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

The filter takes the following parameters:

  • delimiter: Character used for delimiting the payload from the token in the input document (default: |)
  • absoluteCoordinates: true or false to configure whether the stored coordinates are absolute
  • coordinateBits: Number of bits to use for encoding OCR coordinates in the index. (mandatory)
    A value of 10 (default) is recommended, which keeps coordinates accurate to approximately two decimal places.
  • wordBits: Number of bits to use for encoding the word index.
    Set to 0 (default) to disable storage of the word index.
  • lineBits: Number of bits to use for encoding the line index.
    Set to 0 (default) to disable storage of the line index.
  • pageBits: Number of bits to use for encoding the page index.
    Set to 0 (default) to disable storage of the page index.
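
To get a feeling for what a given coordinateBits value means for precision, the following Python sketch quantizes a relative coordinate to an unsigned integer and back. The linear scaling used here is an assumption chosen for illustration and is not necessarily the scheme the plugin implements; the function names are hypothetical.

```python
def quantize(fraction, bits):
    """Map a fraction in [0, 1] to an unsigned integer that fits in `bits` bits."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return round(fraction * ((1 << bits) - 1))


def dequantize(value, bits):
    """Map the stored integer back to a fraction in [0, 1]."""
    return value / ((1 << bits) - 1)


def max_error(bits):
    """Worst-case round-trip error: half of one quantization step."""
    return 0.5 / ((1 << bits) - 1)


stored = quantize(0.50712, 10)
restored = dequantize(stored, 10)
print(stored, restored, max_error(10))
```

Under this assumption, 10 bits give a worst-case error of roughly 0.0005 of the page width or height, which is in line with the recommendation above.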

The filter expects an input payload after the configured delimiter in the input stream, with the payload being a pseudo-JSON structure (e.g. k1:1,k2:3) with the following keys:

  • p: Page index (if pageBits > 0)
  • l: Line index (if lineBits > 0)
  • n: Word index (if wordBits > 0)
  • x, y, w, h: Coordinates of the OCR box as floating point percentages or integers (if absoluteCoordinates)

As an example, consider the token foobar with a relative OCR box of (0.50712, 0.31432, 0.87148, 0.05089) (i.e. with absoluteCoordinates="false"), ☞ as the configured delimiter, and stored indices for the word (30), line (12) and page (13). The resulting input token is foobar☞p:13,l:12,n:30,x:50.7,y:31.4,w:87.1,h:5.1.

Alternatively, with absoluteCoordinates="true", an OCR box of (512, 1024, 3192, 256) and otherwise the same settings: foobar☞p:13,l:12,n:30,x:512,y:1024,w:3192,h:256.
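
If your OCR data is processed with Python before indexing, the annotated tokens from the two examples above could be generated along these lines. This is a minimal sketch: the function name format_token is hypothetical, and rounding relative values to one decimal place simply follows the example rather than any documented requirement.

```python
def format_token(term, box, page=None, line=None, word=None,
                 absolute=False, delimiter="|"):
    """Build '<term><delimiter><payload>' for one OCR word.

    box is (x, y, w, h): fractions between 0 and 1 for relative
    coordinates, or integers if absolute=True.
    """
    parts = []
    # Indices are optional; only emit the keys that are actually used.
    for key, index in (("p", page), ("l", line), ("n", word)):
        if index is not None:
            parts.append(f"{key}:{index}")
    for key, value in zip("xywh", box):
        if absolute:
            parts.append(f"{key}:{int(value)}")
        else:
            # Relative values are written as percentages.
            parts.append(f"{key}:{value * 100:.1f}")
    return term + delimiter + ",".join(parts)


relative = format_token("foobar", (0.50712, 0.31432, 0.87148, 0.05089),
                        page=13, line=12, word=30, delimiter="☞")
absolute = format_token("foobar", (512, 1024, 3192, 256),
                        page=13, line=12, word=30,
                        absolute=True, delimiter="☞")
print(relative)
print(absolute)
```

The field value for a document is then the annotated tokens joined by whitespace, since the analyzer chain above uses the WhitespaceTokenizerFactory.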

Finally, you just have to configure your schema to use the field type defined above. Storing the content is not recommended, since it significantly increases the index size and is not used at all for querying and highlighting:

<field name="ocr_text" type="text_ocr" indexed="true" stored="false" />

Highlighting configuration

To enable highlighting using the OCR payloads, add the OcrHighlighting component to your Solr configuration and configure it with the same absoluteCoordinates, coordinateBits, wordBits, lineBits and pageBits values that were used for the filter in the analyzer chain:

<config>
  <searchComponent name="ocr_highlight"
                   class="de.digitalcollections.solr.plugin.components.ocrhighlighting.OcrHighlighting"
                   absoluteCoordinates="false" coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />
                   
  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <arr name="last-components">
      <str>ocr_highlight</str>
    </arr>
  </requestHandler>
</config>

Now at query time, you can just set the ocr_hl=true parameter, specify the fields you want highlighted via ocr_hl.fields=myfield,myotherfield and retrieve highlighted matches with their OCR coordinates:

GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&q=augsburg&wt=json

{
  "responseHeader":{
    "status":0,
    "QTime":158},
  "response":{"numFound":526,"start":0,"docs":[
      {
        "id":"bsb10502835"},
      {
        "id":"bsb11032147"},
      {
        "id":"bsb10485243"},
      ...]
  },
  "ocr_highlight":{
    "bsb10502835":{
      "ocr_text":[{
          "page":7,
          "position":9,
          "term":"augsburg",
          "x":0.111,
          "y":0.062,
          "width":0.075,
          "height":0.013},
        {
          "page":7,
          "position":264,
          "term":"augsburg",
          "x":0.320,
          "y":0.670,
          "width":0.099,
          "height":0.012},
        ...]},
    ...
  }
}
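
From a client, the request above can be assembled with the Python standard library. This sketch only builds the URL; the base URL http://localhost:8983 and the function name build_query_url are assumptions for illustration, while the parameter names are the ones used in the request above.

```python
from urllib.parse import urlencode


def build_query_url(base_url, core, query, fields):
    """Build a select URL that asks for OCR highlighting on the given fields."""
    params = {
        "q": query,
        "wt": "json",
        "ocr_hl": "true",
        # Several fields are passed as one comma-separated value.
        "ocr_hl.fields": ",".join(fields),
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


url = build_query_url("http://localhost:8983", "mycore", "augsburg", ["ocr_text"])
print(url)
```

The resulting URL can be fetched with any HTTP client, and the ocr_highlight key of the decoded JSON response then holds the matches per document.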

FAQ

  • How does highlighting work with phrase queries?

    You will receive a bounding box object for every individual matching term in the phrase.

  • What are the performance and storage implications of using this plugin?

    Performance: On a single core of an Intel Xeon CPU, we measured (with JMH):

    • Encoding the Payload: 1,484,443.200 Payloads/Second or ~14.2MiB/s with an 80bit payload
    • Decoding the Payload: 1,593,036.372 Payloads/Second or ~15.2MiB/s with an 80bit payload

    Storage: This depends on your configuration. With our sample configuration of an 80 bit payload (see above), the payload overhead is 10 bytes per token. That is, for a corpus size of 10 Million Tokens, you will need approximately 95MiB to store the payloads. The actual storage required might be lower, since Lucene compresses the payloads with LZ4.

  • Does this work with SolrCloud?

    It does! We're running it with SolrCloud ourselves.
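
The storage estimate from the FAQ can be reproduced for other configurations with a few lines of Python. The sketch assumes that the payload is padded to whole bytes and ignores Lucene's compression, so it gives an upper bound; the function names are hypothetical.

```python
def payload_bytes(coordinate_bits, word_bits=0, line_bits=0, page_bits=0):
    """Payload size per token: four coordinates plus the optional indices."""
    total_bits = 4 * coordinate_bits + word_bits + line_bits + page_bits
    return (total_bits + 7) // 8  # assumption: rounded up to whole bytes


def corpus_overhead_mib(num_tokens, bytes_per_token):
    """Uncompressed payload overhead for a corpus, in MiB."""
    return num_tokens * bytes_per_token / 2**20


# Configuration from "Our Approach": 4 * 12 + 9 + 11 + 12 = 80 bits.
size = payload_bytes(12, word_bits=9, line_bits=11, page_bits=12)
overhead = corpus_overhead_mib(10_000_000, size)
print(size, round(overhead, 1))
```

For the configuration shown in the indexing section (coordinateBits="10", pageBits="12", no word or line index), the same calculation yields 52 bits, i.e. 7 bytes per token under the padding assumption.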
