fourdigits / wagtail_textract

License: BSD-3-Clause
Text extraction for Wagtail document search

Text extraction for Wagtail document search

This package replaces Wagtail's Document class with one that makes the contents of Document files searchable, using textract.

Textract can extract text from (among others) PDF, Excel and Word files.

The package was inspired by the "Search: Extract text from documents" issue in Wagtail.

Documents will work as before, except that Document search in Wagtail's admin interface will also find search terms in the files' contents.

Some screenshots to illustrate.

In our fresh Wagtail site with wagtail_textract installed, we uploaded a file called test_document.pdf with handwritten text in it. It is listed in the admin interface under Documents:

[Screenshot: Document List]

If we now search in Documents for the word correct, which is one of the handwritten words, the live search finds it:

[Screenshot: Document Search finds PDF by searching for "staple"]

The assumption is that this search should not only be available in Wagtail's admin interface, but also in a public-facing search view, for which we provide a code example.

Requirements

Maturity

We have been using this package in production since August 2018 on https://nuffic.nl.

Installation

  • Install the Textract dependencies
  • Add wagtail_textract to your requirements and/or pip install wagtail_textract
  • Add wagtail_textract to your Django INSTALLED_APPS.
  • Put WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document" in your Django settings.
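
The two settings steps above could look roughly like this in a settings module. This is a minimal sketch: the surrounding app list is an assumption for illustration, and only the wagtail_textract entry and the WAGTAILDOCS_DOCUMENT_MODEL value come from the steps above.

```python
# settings.py (sketch; the other entries stand in for your existing apps)
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    "wagtail.documents",
    "wagtail.core",
    # Added for text extraction in document search
    "wagtail_textract",
]

# Tell Wagtail to use the Document model provided by wagtail_textract
WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document"
```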

Note: You'll get an incompatibility warning during installation of wagtail_textract (Wagtail 2.0.1 installed):

requests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.
textract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.

We haven't seen this lead to problems, but it's something to keep in mind.

Tesseract

In order to make textract use Tesseract, which happens if regular textract finds no text, you need to add the data files that Tesseract can base its word matching on.

Create a tessdata directory in your project directory, and download the languages you want.
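
As an illustration, the download step could be scripted along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the base URL assumes the language files are fetched from the `main` branch of the tesseract-ocr/tessdata repository on GitHub, so adjust it if you get your data files elsewhere.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: language files come from the tesseract-ocr/tessdata repository.
TESSDATA_BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(language):
    """Build the download URL for a Tesseract language code such as 'eng'."""
    return f"{TESSDATA_BASE_URL}/{language}.traineddata"


def download_language(language, tessdata_dir="tessdata"):
    """Download one language file into tessdata_dir unless it already exists."""
    target_dir = Path(tessdata_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{language}.traineddata"
    if not target.exists():
        urlretrieve(traineddata_url(language), str(target))
    return target
```

Calling `download_language("eng")` from the project directory would then place `eng.traineddata` in `./tessdata/`.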

Transcribing

Transcription is done automatically after Document save, in an asyncio executor to prevent blocking the response during processing.
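
To illustrate the idea of handing slow work to an executor, here is a generic sketch using asyncio's `run_in_executor`. It is not the package's actual code: `slow_transcribe` and `save_document` are hypothetical stand-ins, and the sleep merely simulates a slow text extraction.

```python
import asyncio
import time


def slow_transcribe(path):
    """Stand-in for a blocking text extraction call (hypothetical)."""
    time.sleep(0.1)
    return f"text from {path}"


async def save_document(path):
    """Start the transcription in a worker thread and return immediately."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor
    return loop.run_in_executor(None, slow_transcribe, path)


async def main():
    # The "save" returns a future right away; the work continues in the background
    future = await save_document("test_document.pdf")
    # Here we wait for the result only to show that it arrives eventually
    return await future
```

Running `asyncio.run(main())` returns the simulated transcription once the worker thread has finished.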

To transcribe all existing Documents, run the management command:

./manage.py transcribe_documents

This may take a long time, obviously.

Usage in custom view

Here is a code example for a search view (outside Wagtail's admin interface) that shows both Page and Document results.

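from itertools import chain

from django.shortcuts import render
from wagtail.core.models import Page
from wagtail.documents.models import get_document_model
from wagtail.search.models import Query

# Use the configured Document model (WAGTAILDOCS_DOCUMENT_MODEL)
Document = get_document_model()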


def search(request):
    # Search
    search_query = request.GET.get('query', None)
    if search_query:
        page_results = Page.objects.live().search(search_query)
        document_results = Document.objects.search(search_query)
        search_results = list(chain(page_results, document_results))

        # Log the query so Wagtail can suggest promoted results
        Query.get(search_query).add_hit()
    else:
        search_results = Page.objects.none()

    # Render template
    return render(request, 'website/search_results.html', {
        'search_query': search_query,
        'search_results': search_results,
    })

Your template should handle Documents differently from Pages, because the pageurl tag cannot be used on a Document:

{% if result.file %}
   <a href="{{ result.url }}">{{ result }}</a>
{% else %}
   <a href="{% pageurl result %}">{{ result }}</a>
{% endif %}

What if you already use a custom Document model?

In order to use wagtail_textract, your CustomizedDocument model should do the same as wagtail_textract's Document:

  • subclass TranscriptionMixin
  • alter search_fields

from wagtail.search import index
from wagtail_textract.models import TranscriptionMixin


class CustomizedDocument(TranscriptionMixin, ...):
    """Extra fields and methods for Document model."""
    search_fields = ... + [
        index.SearchField(
            'transcription',
            partial_match=False,
        ),
    ]

Note that the first class to subclass should be TranscriptionMixin, so its save() takes precedence over that of the other parent classes.
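
The effect of the base-class order can be seen in a small, framework-free sketch. All class names here are hypothetical stand-ins rather than the package's classes; the point is only how Python's method resolution order decides which save() runs first.

```python
class FakeDocument:
    """Stand-in for a base model whose save() does not call super()."""

    def save(self):
        return ["base.save"]


class FakeTranscriptionMixin:
    """Stand-in for a mixin that wraps save() and then delegates."""

    def save(self):
        calls = super().save()
        return ["mixin.save"] + calls


class RightOrder(FakeTranscriptionMixin, FakeDocument):
    """Mixin listed first: its save() runs, then the base save()."""


class WrongOrder(FakeDocument, FakeTranscriptionMixin):
    """Mixin listed last: the base save() runs and the mixin is skipped."""
```

Here `RightOrder().save()` goes through the mixin before reaching the base class, while `WrongOrder().save()` never reaches the mixin's save() at all.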

Tests

To run the tests, check out this repository and run:

make test

Coverage

A coverage report will be generated in ./coverage_html_report/.

Contributors

  • Karl Hobley
  • Bertrand Bordage
  • Kees Hink
  • Tom Hendrikx
  • Coen van der Kamp
  • Mike Overkamp
  • Thibaud Colas
  • Dan Braghis
  • Dan Swain