Belval / Textrecognitiondatagenerator

Licence: mit
A synthetic data generator for text recognition

TextRecognitionDataGenerator

What is it for?

Generating text image samples to train OCR software. Non-Latin text is now supported! For a more thorough tutorial, see the official documentation.

What do I need to make it work?

Install the PyPI package:

pip install trdg

Afterwards, you can use trdg from the CLI. I recommend using a virtualenv instead of installing with sudo.

If you want to add another language, you can clone the repository instead, then install its dependencies with pip install -r requirements.txt

Docker image

If you would rather not have to install anything to use TextRecognitionDataGenerator, you can pull the docker image.

docker pull belval/trdg:latest

docker run -v /output/path/:/app/out/ -t belval/trdg:latest trdg [args]

The path (/output/path/) must be absolute.
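
Because the mounted path must be absolute, a small wrapper can resolve it before calling Docker. The following is a minimal sketch, not part of trdg: the function name docker_trdg_command is hypothetical, and it only assembles the docker run command shown above.

```python
from pathlib import Path


def docker_trdg_command(output_dir, trdg_args):
    """Build the `docker run` argument list with an absolute output path.

    Hypothetical helper, not part of trdg.
    """
    # Docker volume mounts need an absolute host path, so resolve it first.
    host_path = Path(output_dir).expanduser().resolve()
    return [
        "docker", "run",
        "-v", f"{host_path}:/app/out/",
        "-t", "belval/trdg:latest",
        "trdg", *trdg_args,
    ]


cmd = docker_trdg_command("out", ["-c", "1000", "-w", "5"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Docker is installed.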

New

  • Add --stroke_width argument to set the width of the text stroke (Thank you @SunHaozhe)
  • Add --stroke_fill argument to set the color of the text contour if stroke > 0 (Thank you @SunHaozhe)
  • Add --word_split argument to split on word instead of per-character. This is useful for ligature-based languages
  • Add --dict argument to specify a custom dictionary (Thank you @luh0907)
  • Add --font_dir argument to specify the fonts to use
  • Add --output_mask to output character-level mask for each image
  • Add --character_spacing to control space between characters (in pixels)
  • Add python module
  • Add --font to use only one font for all the generated images (Thank you @JulienCoutault!)
  • Add --fit and --margins for finer layout control
  • Change the text orientation with the -or parameter
  • Specify the text color range with -tc '#000000,#FFFFFF' (the quotes are required)
  • Add support for Simplified and Traditional Chinese
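
The -tc value listed above is a pair of hex colors separated by a comma. To illustrate what such a range can mean, here is a minimal sketch that parses the string and draws a random color between the two bounds, channel by channel. The function names are hypothetical, and the per-channel sampling is an assumption about the semantics, not trdg's actual implementation.

```python
import random


def parse_color_range(spec):
    """Split '#RRGGBB,#RRGGBB' into two (r, g, b) tuples."""
    def to_rgb(hex_color):
        hex_color = hex_color.strip().lstrip("#")
        if len(hex_color) != 6:
            raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
        return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

    low, high = spec.split(",")
    return to_rgb(low), to_rgb(high)


def random_color(spec, rng=random):
    """Pick each channel uniformly between the two bounds (inclusive)."""
    low, high = parse_color_range(spec)
    return tuple(rng.randint(min(a, b), max(a, b)) for a, b in zip(low, high))


low, high = parse_color_range("#000000,#FFFFFF")
print(low, high)  # (0, 0, 0) (255, 255, 255)
print(random_color("#101010,#303030"))
```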

How does it work?

Words will be randomly chosen from a dictionary of a specific language. Then an image of those words will be generated by using font, background, and modifications (skewing, blurring, etc.) as specified.
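
The first step, choosing words at random from a dictionary, can be sketched in a few lines. This is an illustration only: the word list and the function name pick_text are made up for the example, and trdg's own selection code may differ.

```python
import random


def pick_text(dictionary, word_count, rng=random):
    """Join `word_count` randomly chosen dictionary words with spaces."""
    return " ".join(rng.choice(dictionary) for _ in range(word_count))


# Stand-in for the contents of a dictionary file (assumed one word per line).
words = ["alpha", "bravo", "charlie", "delta", "echo"]
rng = random.Random(0)  # seeded so repeated runs give the same labels
for _ in range(3):
    print(pick_text(words, 5, rng))
```

Each printed line corresponds to the label of one generated image; rendering it with a font and background is the second step.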

Basic (Python module)

The usage as a Python module is very similar to the CLI, but it is more flexible if you want to include it directly in your training pipeline, and will consume less space and memory. There are 4 generators that can be used.

from trdg.generators import (
    GeneratorFromDict,
    GeneratorFromRandom,
    GeneratorFromStrings,
    GeneratorFromWikipedia,
)

# The generators use the same arguments as the CLI, only as parameters
generator = GeneratorFromStrings(
    ['Test1', 'Test2', 'Test3'],
    blur=2,
    random_blur=True
)

for img, lbl in generator:
    # Do something with the Pillow images here.
    pass  # placeholder so the loop body is valid Python
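
When wiring a generator into a training loop, it can help to cap the number of samples explicitly instead of relying on the generator's own stopping condition. A minimal sketch using itertools.islice follows. Here fake_generator is a stand-in that mimics an endless stream of (image, label) pairs so the snippet runs without trdg installed, and take_samples is a hypothetical helper name.

```python
from itertools import count, islice


def fake_generator(strings):
    """Stand-in for a trdg generator: yields (image, label) pairs forever."""
    for i in count():
        label = strings[i % len(strings)]
        # A real generator would yield a Pillow image instead of a string.
        yield f"<image of {label}>", label


def take_samples(generator, limit):
    """Collect at most `limit` (image, label) pairs from any iterable."""
    return list(islice(generator, limit))


samples = take_samples(fake_generator(["Test1", "Test2", "Test3"]), 4)
for img, lbl in samples:
    print(lbl, img)
```

The same take_samples call should work with any of the four generators, since it only assumes an iterable of pairs.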

You can see the full class definition here:

Basic (CLI)

trdg -c 1000 -w 5 -f 64

You get 1,000 randomly generated images with random text on them like:

[Example images: 1, 2, 3, 4, 5]

By default, they will be generated to out/ in the current working directory.

Text skewing

What if you want random skewing? Add -k and -rk (trdg -c 1000 -w 5 -f 64 -k 5 -rk)

[Example images: 6, 7, 8, 9, 10]

Text distortion

You can also add distortion to the generated text with -d and -do

[Example images: 23, 24, 25]

Text blurring

But scanned documents usually aren't that clear, are they? Add -bl and -rbl to apply a Gaussian blur to the generated image with a user-defined radius (here 0, 1, 2, 4):

[Example images: 11, 12, 13, 14]

Background

Maybe you want another background? Add -b to select one of the four available backgrounds: Gaussian noise (0), plain white (1), quasicrystal (2) or image (3).

[Example images: 15, 16, 17, 23]

When using the image background (3), an image from the images/ folder is randomly selected and the text is written on it.
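
When trdg is launched from a script, the numeric background codes are easy to mix up. Here is a small sketch, assuming you call the CLI through subprocess. The Background enum and the trdg_args helper are hypothetical names, while the flags (-c, -w, -f, -b) and the codes 0 to 3 are the ones described above.

```python
from enum import IntEnum


class Background(IntEnum):
    """Background codes accepted by -b, as listed above."""
    GAUSSIAN_NOISE = 0
    PLAIN_WHITE = 1
    QUASICRYSTAL = 2
    IMAGE = 3


def trdg_args(count, words, size, background=None):
    """Return the argument list for a trdg run (hypothetical helper).

    `size` is the value passed to -f (64 in the basic example).
    The -b flag is only added when a background is given.
    """
    args = ["trdg", "-c", str(count), "-w", str(words), "-f", str(size)]
    if background is not None:
        background = Background(background)  # raises ValueError for unknown codes
        args += ["-b", str(int(background))]
    return args


print(" ".join(trdg_args(1000, 5, 64, Background.QUASICRYSTAL)))
# trdg -c 1000 -w 5 -f 64 -b 2
```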

Handwritten

Or maybe you are working on an OCR for handwritten text? Add -hw! (Experimental)

[Example images: 18, 19, 20, 21, 22]

It uses a TensorFlow model trained using this excellent project by Grzego.

The project does not require TensorFlow to run if you aren't using this feature.

Dictionary

The text is chosen at random from a dictionary file (found in the dicts folder) and drawn on a white background with Gaussian noise. The resulting image is saved as [text]_[index].jpg

There are many parameters you can tune to get the results you want, so I recommend checking trdg -h for more information.
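
Because each generated file is named [text]_[index].jpg, the label can be recovered from the file name when loading a dataset. This is a minimal sketch, assuming the index is whatever follows the last underscore; parse_sample_name is a hypothetical helper, not part of trdg.

```python
from pathlib import Path


def parse_sample_name(path):
    """Split '<text>_<index>.jpg' into (text, index).

    The text itself may contain underscores, so split on the last one only.
    """
    stem = Path(path).stem
    text, sep, index = stem.rpartition("_")
    if not sep or not index.isdigit():
        raise ValueError(f"unexpected file name: {path!r}")
    return text, int(index)


print(parse_sample_name("out/hello world_12.jpg"))  # ('hello world', 12)
print(parse_sample_name("out/snake_case_3.jpg"))    # ('snake_case', 3)
```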

Create images with Chinese text

It is simple! Just run: trdg -l cn -c 1000 -w 5

Generated texts come both in simplified and traditional Chinese scripts.

Traditional:

[Example image: 27]

Simplified:

[Example image: 28]

Create images with Japanese text

It is simple! Just run: trdg -l ja -c 1000 -w 5

Output

[Example image: 29]

Add new fonts

The script picks a font at random from the fonts directory.

Directory      Languages
fonts/latin    English, French, Spanish, German
fonts/cn       Chinese
fonts/ko       Korean
fonts/ja       Japanese

Simply add/remove fonts until you get the desired output.

If you want to add a new non-latin language, the amount of work is minimal.

  1. Create a new folder named with your language's two-letter code
  2. Add a .ttf font to it
  3. Edit run.py to add an if statement in load_fonts()
  4. Add a text file in dicts with the same two-letter code
  5. Run the tool as you normally would, but add -l with your two-letter code

It only supports .ttf for now.
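
Steps 1 to 3 amount to mapping a language code to a folder of .ttf files. The sketch below shows one way such a lookup could be written. It is not the code in run.py, and the function name fonts_for_language and the fonts_root parameter are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def fonts_for_language(fonts_root, lang):
    """Return the sorted .ttf files under <fonts_root>/<lang>.

    Hypothetical helper; raises if the folder or the fonts are missing.
    """
    font_dir = Path(fonts_root) / lang
    if not font_dir.is_dir():
        raise FileNotFoundError(f"no font folder for language {lang!r}: {font_dir}")
    fonts = sorted(p for p in font_dir.iterdir() if p.suffix.lower() == ".ttf")
    if not fonts:
        raise FileNotFoundError(f"no .ttf fonts found in {font_dir}")
    return fonts


# Demonstrate with a throwaway folder instead of the real fonts directory.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "ko").mkdir()
    (Path(root) / "ko" / "Example.ttf").touch()
    (Path(root) / "ko" / "notes.txt").touch()
    print([p.name for p in fonts_for_language(root, "ko")])  # ['Example.ttf']
```

Latin-script languages share fonts/latin according to the table above, so a real lookup would need that special case.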

Benchmarks

Number of images generated per second.

  • Intel Core i7-4710HQ @ 2.50GHz + SSD (-c 1000 -w 1)
    • -t 1 : 363 img/s
    • -t 2 : 694 img/s
    • -t 4 : 1300 img/s
    • -t 8 : 1500 img/s
  • AMD Ryzen 7 1700 @ 4.0GHz + SSD (-c 1000 -w 1)
    • -t 1 : 558 img/s
    • -t 2 : 1045 img/s
    • -t 4 : 2107 img/s
    • -t 8 : 3297 img/s
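
The figures above also show how throughput scales with the thread count. As a worked example, the snippet below computes the speedup over -t 1 and the parallel efficiency (speedup divided by thread count) from the reported numbers. Nothing here is measured anew, and the scaling function is only an illustrative helper.

```python
# Images per second reported above, keyed by thread count (-t).
benchmarks = {
    "Intel Core i7-4710HQ": {1: 363, 2: 694, 4: 1300, 8: 1500},
    "AMD Ryzen 7 1700": {1: 558, 2: 1045, 4: 2107, 8: 3297},
}


def scaling(rates):
    """Map thread count -> (speedup over 1 thread, efficiency per thread)."""
    base = rates[1]
    return {t: (r / base, r / base / t) for t, r in rates.items()}


for cpu, rates in benchmarks.items():
    print(cpu)
    for threads, (speedup, efficiency) in scaling(rates).items():
        print(f"  -t {threads}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

At -t 8 the i7 reaches roughly 4.1x and the Ryzen roughly 5.9x, which is consistent with the i7-4710HQ having 4 physical cores and the Ryzen 7 1700 having 8.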

Contributing

  1. Create an issue describing the feature you'll be working on
  2. Code said feature
  3. Create a pull request

Feature requests & issues

If anything is missing, unclear, or simply not working, open an issue on the repository.

What is left to do?

  • Better background generation
  • Better handwritten text generation
  • More customization parameters (mostly regarding background)