karan1149 / crohme-data-extractor

Licence: other
A modified extractor for the CROHME handwritten math symbols dataset.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to crohme-data-extractor

qresExtract
Qt binary resource (qres) extractor
Stars: ✭ 26 (+44.44%)
Mutual labels:  extractor, extract
catseye
Neural network library written in C and Javascript
Stars: ✭ 29 (+61.11%)
Mutual labels:  mnist
RecursiveExtractor
RecursiveExtractor is a .NET Standard 2.0 archive extraction Library, and Command Line Tool which can process 7zip, ar, bzip2, deb, gzip, iso, rar, tar, vhd, vhdx, vmdk, wim, xzip, and zip archives and any nested combination of the supported formats.
Stars: ✭ 109 (+505.56%)
Mutual labels:  extractor
MNIST-adversarial-images
Create adversarial images to fool a MNIST classifier in TensorFlow
Stars: ✭ 13 (-27.78%)
Mutual labels:  mnist
creds harvester
Password Recovery Toolkit For Windows Written in Python 3
Stars: ✭ 16 (-11.11%)
Mutual labels:  extract
crittr
High performance critical css extraction with a great configuration abilities
Stars: ✭ 39 (+116.67%)
Mutual labels:  extract
playing with vae
Comparing FC VAE / FCN VAE / PCA / UMAP on MNIST / FMNIST
Stars: ✭ 53 (+194.44%)
Mutual labels:  mnist
cuda-neural-network
Convolutional Neural Network with CUDA (MNIST 99.23%)
Stars: ✭ 118 (+555.56%)
Mutual labels:  mnist
numpy-neuralnet-exercise
Implementation of key concepts of neuralnetwork via numpy
Stars: ✭ 49 (+172.22%)
Mutual labels:  mnist
gradient-boosted-decision-tree
A Python implementation of GBDT (Gradient Boosted Decision Tree)
Stars: ✭ 49 (+172.22%)
Mutual labels:  mnist
MNIST
Handwritten digit recognizer using a feed-forward neural network and the MNIST dataset of 70,000 human-labeled handwritten digits.
Stars: ✭ 28 (+55.56%)
Mutual labels:  mnist
CTR-tools
Crash Team Racing (PS1) tools - a C# framework by DCxDemo and a set of tools to parse files found in the original kart racing game by Naughty Dog.
Stars: ✭ 93 (+416.67%)
Mutual labels:  extractor
PaperSynth
Handwritten text to synths!
Stars: ✭ 18 (+0%)
Mutual labels:  mnist
ingredients
Extract recipe ingredients from any recipe website on the internet.
Stars: ✭ 96 (+433.33%)
Mutual labels:  extractor
gan-vae-pretrained-pytorch
Pretrained GANs + VAEs + classifiers for MNIST/CIFAR in pytorch.
Stars: ✭ 134 (+644.44%)
Mutual labels:  mnist
CVparser
CVparser is software for parsing or extracting data out of CV/resumes.
Stars: ✭ 28 (+55.56%)
Mutual labels:  extract
icoextract
Extract icons from Windows PE files (.exe/.dll)
Stars: ✭ 56 (+211.11%)
Mutual labels:  extract
image-defect-detection-based-on-CNN
TensorBasicModel
Stars: ✭ 17 (-5.56%)
Mutual labels:  mnist
acefile
read/test/extract ACE 1.0 and 2.0 archives in pure python
Stars: ✭ 67 (+272.22%)
Mutual labels:  extract
colorama
A Gem for extracting the most prevalent colors from an image
Stars: ✭ 20 (+11.11%)
Mutual labels:  extract

CROHME Data Extractor

This is a series of scripts for extracting and analyzing the CROHME dataset of handwritten math symbols. The symbols are stored in a stroke-based format (InkML), so they can be scaled and rendered at any size, which lets the output serve as a drop-in replacement for many datasets, including MNIST. Image extraction settings can be easily configured to make the images work better with certain types of models (e.g. CNNs).

This is a slightly modified version of the extractor found here. There are two main modifications. First, lines are drawn with PIL.ImageDraw instead of scikit-image, so the extract script now draws realistic lines of a desired thickness (default 3px). Second, all of the output images are now saved to a folder instead of a pickle binary. Realistic lines matter because some models, especially CNNs, seem to do poorly when lines have no thickness (there are no edges to detect). With images extracted by this modified version, I was able to run an AC-GAN model to conditionally generate new math symbols, which I was not able to do with the original extractor.
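To illustrate the line-thickness point in the paragraph above, here is a minimal sketch of rasterizing pen strokes into a square bitmap with PIL.ImageDraw. It is not the code from extract.py: the function name render_strokes, the padding parameter, and the stroke representation (a list of strokes, each a list of (x, y) points) are assumptions made for this example.

```python
from PIL import Image, ImageDraw


def render_strokes(strokes, box_size=32, thickness=3, padding=2):
    """Rasterize pen strokes into a square grayscale bitmap.

    Hypothetical helper: `strokes` is a list of strokes, each a list of
    (x, y) points in arbitrary units.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y

    # Uniform scale so the longer side of the symbol fits inside the box.
    extent = max(width, height) or 1.0
    scale = (box_size - 1 - 2 * padding) / extent

    # Offsets that center the scaled symbol in the box.
    off_x = (box_size - 1 - width * scale) / 2
    off_y = (box_size - 1 - height * scale) / 2

    image = Image.new("L", (box_size, box_size), color=0)
    draw = ImageDraw.Draw(image)
    for stroke in strokes:
        points = [(off_x + (x - min_x) * scale, off_y + (y - min_y) * scale)
                  for x, y in stroke]
        if len(points) == 1:
            draw.point(points, fill=255)  # a single dot, e.g. a decimal point
        else:
            draw.line(points, fill=255, width=thickness)
    return image


# Two crossing strokes forming an "x".
strokes = [[(0, 0), (10, 10)], [(0, 10), (10, 0)]]
thick = render_strokes(strokes, thickness=3)
thin = render_strokes(strokes, thickness=1)
```

With thickness=3 the same strokes cover noticeably more pixels than with thickness=1, which gives edge-based models more to work with.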

Setup

Python version: 3.5.

  1. Extract the contents of CROHME_full_v2.zip (found inside the data directory) before running any of the scripts below.

  2. Install the dependencies listed in requirements.txt with pip (the Python package manager) using the following command:

pip install -U -r requirements.txt

Scripts info

  1. The extract.py script extracts square-shaped bitmaps.
    With this script, you have control over the data being extracted, namely:

    • Extracting data belonging to a certain dataset version.
    • Extracting certain categories of classes, like digits or greek (see categories.txt for details).
    • Extracting images with line strokes drawn at any desired thickness (default 3px).

    Usage: python extract.py <out_format> <box_size> <dataset_version=2013> <category=all>

    Example usage: python extract.py pixels 32 2011+2012+2013 digits+operators+lowercase_letters+greek

  2. The visualize.py script plots a single figure containing a random batch of your extracted data.

    Usage: python visualize.py <number_of_samples> <number_of_columns=4>

    Example usage: python visualize.py 40 8

    Plot: crohme_extractor_plot

  3. The extract_hog.py script extracts HoG features.
    This script accepts one command line argument, hog_cell_size.
    hog_cell_size corresponds to the pixels_per_cell parameter of the skimage.feature.hog function, which is used to extract the HoG features.
    Example of script execution: python extract_hog.py 5 <-- pixels_per_cell=(5, 5)
    This script loads data previously dumped by parse.py and dumps its outputs (train, test) separately.

  4. The extract_phog.py script extracts PHoG features.
    For PHoG features, HoG feature maps computed with different cell sizes are concatenated into a single feature vector.
    This script therefore takes an arbitrary number of hog_cell_size values (the corresponding HoG features have to be extracted beforehand with extract_hog.py).
    Example of script execution: python extract_phog.py 5 10 20 <-- loads HoGs with 5x5, 10x10 and 20x20 cell sizes, respectively.

  5. The histograms folder contains histograms showing the distribution of labels for different label categories. These diagrams help you better understand the extracted data.
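The HoG and PHoG steps described in items 3 and 4 can be sketched as follows. This is an illustration, not the code from extract_hog.py or extract_phog.py: the helper names hog_features and phog_features are hypothetical, cells_per_block=(1, 1) and orientations=9 are assumptions chosen so that large cell sizes still fit a small image, and the features are recomputed here instead of being loaded from previously dumped files.

```python
import numpy as np
from skimage.feature import hog


def hog_features(image, cell_size):
    """HoG descriptor of one image for a single square cell size."""
    return hog(
        image,
        orientations=9,
        pixels_per_cell=(cell_size, cell_size),
        cells_per_block=(1, 1),  # assumption: one cell per block
    )


def phog_features(image, cell_sizes):
    """Concatenate HoG descriptors computed at several cell sizes."""
    return np.concatenate([hog_features(image, c) for c in cell_sizes])


# Synthetic 40x40 image with a thick diagonal stroke (stand-in for a symbol).
image = np.zeros((40, 40), dtype=float)
for i in range(5, 35):
    image[i, i - 1:i + 2] = 1.0

single = hog_features(image, 5)            # 8 x 8 cells x 9 orientations
pyramid = phog_features(image, [5, 10, 20])
print(single.shape, pyramid.shape)
```

Under these assumptions, a 40x40 image yields 576 values for 5x5 cells, 144 for 10x10 cells and 36 for 20x20 cells, so the concatenated PHoG vector has 756 entries.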

Distribution of labels

[Figure: all_labels_distribution]

Labels were combined from train and test sets.