rkcosmos / Deepcut

Licence: mit
A Thai word tokenization library using Deep Neural Network

Deepcut

A Thai word tokenization library using Deep Neural Network.

[Figure: model structure]

What's new

  • v0.7.0 Migrated from Keras to TensorFlow 2.0
  • v0.6.0 Added support for excluding stop words and for a custom dictionary; updated weights with semi-supervised learning
  • v0.5.2 Better pretrained weight matrix
  • v0.5.1 Faster tokenization through code refactoring
  • The examples folder provides a starter script for a Thai text classification problem
  • DeepcutJS lets you try tokenizing Thai text in a web browser

Performance

The convolutional neural network is trained on 90% of NECTEC's BEST corpus (which consists of four sections: article, news, novel, and encyclopedia) and tested on the remaining 10%. It is a binary classification model that predicts whether each character is the beginning of a word. The results, calculated for the 'true' class only, are as follows:

Precision  Recall  F1
97.8%      98.5%   98.1%
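
The evaluation above treats segmentation as a per-character binary classification. As a minimal sketch of how such scores can be computed (this is not the project's evaluation script; the helper names `boundary_labels` and `precision_recall_f1` and the sample segmentations are assumptions made for illustration):

```python
def boundary_labels(words):
    """Return one 0/1 label per character: 1 if the character starts a word."""
    labels = []
    for word in words:
        if not word:
            continue  # skip empty tokens so they do not add a spurious boundary
        labels.extend([1] + [0] * (len(word) - 1))
    return labels


def precision_recall_f1(true_words, predicted_words):
    """Score predicted word beginnings ('true' class only) against a reference."""
    y_true = boundary_labels(true_words)
    y_pred = boundary_labels(predicted_words)
    if len(y_true) != len(y_pred):
        raise ValueError("both segmentations must cover the same text")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference and predicted segmentations of the same text.
reference = ['ตัดคำ', 'ได้', 'ดี', 'มาก']
predicted = ['ตัด', 'คำ', 'ได้', 'ดีมาก']
precision, recall, f1 = precision_recall_f1(reference, predicted)
print(precision, recall, f1)
```

Here a "character" is a Python code point, so Thai vowel and tone marks each count as one position; whether the original evaluation counts characters the same way is an assumption.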

Installation

Install the stable release with pip (TensorFlow version 2.0):

pip install deepcut

For the latest development release (recommended):

pip install git+https://github.com/rkcosmos/deepcut.git

If you want to use TensorFlow version 1.x and standalone Keras, you will need:

pip install deepcut==0.6.1

Docker

First, install and run Docker on your machine. Then build and run deepcut as follows:

docker build -t deepcut:dev . # build the Docker image
docker run --rm -it deepcut:dev # run the container; -it makes it interactive, --rm removes the container and its file system on exit

This opens a shell where you can experiment with deepcut.

Usage

import deepcut
deepcut.tokenize('ตัดคำได้ดีมาก')

The output is a list:

['ตัดคำ','ได้','ดี','มาก']

Bag-of-word transformation

We implemented a tokenizer that works similarly to CountVectorizer from scikit-learn. Here is an example:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
X = tokenizer.fit_transform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 6 CSR sparse matrix
print(tokenizer.vocabulary_) # {'บิน': 0, 'ได้': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}, maps each token to its column index in the sparse matrix

X_test = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน']) # use the fitted tokenizer vocabulary to transform new text
print(X_test.shape) # 2 x 6 CSR sparse matrix

tokenizer.save_model('tokenizer.pickle') # save the tokenizer to use later

You can load the saved tokenizer later:

import deepcut
tokenizer = deepcut.load_model('tokenizer.pickle')
X_sample = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน'])
print(X_sample.shape) # the same 2 x 6 CSR sparse matrix as X_test
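
To make the shapes in the example above concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not how DeepcutTokenizer is implemented: the helper names `build_vocabulary` and `count_rows` are hypothetical, the token lists are hand-written stand-ins for the tokenizer's output, and the rows are dense lists rather than a CSR sparse matrix.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign a column index to each distinct token, in first-seen order."""
    vocabulary = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def count_rows(tokenized_docs, vocabulary):
    """Build one dense count row per document; unknown tokens are ignored."""
    rows = []
    for tokens in tokenized_docs:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            if token in vocabulary:
                row[vocabulary[token]] = count
        rows.append(row)
    return rows


# Hand-written segmentations standing in for the tokenizer's output (assumed).
train_docs = [['ฉัน', 'บิน', 'ได้'], ['ฉัน', 'กิน', 'ข้าว'], ['ฉัน', 'อยาก', 'บิน']]
test_docs = [['ฉัน', 'กิน'], ['ฉัน', 'ไม่', 'อยาก', 'บิน']]

vocabulary = build_vocabulary(train_docs)     # six distinct tokens, six columns
X = count_rows(train_docs, vocabulary)        # 3 rows x 6 columns
X_test = count_rows(test_docs, vocabulary)    # 2 rows x 6 columns; 'ไม่' is dropped
print(len(X), len(X[0]), len(X_test), len(X_test[0]))
```

The column order here follows first appearance and will generally differ from the indices in vocabulary_ above. The point is only that new text is mapped onto the columns learned during fitting, which is why the transformed test data keeps six columns.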

Custom Dictionary

You can supply a custom dictionary by giving the path to a .txt file with one word per line, like the following:

ขี้เกียจ
โรงเรียน
ดีมาก

Pass the file path as the custom_dict argument of the tokenize function, e.g.

deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict='/path/to/custom_dict.txt')
deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=['ดีมาก']) # alternatively, provide the custom dictionary as a list of words
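
As a minimal sketch of preparing such a file from Python, the snippet below writes the word list one word per line and reads it back. The temporary path and the UTF-8 encoding are assumptions for illustration; the source only specifies a .txt file with one word per line.

```python
import os
import tempfile

custom_words = ['ขี้เกียจ', 'โรงเรียน', 'ดีมาก']

# Write one word per line to a temporary location (assumed path and encoding).
dict_path = os.path.join(tempfile.mkdtemp(), 'custom_dict.txt')
with open(dict_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(custom_words) + '\n')

# Read the file back to confirm the layout: one non-empty word per line.
with open(dict_path, encoding='utf-8') as f:
    loaded_words = [line.strip() for line in f if line.strip()]

print(loaded_words)

# With deepcut installed, either form from the example above can then be used:
#   deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=dict_path)
#   deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=loaded_words)
```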

Notes

Some texts might not be segmented as expected (e.g. 'โรงเรียน' -> ['โรง', 'เรียน']). This happens because:

  • The BEST corpus (training data) tokenizes words this way (it uses 'compound words' as a criterion for segmentation).
  • The words are unseen or new. Ideally, this would be cured by a better corpus, but that is not very practical, so I am thinking of using semi-supervised learning to incorporate new examples.

Suggestions and comments are welcome; please post them in the issues section.

Citations

If you use deepcut in your project or publication, please cite the library as follows:

Rakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn,
Pattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, Krichkorn Oparad.
(2019, September 23). DeepCut: A Thai word tokenization library using Deep Neural Network. Zenodo. http://doi.org/10.5281/zenodo.3457707

or use the BibTeX entry:

@misc{Kittinaradorn2019,
    author       = {Rakpong Kittinaradorn and Titipat Achakulvisut and Korakot Chaovavanich and Kittinan Srithaworn and Pattarawat Chormai and Chanwit Kaewkasi and Tulakan Ruangrong and Krichkorn Oparad},
    title        = {{DeepCut: A Thai word tokenization library using Deep Neural Network}},
    month        = Sep,
    year         = 2019,
    doi          = {10.5281/zenodo.3457707},
    version      = {1.0},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3457707}
}

Partner Organizations

  • True Corporation

We are open to contributions and collaboration.
