rkcosmos / Deepcut

Licence: MIT

A Thai word tokenization library using Deep Neural Network

Programming Languages

python

Labels

deep-learning tensorflow keras deep-neural-networks segmentation keras-tensorflow

Projects that are alternatives of or similar to Deepcut

Keras Unet

Helper package with multiple U-Net implementations in Keras as well as useful utility tools helpful when working with image semantic segmentation tasks. This library and underlying tools come from multiple projects I performed working on semantic segmentation tasks

Stars: ✭ 196 (-40.61%)

Mutual labels: deep-neural-networks, segmentation, keras-tensorflow

Dkeras

Distributed Keras Engine, Make Keras faster with only one line of code.

Stars: ✭ 181 (-45.15%)

Mutual labels: deep-neural-networks, keras-tensorflow

Paddlex

PaddlePaddle End-to-End Development Toolkit (a full-workflow deep learning development tool for PaddlePaddle)

Stars: ✭ 3,399 (+930%)

Mutual labels: deep-neural-networks, segmentation

Trixi

Manage your machine learning experiments with trixi - modular, reproducible, high fashion. An experiment infrastructure optimized for PyTorch, but flexible enough to work for your framework and your tastes.

Stars: ✭ 211 (-36.06%)

Mutual labels: deep-neural-networks, segmentation

Hyperdensenet

This repository contains the code of HyperDenseNet, a hyper-densely connected CNN to segment medical images in multi-modal image scenarios.

Stars: ✭ 124 (-62.42%)

Mutual labels: deep-neural-networks, segmentation

Kiu Net Pytorch

Official Pytorch Code of KiU-Net for Image Segmentation - MICCAI 2020 (Oral)

Stars: ✭ 134 (-59.39%)

Mutual labels: deep-neural-networks, segmentation

Sparse Structured Attention

Sparse and structured neural attention mechanisms

Stars: ✭ 198 (-40%)

Mutual labels: deep-neural-networks, segmentation

Cnn Paper2

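🎨 🎨 Deep learning convolutional neural network tutorial: image recognition, object detection, semantic segmentation, instance segmentation, face recognition, neural style transfer, GANs, and more 🎨🎨 https://dataxujing.github.io/CNN-paper2/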

Stars: ✭ 77 (-76.67%)

Mutual labels: deep-neural-networks, segmentation

Deep Unet For Satellite Image Segmentation

Satellite Imagery Feature Detection with SpaceNet dataset using deep UNet

Stars: ✭ 227 (-31.21%)

Mutual labels: deep-neural-networks, keras-tensorflow

Brainy

Brainy is a virtual MRI analyzer. Just upload the MRI scan file and get 3 different classes of tumors detected and segmented. In Beta.

Stars: ✭ 29 (-91.21%)

Mutual labels: segmentation, keras-tensorflow

Brain-MRI-Segmentation

Smart India Hackathon 2019 project given by the Department of Atomic Energy

Stars: ✭ 29 (-91.21%)

Mutual labels: segmentation, keras-tensorflow

Crfasrnn pytorch

CRF-RNN PyTorch version http://crfasrnn.torr.vision

Stars: ✭ 102 (-69.09%)

Mutual labels: deep-neural-networks, segmentation

Har Keras Cnn

Human Activity Recognition (HAR) with 1D Convolutional Neural Network in Python and Keras

Stars: ✭ 97 (-70.61%)

Mutual labels: deep-neural-networks, keras-tensorflow

Invoicenet

Deep neural network to extract intelligent information from invoice documents.

Stars: ✭ 1,886 (+471.52%)

Mutual labels: deep-neural-networks, keras-tensorflow

Niftynet

[unmaintained] An open-source convolutional neural networks platform for research in medical image analysis and image-guided therapy

Stars: ✭ 1,276 (+286.67%)

Mutual labels: deep-neural-networks, segmentation

Segmentation models

Segmentation models with pretrained backbones. Keras and TensorFlow Keras.

Stars: ✭ 3,575 (+983.33%)

Mutual labels: segmentation, keras-tensorflow

Bidaf Keras

Bidirectional Attention Flow for Machine Comprehension implemented in Keras 2

Stars: ✭ 60 (-81.82%)

Mutual labels: deep-neural-networks, keras-tensorflow

Pointcnn

PointCNN: Convolution On X-Transformed Points (NeurIPS 2018)

Stars: ✭ 1,120 (+239.39%)

Mutual labels: deep-neural-networks, segmentation

Pytorch Unet

Tunable U-Net implementation in PyTorch

Stars: ✭ 224 (-32.12%)

Mutual labels: deep-neural-networks, segmentation

Dlpython course

Examples for the course "Programming Deep Neural Networks in Python"

Stars: ✭ 266 (-19.39%)

Mutual labels: deep-neural-networks, keras-tensorflow

Deepcut

A Thai word tokenization library using Deep Neural Network.

What's new

v0.7.0 Migrate from keras to TensorFlow 2.0
v0.6.0 Allow excluding stop words and custom dictionary, updated weight with semi-supervised learning
v0.5.2 Better pretrained weight matrix
v0.5.1 Faster tokenization through code refactoring
The examples folder provides a starter script for a Thai text classification problem
DeepcutJS lets you try tokenizing Thai text in a web browser

Performance

The convolutional neural network is trained on 90% of NECTEC's BEST corpus (which consists of four sections: article, news, novel, and encyclopedia) and tested on the remaining 10%. It is a binary classification model that predicts whether each character is the beginning of a word. The results, calculated on the 'true' class only, are as follows:

Precision	Recall	F1
97.8%	98.5%	98.1%
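
To make the evaluation setup concrete, here is a minimal sketch of how per-character word-boundary labels can be scored on the positive class. It is not the evaluation script used for the table above: the helper names `boundary_labels` and `precision_recall_f1` and the example segmentations are hypothetical, and a "character" is assumed to be one Python code point.

```python
def boundary_labels(tokens):
    """One label per character: 1 for the first character of a word, else 0."""
    labels = []
    for token in tokens:
        if not token:  # skip empty tokens so they do not add a spurious boundary
            continue
        labels.extend([1] + [0] * (len(token) - 1))
    return labels


def precision_recall_f1(y_true, y_pred):
    """Score the positive ('true') class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference and predicted segmentations of the same string
gold = ['ตัดคำ', 'ได้', 'ดี', 'มาก']
predicted = ['ตัด', 'คำ', 'ได้', 'ดีมาก']

precision, recall, f1 = precision_recall_f1(
    boundary_labels(gold), boundary_labels(predicted)
)
print(precision, recall, f1)

# Consistency check on the reported table: F1 = 2PR / (P + R)
reported_f1 = 2 * 0.978 * 0.985 / (0.978 + 0.985)
print(round(reported_f1 * 100, 1))
```

In this toy case the prediction adds one wrong boundary and misses one, so precision and recall are both 3/4. The last two lines confirm that the reported precision and recall are consistent with the reported F1.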

Installation

Install the stable release with pip (requires TensorFlow 2.0):

pip install deepcut

For the latest development release (recommended):

pip install git+https://github.com/rkcosmos/deepcut.git

If you want to use TensorFlow 1.x and standalone Keras, you will need:

pip install deepcut==0.6.1

Docker

First, install and run Docker on your machine. Then build and run deepcut as follows:

docker build -t deepcut:dev . # build the Docker image
docker run --rm -it deepcut:dev # run the container; -it makes it interactive, --rm removes the container and its file system on exit

This opens a shell in which you can try deepcut.

Usage

import deepcut
deepcut.tokenize('ตัดคำได้ดีมาก')

The output is a list:

['ตัดคำ','ได้','ดี','มาก']

Bag-of-word transformation

We implemented a tokenizer that works similarly to CountVectorizer from scikit-learn. Here is an example:

from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1,1),
                             max_df=1.0, min_df=0.0)
X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 6 CSR sparse matrix
print(tokenizer.vocabulary_) # {'บิน': 0, 'ได้': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}, column index of sparse matrix

X_test = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน']) # use the built tokenizer vocabulary to transform new text
print(X_test.shape) # 2 x 6 CSR sparse matrix

tokenizer.save_model('tokenizer.pickle') # save the tokenizer to use later

You can load the saved tokenizer later:

tokenizer = deepcut.load_model('tokenizer.pickle')
X_sample = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน'])
print(X_sample.shape) # getting the same 2 x 6 CSR sparse matrix as X_test
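
To illustrate what the vocabulary and the matrix shapes in this section mean, here is a minimal standard-library sketch of a bag-of-words count matrix built from already-tokenized documents. It is not DeepcutTokenizer's implementation: the helpers `build_vocabulary` and `count_matrix` are hypothetical, the token lists are written by hand, the result is a dense list of lists rather than a CSR sparse matrix, and column indices are assigned in order of first appearance, which differs from the ordering shown above.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def count_matrix(tokenized_docs, vocabulary):
    """Return one row of token counts per document; unknown tokens are skipped."""
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(t for t in tokens if t in vocabulary)
        row = [0] * len(vocabulary)
        for token, n in counts.items():
            row[vocabulary[token]] = n
        rows.append(row)
    return rows


# Hand-written token lists (assumed segmentation, for illustration only)
train_docs = [['ฉัน', 'บิน', 'ได้'], ['ฉัน', 'กิน', 'ข้าว'], ['ฉัน', 'อยาก', 'บิน']]
test_docs = [['ฉัน', 'กิน'], ['ฉัน', 'ไม่', 'อยาก', 'บิน']]

vocabulary = build_vocabulary(train_docs)
X = count_matrix(train_docs, vocabulary)
X_test = count_matrix(test_docs, vocabulary)

print(len(X), len(X[0]))            # rows x columns for the training documents
print(len(X_test), len(X_test[0]))  # new text reuses the same columns
```

The token 'ไม่' does not occur in the training documents, so it has no column and is dropped when the new text is transformed. This is why the transformed matrix keeps six columns.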

Custom Dictionary

You can supply a custom dictionary as a path to a .txt file with one word per line, for example:

ขี้เกียจ
โรงเรียน
ดีมาก

Pass the file path as the custom_dict argument of the tokenize function, e.g.:

deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict='/path/to/custom_dict.txt')
deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=['ดีมาก']) # alternatively, provide the custom words as a list
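
As a small sketch of preparing such a file, the snippet below writes a word list to a UTF-8 text file with one word per line and reads it back. The temporary directory and the file name `custom_dict.txt` are arbitrary choices for illustration. The snippet does not call deepcut itself.

```python
import os
import tempfile

custom_words = ['ขี้เกียจ', 'โรงเรียน', 'ดีมาก']

# Write one word per line; explicit UTF-8 avoids platform-dependent encodings
dict_path = os.path.join(tempfile.mkdtemp(), 'custom_dict.txt')
with open(dict_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(custom_words) + '\n')

# Read the file back, ignoring blank lines, to confirm the round trip
with open(dict_path, encoding='utf-8') as f:
    loaded_words = [line.strip() for line in f if line.strip()]

print(loaded_words == custom_words)
```

The resulting `dict_path` can then be passed as the custom_dict argument shown above.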

Notes

Some text might not be segmented as expected (e.g. 'โรงเรียน' -> ['โรง', 'เรียน']). This happens for one of two reasons:

The BEST corpus (the training data) tokenizes words this way (it uses 'compound words' as a criterion for segmentation).
The words are unseen or new. Ideally, this would be cured by a better corpus, but that is not very practical, so I am considering semi-supervised learning to incorporate new examples.

Suggestions and comments are welcome; please post them in the issues section.

Citations

If you use deepcut in your project or publication, please cite the library as follows

Rakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn,
Pattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, Krichkorn Oparad.
(2019, September 23). DeepCut: A Thai word tokenization library using Deep Neural Network. Zenodo. http://doi.org/10.5281/zenodo.3457707

or BibTeX entry:

@misc{Kittinaradorn2019,
    author       = {Rakpong Kittinaradorn and Titipat Achakulvisut and Korakot Chaovavanich and Kittinan Srithaworn and Pattarawat Chormai and Chanwit Kaewkasi and Tulakan Ruangrong and Krichkorn Oparad},
    title        = {{DeepCut: A Thai word tokenization library using Deep Neural Network}},
    month        = Sep,
    year         = 2019,
    doi          = {10.5281/zenodo.3457707},
    version      = {1.0},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3457707}
}

Partner Organizations

True Corporation

We are open to contributions and collaboration.

Stars: ✭ 330
