
dsindex / segm-lstm

Licence: other
[deprecated] reference code for string segmentation using LSTM(tensorflow)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to segm-lstm

Ffn
Flood-Filling Networks for instance segmentation in 3d volumes.
Stars: ✭ 252 (+1226.32%)
Mutual labels:  segmentation
subpixel-embedding-segmentation
PyTorch Implementation of Small Lesion Segmentation in Brain MRIs with Subpixel Embedding (ORAL, MICCAIW 2021)
Stars: ✭ 22 (+15.79%)
Mutual labels:  segmentation
Stock-Market-Predictor
Stock Market Predictor with LSTM network. Web scraping and analyzing tools (ohlc, mean)
Stars: ✭ 28 (+47.37%)
Mutual labels:  lstm-model
Meetup-Content
Entirety.ai Intuition to Implementation Meetup Content.
Stars: ✭ 33 (+73.68%)
Mutual labels:  lstm-model
Probabilistic-RNN-DA-Classifier
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Stars: ✭ 22 (+15.79%)
Mutual labels:  lstm-model
Shadowless
A Fast and Open Source Autonomous Perception System.
Stars: ✭ 29 (+52.63%)
Mutual labels:  segmentation
Pointrend Pytorch
A PyTorch implementation of PointRend: Image Segmentation as Rendering
Stars: ✭ 249 (+1210.53%)
Mutual labels:  segmentation
argus-tgs-salt
Kaggle | 14th place solution for TGS Salt Identification Challenge
Stars: ✭ 73 (+284.21%)
Mutual labels:  segmentation
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing a LSTM-based model coded in PyTorch. In order to provide a better understanding of the model, it will be used a Tweets dataset provided by Kaggle.
Stars: ✭ 45 (+136.84%)
Mutual labels:  lstm-model
point-cloud-segmentation
TF2 implementation of PointNet for segmenting point clouds
Stars: ✭ 33 (+73.68%)
Mutual labels:  segmentation
Deep-Learning
This repo provides projects on deep-learning mainly using Tensorflow 2.0
Stars: ✭ 22 (+15.79%)
Mutual labels:  lstm-model
recurrent-neural-net
A recurrent (LSTM) neural network in C
Stars: ✭ 68 (+257.89%)
Mutual labels:  lstm-model
dd-ml-segmentation-benchmark
DroneDeploy Machine Learning Segmentation Benchmark
Stars: ✭ 179 (+842.11%)
Mutual labels:  segmentation
Predictive-Maintenance
time-series prediction for predictive maintenance
Stars: ✭ 28 (+47.37%)
Mutual labels:  lstm-model
BCNet
Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [CVPR 2021]
Stars: ✭ 434 (+2184.21%)
Mutual labels:  segmentation
Tensorflow Enet
TensorFlow implementation of ENet
Stars: ✭ 251 (+1221.05%)
Mutual labels:  segmentation
GIBBON
The Geometry and Image-Based Bioengineering add-On for MATLAB
Stars: ✭ 132 (+594.74%)
Mutual labels:  segmentation
blindassist-ios
BlindAssist iOS app
Stars: ✭ 34 (+78.95%)
Mutual labels:  segmentation
MITK-Diffusion
MITK Diffusion - Official part of the Medical Imaging Interaction Toolkit
Stars: ✭ 47 (+147.37%)
Mutual labels:  segmentation
DocuNet
Code and dataset for the IJCAI 2021 paper "Document-level Relation Extraction as Semantic Segmentation".
Stars: ✭ 84 (+342.11%)
Mutual labels:  segmentation

segm-lstm

  • description

    • string segmentation (auto-spacing) using an LSTM (TensorFlow)
      • input
        • string, ex) '이것을띄어쓰기하면어떻게될까요'
      • output
        • string, ex) '이것을 띄어쓰기하면 어떻게 될까요'
    • model
      • x : '이것을 띄어쓰기하면 어떻게 될까요'
      • y : '0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0'
        • 1 : the next char is a space
        • 0 : the next char is not a space
      • learn to predict the tag sequence (a minimal tagging sketch follows this list)
    • TensorFlow version : 1.0
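a minimal sketch (not code from this repo) of that tagging scheme: tag 1 means the character right after the current one is a space, and predicted tags are turned back into spaces at decoding time.

# sketch only; function names are illustrative, not from train.py/inference.py
def build_tags(spaced):
    # tag[i] = 1 if the character right after position i is a space, else 0
    return [1 if spaced[i + 1:i + 2] == ' ' else 0 for i in range(len(spaced))]

def insert_spaces(raw, tags):
    # raw: input string without spaces; tags: one predicted 0/1 per character
    out = []
    for ch, t in zip(raw, tags):
        out.append(ch)
        if t == 1:
            out.append(' ')
    return ''.join(out).strip()

assert build_tags('이것을 띄어쓰기하면 어떻게 될까요') == \
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]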
  • reference codes

    • sketch.py
    $ python sketch.py
    ...
    step : 970,cost : 0.0117462
    step : 980,cost : 0.0115485
    step : 990,cost : 0.0113553
    out = 이것을 띄어쓰기하면 어떻게 될까요
    out = 아버지가 방에 들어가신다.
    • sketch_dynamic.py
    * converts sketch.py to use tf.nn.dynamic_rnn()
    * easier to use (a rough wiring sketch follows this list)

    • count_one.py
    * counts the number of 1s
    
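roughly, the tf.nn.dynamic_rnn() wiring used by sketch_dynamic.py looks like the sketch below; variable names and sizes are illustrative, not the ones in the repo.

import tensorflow as tf  # TensorFlow 1.x API

n_steps, n_input, n_hidden, n_classes = 30, 100, 8, 2   # illustrative sizes
x = tf.placeholder(tf.float32, [None, n_steps, n_input])
seq_len = tf.placeholder(tf.int32, [None])               # real length of each sentence
cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
# dynamic_rnn handles variable-length sequences via sequence_length,
# so no manual unrolling over n_steps is needed
outputs, _ = tf.nn.dynamic_rnn(cell, x, sequence_length=seq_len, dtype=tf.float32)
# project every time step's hidden state to the two tag classes (space / no space)
W = tf.Variable(tf.truncated_normal([n_hidden, n_classes], stddev=0.1))
b = tf.Variable(tf.zeros([n_classes]))
logits = tf.reshape(tf.matmul(tf.reshape(outputs, [-1, n_hidden]), W) + b,
                    [-1, n_steps, n_classes])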
  • how to handle variable-length input

let's try a sliding-window method together with early stop.

n_steps = 30

- training
  if 1 <= len(sentence) < n_steps : pad with '\t'
  if len(sentence) > n_steps : move the batch pointer to the next window (sliding window)

- inference
  if 1 <= len(sentence) < n_steps : pad with '\t'
  if len(sentence) > n_steps :
    move the batch pointer to the next window (sliding window)
    merge the per-window results into one array
    decode the tags back into spaces
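a rough sketch of that padding / sliding-window logic (illustrative only; the real code in train.py/inference.py may use a different stride or padding scheme):

n_steps = 30
PAD = '\t'

def windows(sentence):
    # sentences shorter than n_steps are padded with '\t';
    # longer ones are cut into successive n_steps-sized windows
    # (assumed non-overlapping here) whose predictions are merged afterwards
    if len(sentence) < n_steps:
        yield sentence + PAD * (n_steps - len(sentence))
    else:
        for start in range(0, len(sentence), n_steps):
            chunk = sentence[start:start + n_steps]
            yield chunk + PAD * (n_steps - len(chunk))

# inference: run the model on each window, concatenate the predicted tags,
# truncate to len(sentence), then decode the tags back into spaces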
  • train and inference
$ python train.py --train=train.txt --validation=validation.txt --model=model --iters=100

$ python inference.py --model=model < test.txt
...
model restored from model/segm.ckpt
out = 이것을 띄어 쓰기하면 어 떻게 될까요.
out = 아버지가 방에 들어 가신다.
out = SK이노베이션, GS, S-Oil, 대림산업, 현대중공업 등 대규모 적자를 내던
out = 기업들이 극한 구조조정을 통해 흑자로 전환하거나
out = 적자폭을 축소한 것이영 업이익 개선을 이끈 것으로 풀이된다.


$ python train.py --train=big.txt --validation=validation.txt --model=model --iters=100

$ python inference.py --model=model < test.txt
out = 이것을 띄어쓰기하면 어떻게 될 까요.
out = 아버지가 방에 들어 가 신다.
out = SK이노베이션, GS, S-Oil,대림산업, 현대 중공업등대규모적자를 내던
out = 기업들이 극한 구조조정 을 통해 흑자로 전환하거나
out = 적자폭을 축소한 것이 영업이 익개선을 이 끈것으로 풀이 된 다.

# it seems that the training data is not enough...
  • character-based word2vec
# word2vec : https://github.com/tensorflow/models/tree/master/tutorials/embedding

$ git submodule update --init
# generate 'word2vec_ops.so' as described in models/tutorials/embedding/README.md

# preprocess into character-level tokens
$ python tochar.py < bigbig.txt > bigbig.txt.char
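a hypothetical minimal version of this character-level preprocessing (the real tochar.py in the repo may use a different token format) could look like:

# tochar_sketch.py -- illustrative only
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # emit every non-space character as its own token so word2vec
    # learns character-level embeddings
    print(' '.join(ch for ch in line if not ch.isspace()))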

# train word2vec
$ mkdir emb
$ python models/tutorials/embedding/word2vec_optimized.py --train_data=bigbig.txt.char --eval_data=questions-words.txt --embedding_size=200 --save_path=emb

# test word2vec
$ cd segm-lstm
$ python test_word2vec.py --embedding_size=200 --model_path=emb
...
가
=====================================
가                  1.0000
감                  0.9716
알                  0.9695
니                  0.9681
기                  0.9680
런                  0.9659
쥬                  0.9640
...

# you can dump the embeddings by using embedding_dump() in test_word2vec.py
$ python test_word2vec.py --embedding_size=200 --model_path=emb --embedding_dump=1
# now you have embeddings data in emb/embedding.pickle
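a hedged sketch of reading that dump, assuming embedding.pickle stores a {character: vector} mapping (the structure actually written by embedding_dump() may differ):

import pickle
import numpy as np

with open('emb/embedding.pickle', 'rb') as f:
    emb = pickle.load(f)   # assumed: dict mapping each character to a 200-dim vector

def nearest(ch, k=5):
    # rank all other characters by cosine similarity to ch
    v = np.asarray(emb[ch], dtype=np.float32)
    v /= np.linalg.norm(v)
    sims = []
    for c, e in emb.items():
        if c == ch:
            continue
        e = np.asarray(e, dtype=np.float32)
        sims.append((c, float(np.dot(v, e / np.linalg.norm(e)))))
    return sorted(sims, key=lambda p: -p[1])[:k]

print(nearest('가'))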
  • train and inference with character embedding
$ python train_emb.py --train=big.txt --validation=validation.txt --embedding=emb --model=model_emb --iters=100

$ python inference_emb.py -e emb -m model_emb < test.txt
out = 이것을 띄어쓰기하면 어떻게 될 까요.
out = 아버지가 방에 들어가 신다.
out = SK이노베이션, GS, S-Oil, 대림산업, 현대중공업등대규모적자를 내던
out = 기업들이 극한 구조조정을 통해 흑자로 전환하거나
out = 적자폭을 축소한 것 이 영업이익개선을 이 끈것으로 풀이된 다.

# prepare bigbig.txt(53548 news articles)
$ python train_emb.py --train=bigbig.txt --validation=validation.txt --embedding=emb --model=model_emb --iters=100
...
53545 th sentence ... done
53546 th sentence ... done
53547 th sentence ... done
seq : 2,validation cost : 7.31046978633,validation accuracy : 0.905555615822
save model(final)
end of training

# it takes about 3 days ;;

$ python inference_emb.py -e emb -m model_emb < test.txt
out = 이것을 띄어쓰기하면 어떻게 될 까요.
out = 아버지가 방에 들어가 신다.
out = SK 이 노베이션, GS, S-Oil, 대림산업, 현대중공업등대규모적자를 내던
out = 기업들이 극한 구조조정을 통해 흑자로 전환하거나
out = 적자폭을 축소한 것이 영업이 익개선을 이 끈것으로 풀이 된다.

$ python inference_emb.py -e emb -m model_emb
유치원음악회가열리는날입니다.
out = 유치원음악회가 열리는 날 입니다.
친구들은커서무엇이되고싶습니까
out = 친구들은 커서 무엇이 되고 싶습니까
  • development note
- training speed is very slow despite using a GPU.
  how can we make it faster?
  - increase batch_size
    we need some tricky code that streams the file and generates batches using `yield` (see the sketch after this list)
  - increase the number of threads
  - use distributed training
- tuning points
  - a model trained on a news corpus seems to be weak on colloquial expressions, so we need to prepare a colloquial corpus from somewhere.
    - ex) '날이에요', '싶나요', '해요'
  - iterations
  - hidden layer dimension
  - embedding dimension
- when train_emb.py is running, it is not possible to run train.py at the same time.
  we need to figure out why.
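one way the `yield`-based batching mentioned above could look (illustrative only):

def sentence_batches(path, batch_size):
    # stream the corpus from disk and yield fixed-size batches of sentences,
    # so the whole file never has to be loaded into memory
    batch = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch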