jiesutd / Richwordsegmentor

Neural word segmentation with rich pretraining, code for ACL 2017 paper

Projects that are alternatives to or similar to Richwordsegmentor

Deep Learning Based Ecg Annotator
Annotation of ECG signals using deep learning, TensorFlow/Keras
Stars: ✭ 110 (-21.43%)
Mutual labels:  lstm, segmentation
Bcdu Net
BCDU-Net : Medical Image Segmentation
Stars: ✭ 314 (+124.29%)
Mutual labels:  lstm, segmentation
Canet
The code for paper "CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning"
Stars: ✭ 135 (-3.57%)
Mutual labels:  segmentation
Ethnicolr
Predict Race and Ethnicity Based on the Sequence of Characters in a Name
Stars: ✭ 137 (-2.14%)
Mutual labels:  lstm
Vpilot
Scripts and tools to easily communicate with DeepGTAV. In the future a self-driving agent will be implemented.
Stars: ✭ 136 (-2.86%)
Mutual labels:  lstm
Easyocr
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Stars: ✭ 13,379 (+9456.43%)
Mutual labels:  lstm
Dmm net
Differentiable Mask-Matching Network for Video Object Segmentation (ICCV 2019)
Stars: ✭ 138 (-1.43%)
Mutual labels:  segmentation
Dilation Tensorflow
A native Tensorflow implementation of semantic segmentation according to Multi-Scale Context Aggregation by Dilated Convolutions (2016). Optionally uses the pretrained weights by the authors.
Stars: ✭ 134 (-4.29%)
Mutual labels:  segmentation
Deep Learning For Tracking And Detection
Collection of papers, datasets, code and other resources for object tracking and detection using deep learning
Stars: ✭ 1,920 (+1271.43%)
Mutual labels:  segmentation
Ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Stars: ✭ 1,767 (+1162.14%)
Mutual labels:  lstm
Document Classifier Lstm
A bidirectional LSTM with attention for multiclass/multilabel text classification.
Stars: ✭ 136 (-2.86%)
Mutual labels:  lstm
Deeplearningfornlpinpytorch
An IPython Notebook tutorial on deep learning for natural language processing, including structure prediction.
Stars: ✭ 1,744 (+1145.71%)
Mutual labels:  lstm
Kiu Net Pytorch
Official Pytorch Code of KiU-Net for Image Segmentation - MICCAI 2020 (Oral)
Stars: ✭ 134 (-4.29%)
Mutual labels:  segmentation
Question Pairs Matching
12th-place solution for the 3rd Mojing Cup competition on question similarity for intelligent customer service
Stars: ✭ 138 (-1.43%)
Mutual labels:  lstm
Handwriting Synthesis
Implementation of "Generating Sequences With Recurrent Neural Networks" https://arxiv.org/abs/1308.0850
Stars: ✭ 135 (-3.57%)
Mutual labels:  lstm
Lung Segmentation 2d
Lung fields segmentation on CXR images using convolutional neural networks.
Stars: ✭ 138 (-1.43%)
Mutual labels:  segmentation
Lstm Crf
A (CNN+)RNN(LSTM/BiLSTM)+CRF model for sequence labelling.😏
Stars: ✭ 134 (-4.29%)
Mutual labels:  lstm
Lstm Crypto Price Prediction
Predicting price trends in cryptomarkets using an lstm-RNN for the use of a trading bot
Stars: ✭ 136 (-2.86%)
Mutual labels:  lstm
Morfessor
Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
Stars: ✭ 137 (-2.14%)
Mutual labels:  segmentation
Actionrecognition
Explore Action Recognition
Stars: ✭ 139 (-0.71%)
Mutual labels:  lstm

RichWordSegmentor

RichWordSegmentor is a package for word segmentation using transition-based neural networks, built on the LibN3L package. It is a state-of-the-art neural word segmenter that supports rich pretraining from external data. With the help of rich pretraining, our model achieves the best result on 5 out of 6 Chinese word segmentation benchmarks. Performance details and the model structure can be found in our ACL paper: Neural Word Segmentation with Rich Pretraining.

Demo system:

  • Download the LibN3L library and configure your system. Please refer to Here
  • Open CMakeLists.txt and change "../LibN3L/" to the directory of your LibN3L package.
  • Run the demo script: sh demo.sh (the demo script does not load pretrained char/bichar embeddings; a sketch of the full setup follows below.)
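
A minimal sketch of these three steps, assuming LibN3L is cloned from https://github.com/SUTDNLP/LibN3L into a sibling directory (the clone location and the sed command are illustrative, not prescribed):

    # Step 1: fetch LibN3L and configure it per its own instructions.
    git clone https://github.com/SUTDNLP/LibN3L.git ../LibN3L
    # Step 2: point CMakeLists.txt at your LibN3L directory if it lives elsewhere,
    # e.g. replace the default "../LibN3L/" with your own path:
    sed -i 's|\.\./LibN3L/|/path/to/LibN3L/|' CMakeLists.txt
    # Step 3: run the demo (no pretrained char/bichar embeddings are loaded).
    sh demo.sh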

The demo system includes Chinese word segmentation sample data ("train.debug", "dev.debug" and "test.debug"), a Chinese word embedding sample file ("ctb.50d.word.debug"), Chinese character and character-bigram pretrained embedding sample files ("char.emb" and "bichar.emb"), and a parameter setting file ("option.STD"). All of these files are located in the folder RichWordSegmentor/example.

Run:

cmake .
make

Training model:
./STDSeg -l -train ${train.data} -dev ${dev.data} -test ${test.data} -option ${option.file} -model ${save_model_to_file} -word ${pretrain_word_emb, optional} -char ${pretrain_char_emb, optional} -bichar ${pretrain_bichar_emb, optional} -numlayer ${pretrain_parameters, optional}
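
For example, a training run on the bundled sample data might look like this, assuming you run from the repository root (the output model name demo.model is an arbitrary choice; every input file ships in RichWordSegmentor/example):

    ./STDSeg -l -train example/train.debug -dev example/dev.debug -test example/test.debug \
        -option example/option.STD -model demo.model \
        -word example/ctb.50d.word.debug -char example/char.emb -bichar example/bichar.emb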

Load model:
./STDSeg -test ${test.data} -model ${load_model_file} -output ${output_file}
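
For example, assuming a model saved as demo.model and a raw input file raw.txt (both names hypothetical; the input format is described under Input below):

    ./STDSeg -test raw.txt -model demo.model -output raw.seg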

Input:

  1. To evaluate model performance, words are separated by a space, one sentence per line. For example:

    就 做 了 一点 微小 的 工作 , 谢谢 大家 。
    一个人 的 命运 啊 , 当然 要 靠 自我 奋斗 , 但是 也要 考虑 到 历史 的 行程 。

    P/R/F scores will be calculated automatically for such input.

  2. For raw text decoding, one sentence per line, without spaces (see the snippet after these examples):

    就做了一点微小的工作,谢谢大家。
    一个人的命运啊,当然要靠自我奋斗,但是也要考虑到历史的行程。
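
Such a raw input file can be produced trivially from the shell (raw.txt is a hypothetical name):

    printf '就做了一点微小的工作,谢谢大家。\n一个人的命运啊,当然要靠自我奋斗,但是也要考虑到历史的行程。\n' > raw.txt
    # raw.txt can now be passed to ./STDSeg via the -test flag.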

Output:

The same format as the training data: words separated by a space, one sentence per line.

就 做 了 一点 微小 的 工作 , 谢谢 大家 。
一个人 的 命运 啊 , 当然 要 靠 自我 奋斗 , 但是 也要 考虑 到 历史 的 行程 。

Trained models, embeddings, and parameters for rich pretraining and the baseline:

We share our trained models at BaiduPan (https://pan.baidu.com/s/1pLO6T9D) so that visitors can reproduce our results.

  1. File ctb.bilstm.joint4.model: the model trained on the CTB6.0 corpus using multitask pretraining. You can simply load this file to decode raw text without training. Run:

    ./STDSeg -test ${input_raw_text} -model ctb.bilstm.joint4.model -output ${output_segmented_text}

  2. Files joint4.all.b10c1.2h.iter17.mchar, .mbichar, and .pmodel are the pretrained character embeddings, character bigram embeddings, and representation parameters, respectively. If you want to train your own model, you can load these three files following the instructions above (see the sketch after this list).

  3. Files gigaword_chn.all.a2b.uni.ite50.vec, gigaword_chn.all.a2b.bi.ite50.vec, and ctb.50d.vec are the char, bichar, and word embeddings of our baseline, respectively.

  4. If you want to run the rich pretraining experiments yourself (to generate the pretrained files above), please refer to TrainEmbMultiTask.
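
A sketch of reusing these pretrained files for training, assuming the .mchar and .mbichar files feed the -char and -bichar flags, and the .pmodel file is the pretrained-parameter argument of -numlayer in the training command above (this flag mapping is our reading of that command, not separately documented):

    ./STDSeg -l -train ${train.data} -dev ${dev.data} -test ${test.data} \
        -option example/option.STD -model my.model \
        -char joint4.all.b10c1.2h.iter17.mchar \
        -bichar joint4.all.b10c1.2h.iter17.mbichar \
        -numlayer joint4.all.b10c1.2h.iter17.pmodel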

Monitoring information

While this segmentation system is running, it may print log information like the following:

Iter 13 finished. Total time taken is: 1260.37s
dev:
Recall: P=57508/59929=0.959602, Accuracy: P=57508/59723=0.962912, Fmeasure: 0.961254
Decode dev finished. Total time taken is: 96.299s
test:
Recall: P=77895/81579=0.954841, Accuracy: P=77895/81159=0.959783, Fmeasure: 0.957306
Decode test finished. Total time taken is: 128.9s
Exceeds best previous performance of 0.960922. Saving model file..

The first "Recall..." line shows the performance of the dev set and the second "Recall..." line shows you the performance of the test set.

Note:

  • The current version is only compatible with LibN3L versions after Dec. 10th, 2015, which contain the model saving and loading module.
  • The example files are only for verifying that the code runs. For copyright reasons, we include only a few hundred sentences as examples. Hence, results on these example datasets do not represent the real performance on large datasets.

Cite:

@InProceedings{yang-zhang-dong:2017:Long,
  author    = {Yang, Jie  and  Zhang, Yue  and  Dong, Fei},
  title     = {Neural Word Segmentation with Rich Pretraining},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {839--849},
  url       = {http://aclweb.org/anthology/P17-1078}
}

Update

  • 2017-April-4: initial version