
sviperm / neuro-comma

License: MIT
🇷🇺 Punctuation restoration production-ready model for Russian language 🇷🇺

Programming Languages

  • Python
  • Jupyter Notebook
  • Shell
  • Dockerfile

Projects that are alternatives of or similar to neuro-comma

anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+84.78%)
Mutual labels:  ner, bert
parsbert-ner
🤗 ParsBERT Persian NER Tasks
Stars: ✭ 15 (-67.39%)
Mutual labels:  ner, bert
tensorflow-ml-nlp-tf2
Practice materials for "Natural Language Processing with TensorFlow 2 and Machine Learning (from Logistic Regression to BERT and GPT-3)"
Stars: ✭ 245 (+432.61%)
Mutual labels:  ner, bert
datagrand bert
5th-place code from the 2019 DataGrand Cup information extraction competition
Stars: ✭ 20 (-56.52%)
Mutual labels:  ner, bert
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+4758.7%)
Mutual labels:  ner, bert
viewpoint-mining
BERT-based opinion mining and sentiment analysis of e-commerce reviews, modeled on NER
Stars: ✭ 31 (-32.61%)
Mutual labels:  ner, bert
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (+19.57%)
Mutual labels:  ner, bert
Natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
Stars: ✭ 788 (+1613.04%)
Mutual labels:  russian, ner
Bert Bilstm Crf Ner
TensorFlow solution for the NER task using a BiLSTM-CRF model with Google BERT fine-tuning and private server services
Stars: ✭ 3,838 (+8243.48%)
Mutual labels:  ner, bert
keras-bert-ner
Keras solution of Chinese NER task using BiLSTM-CRF/BiGRU-CRF/IDCNN-CRF model with Pretrained Language Model: supporting BERT/RoBERTa/ALBERT
Stars: ✭ 7 (-84.78%)
Mutual labels:  ner, bert
nerus
Large silver-standard Russian corpus with NER, morphology and syntax markup
Stars: ✭ 47 (+2.17%)
Mutual labels:  russian, ner
ChineseNER
All about Chinese NER
Stars: ✭ 241 (+423.91%)
Mutual labels:  ner, bert
vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Stars: ✭ 22 (-52.17%)
Mutual labels:  bert
OpenUE
OpenUE is a lightweight knowledge graph extraction toolkit (An Open Toolkit for Universal Extraction from Text, published at EMNLP 2020: https://aclanthology.org/2020.emnlp-demos.1.pdf)
Stars: ✭ 274 (+495.65%)
Mutual labels:  bert
NEMO
Neural Modeling for Named Entities and Morphology (Hebrew NER)
Stars: ✭ 25 (-45.65%)
Mutual labels:  ner
TradeTheEvent
Implementation of "Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading." In Findings of ACL2021
Stars: ✭ 64 (+39.13%)
Mutual labels:  bert
fias
Ruby wrapper for the Russian FIAS database (Federal Information Address System)
Stars: ✭ 82 (+78.26%)
Mutual labels:  russian
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+447.83%)
Mutual labels:  bert
pn-summary
A well-structured summarization dataset for the Persian language!
Stars: ✭ 29 (-36.96%)
Mutual labels:  bert
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+52.17%)
Mutual labels:  bert

Neuro-comma

This library was developed to help us create punctuation restoration models and to keep track of trained parameters, data, training visualizations, etc. The library deliberately avoids high-level frameworks such as PyTorch Lightning or Keras, in order to lower the entry threshold.

Feel free to fork this repo and edit the model or dataset classes for your purposes.

Prerequisites

  • Python 3.9 for training
  • Docker for production

Why is the development environment Python 3.9 when the production environment in the Dockerfile is Python 3.8?

  • Our team always uses the latest version and features of Python. We started with Python 3.9, but realized that there was no FastAPI image for Python 3.9. There are several PRs in the image repositories, but no response from the maintainers, so we changed the code we use in production to work with Python 3.8. A few functions still contain 3.9-only code, but they are needed only for development purposes.

Installation

  • Option 1:
    pip install -U pip wheel setuptools
    pip install -r requirements.txt
  • Option 2:
    sh scripts/installation.sh

Python module usage

Production usage

  • Choose model from releases section
  • Checkout to release tag!
  • Download and unzip model
  • Run docker-compose
    docker-compose up -d
  • Stop container
    docker-compose down
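
Once the container is running, you can query the service over HTTP. The sketch below is hypothetical: the port, route, and payload/response shapes are assumptions for illustration, not taken from the repository, so check the service code for the real API.

# Hypothetical client for the running container; the port (8000), the
# /restore route, and the JSON payload/response shapes are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/restore",
    json={"text": "привет как дела что нового"},
)
resp.raise_for_status()
print(resp.json())  # expected: the text with punctuation restored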

Model training

Model training from scratch:

python src/train.py \
    --model-name repunct-model \
    --pretrained-model DeepPavlov/rubert-base-cased-sentence \
    --targets O COMMA PERIOD \
    --train-data data/repunct/train \
    --val-data data/repunct/test \
    --test-data data/repunct/test \
    --store-best-weights \
    --epoch 7 \
    --batch-size 4 \
    --augment-rate 0.15 \
    --labml \
    --seed 1 \
    --cuda 

Fine-tuning an already-trained model: add the --fine-tune argument, which loads the parameters from repunct-model and applies them to the training function. A new subdirectory named {model-name}_ft is created in the models/ directory; the source model is left untouched.

python src/train.py \
    --model-name repunct-model \
    --fine-tune \
    --targets O COMMA PERIOD \
    --train-data data/repunct/train \
    --val-data data/repunct/test \
    --test-data data/repunct/test \
    --store-best-weights \
    --epoch 3 \
    --batch-size 4 \
    --labml \
    --seed 1 \
    --cuda 

In some cases you may want to resume training (the computer crashed, the power blinked, etc.). Adding the --resume argument resumes training from the last model checkpoint (saved weights).

python src/train.py \
    --model-name repunct-model \
    --resume \
    --pretrained-model DeepPavlov/rubert-base-cased-sentence \
    --targets O COMMA PERIOD \
    --train-data data/repunct/train \
    --val-data data/repunct/test \
    --test-data data/repunct/test \
    --store-best-weights \
    --epoch 4 \
    --batch-size 4 \
    --augment-rate 0.15 \
    --labml \
    --seed 1 \
    --cuda 

More examples here

How it works

Before raw text is fed into the model, it must be tokenized. The library handles this with BaseDataset.parse_tokens.
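
As a standalone illustration (this is not the library's own code), the Hugging Face tokenizer for the pretrained checkpoint used in the training commands above shows what this step has to handle: a word may split into several subword tokens, and each word's punctuation label has to be aligned with its subtokens.

# Standalone illustration of the tokenization step, not the library's code.
# Uses the same pretrained checkpoint as the training commands above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased-sentence")

words = ["привет", "как", "дела"]    # raw, unpunctuated words
labels = ["COMMA", "O", "PERIOD"]    # one target label per word

# A word may split into several subword tokens; to map per-token predictions
# back to words, the label is typically kept on the first subtoken.
for word, label in zip(words, labels):
    print(word, tokenizer.tokenize(word), label)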

The model architecture is simple and straightforward:

Model architecture
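
As a rough reconstruction (layer sizes and structure are illustrative; the repository's actual module may differ), the architecture amounts to a pretrained BERT encoder with a per-token classification head over the three targets:

# Simplified reconstruction of a punctuation-restoration model: a pretrained
# BERT encoder plus a linear per-token head. Illustrative, not the repo's code.
import torch.nn as nn
from transformers import AutoModel

class PunctuationModel(nn.Module):
    def __init__(self, pretrained="DeepPavlov/rubert-base-cased-sentence", n_targets=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, n_targets)  # O / COMMA / PERIOD

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.last_hidden_state)  # one logit vector per subword token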

Credits

Our article on habr.ru

This repository contains code (which was edited for production purposes) from xashru/punctuation-restoration.

Special thanks to @akvarats
