cvikasreddy / skt

Licence: other
Sanskrit compound segmentation using seq2seq model


Code for the paper titled 'Building a Word Segmenter for Sanskrit Overnight'

Instructions

Pre-requisites

Python is required, along with at least the TensorFlow and sentencepiece packages (both are used throughout the instructions below).

File organization

  • Data is located in data/.
  • Logs generated by the TensorFlow SummaryWriter are stored in logs/.
  • Trained models are stored in models/. Before training, make sure the logs/ and models/ folders exist.

Training

Run train.py to train the model and test.py to evaluate it.
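A typical session, assuming the layout above (the guards around the two scripts are only there so the snippet is safe to copy-paste from outside the repository checkout):

```shell
# Create the folders that training expects before the first run.
mkdir -p logs models

# Train, then evaluate the saved model.
if [ -f train.py ]; then python train.py; fi
if [ -f test.py ]; then python test.py; fi
```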

Data

The data/ folder contains all the data used for the segmentation task.

All the .txt files are already tokenized with sentencepiece. The m.vocab and m.model files are the ones generated by sentencepiece; they can be used to tokenize any other data with the same vocabulary.

All .txt files contain one entry per line, separated by newlines (\n).

Training data:

  • dcs_data_input_train_sent.txt contains the input sentences used for training.
  • dcs_data_output_train_sent.txt contains the output words (the segmented forms of the inputs) used for training.

Testing data:

  • dcs_data_input_test_sent.txt contains the input sentences used for testing.
  • dcs_data_output_test_sent.txt contains the output words (the segmented forms of the inputs) used for testing.
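The input and output files line up line by line. A minimal sketch of reading such a parallel pair, using toy stand-in files rather than the real dcs_data_*_train_sent.txt files:

```python
# Toy stand-ins for the parallel dcs_data_input/output files: line N of the
# input file pairs with line N of the output file.
from pathlib import Path

Path("toy_input.txt").write_text("unsegmentedone\nunsegmentedtwo\n")
Path("toy_output.txt").write_text("un segmented one\nun segmented two\n")

with open("toy_input.txt") as f_in, open("toy_output.txt") as f_out:
    pairs = list(zip(f_in.read().splitlines(), f_out.read().splitlines()))

# Each element is (unsegmented sentence, its segmented form).
print(pairs[0])
```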

Once train.py is run, it creates several auxiliary files (word2id, id2word, and so on); more details are provided in utils/data_loader.py.
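As an illustration of what such vocabulary files contain, here is a minimal word2id/id2word sketch; the names mirror the files mentioned above, but the actual logic lives in utils/data_loader.py and may differ:

```python
# Build token <-> id mappings from a stream of sentencepiece pieces
# (toy pieces here; in the project these come from the tokenized data).
tokens = ["▁ra", "ma", "▁si", "ta", "▁ra"]

word2id = {}
for tok in tokens:
    if tok not in word2id:          # first occurrence fixes the id
        word2id[tok] = len(word2id)
id2word = {i: w for w, i in word2id.items()}  # inverse mapping
```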

Testing on other data

To test on your own data, create a file with the sentences to be segmented, one sentence per line, and tokenize it with the unsupervised segmenter (sentencepiece) using the m.vocab and m.model files in the utils/ folder. Also create a file with the ground-truth outputs, tokenized with sentencepiece in the same way. If you only want the model's output and do not need a comparison against ground truth, use a placeholder file with the same number of (blank) lines as the input file.

Then edit test.py to point at those files and run it.
