
ollie283 / Language Models

Build unigram and bigram language models, implement Laplace smoothing and use the models to compute the perplexity of test corpora.

Programming Languages

python

Projects that are alternatives of or similar to Language Models

Jieba Php
"結巴"中文分詞:做最好的 PHP 中文分詞、中文斷詞組件。 / "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.
Stars: ✭ 1,073 (+1718.64%)
Mutual labels:  natural-language-processing
Quaterniontransformers
Repository for ACL 2019 paper
Stars: ✭ 56 (-5.08%)
Mutual labels:  natural-language-processing
Char Rnn Tensorflow
Multi-layer Recurrent Neural Networks for character-level language models implemented in TensorFlow
Stars: ✭ 58 (-1.69%)
Mutual labels:  natural-language-processing
Emotion Detector
A python code to detect emotions from text
Stars: ✭ 54 (-8.47%)
Mutual labels:  natural-language-processing
Research papers
Records some papers I have read and notes I have taken, along with some awesome-papers reading lists and academic blog posts.
Stars: ✭ 55 (-6.78%)
Mutual labels:  natural-language-processing
Joint Lstm Parser
Transition-based joint syntactic dependency parser and semantic role labeler using a stack LSTM RNN architecture.
Stars: ✭ 57 (-3.39%)
Mutual labels:  natural-language-processing
Nltk Book Resource
Notes and solutions to complement the official NLTK book
Stars: ✭ 54 (-8.47%)
Mutual labels:  natural-language-processing
Bidaf Keras
Bidirectional Attention Flow for Machine Comprehension implemented in Keras 2
Stars: ✭ 60 (+1.69%)
Mutual labels:  natural-language-processing
Hmtl
🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP
Stars: ✭ 1,084 (+1737.29%)
Mutual labels:  natural-language-processing
Teapot Nlp
Tool for Evaluating Adversarial Perturbations on Text
Stars: ✭ 58 (-1.69%)
Mutual labels:  natural-language-processing
Vietnamese Electra
Electra pre-trained model using Vietnamese corpus
Stars: ✭ 55 (-6.78%)
Mutual labels:  natural-language-processing
Demos
Some JavaScript works published as demos, mostly ML or DS
Stars: ✭ 55 (-6.78%)
Mutual labels:  natural-language-processing
Mindspore Nlp Tutorial
Natural Language Processing Tutorial for MindSpore Users
Stars: ✭ 58 (-1.69%)
Mutual labels:  natural-language-processing
Scdv
Text classification with Sparse Composite Document Vectors.
Stars: ✭ 54 (-8.47%)
Mutual labels:  natural-language-processing
Nlg Rl
Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction
Stars: ✭ 59 (+0%)
Mutual labels:  natural-language-processing
Market Reporter
Automatic Generation of Brief Summaries of Time-Series Data
Stars: ✭ 54 (-8.47%)
Mutual labels:  natural-language-processing
Li emnlp 2017
Deep Recurrent Generative Decoder for Abstractive Text Summarization in DyNet
Stars: ✭ 56 (-5.08%)
Mutual labels:  natural-language-processing
Textblob Ar
Arabic support for textblob
Stars: ✭ 60 (+1.69%)
Mutual labels:  natural-language-processing
Botsharp
The Open Source AI Chatbot Platform Builder in 100% C# Running in .NET Core with Machine Learning algorithm.
Stars: ✭ 1,103 (+1769.49%)
Mutual labels:  natural-language-processing
Comet
A Neural Framework for MT Evaluation
Stars: ✭ 58 (-1.69%)
Mutual labels:  natural-language-processing

Language Models and Smoothing

There are two datasets.

Toy dataset: The files sampledata.txt, sampledata.vocab.txt, and sampletest.txt comprise a small toy dataset. sampledata.txt is the training corpus and contains the following:

<s> a a b b c c </s>
<s> a c b c </s>
<s> b c c a b </s>

Treat each line as a sentence. <s> is the start-of-sentence symbol and </s> is the end-of-sentence symbol. To keep the toy dataset simple, each of the characters a-z is considered a word, i.e. the first sentence has 8 tokens, the second has 6, and the last has 7.

The file sampledata.vocab.txt contains the vocabulary of the training data. It lists the 3 word types for the toy dataset:

a 
b 
c

sampletest.txt is the test corpus.

Actual data: The files train.txt, train.vocab.txt, and test.txt form a larger, more realistic dataset. These files have been pre-processed to remove punctuation, and all words have been converted to lower case. An example sentence in the train or test file has the following form:

<s> the anglo-saxons called april oster-monath or eostur-monath </s>

Again, every space-separated token is a word; the above sentence has 9 tokens. The file train.vocab.txt contains the vocabulary (word types) of the training data.

Important: Note that <s> and </s> are not included in the vocabulary files. The term UNK is used to indicate words that have not appeared in the training data. UNK is also not included in the vocabulary files, but you will need to add it to the vocabulary while doing computations. When computing the probability of a test sentence, any word not seen in the training data should be treated as the UNK token.

Important: You do not need to do any further preprocessing of the data. Simply splitting each line on spaces gives you the tokens of that sentence.
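
As a concrete illustration, here is a minimal sketch of this preprocessing; the function names (read_vocab, read_corpus) and the choice to leave <s> and </s> untouched rather than mapping them to UNK are illustrative assumptions, not requirements of the assignment.

```python
def read_vocab(path):
    # One word type per line; <s>, </s>, and UNK are handled separately.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def read_corpus(path, vocab):
    # Return a list of sentences, each a list of tokens, with
    # out-of-vocabulary words replaced by the UNK symbol.
    sentences = []
    with open(path) as f:
        for line in f:
            tokens = line.split()  # splitting on whitespace is all that is needed
            sentences.append([t if t in vocab or t in ("<s>", "</s>") else "UNK"
                              for t in tokens])
    return sentences
```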

Implementation of the models

a) Write a function to compute unigram unsmoothed and smoothed models. Print out the unigram probabilities computed by each model for the Toy dataset.
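
For example, part (a) could look roughly like the sketch below. It assumes <s> is excluded from the unigram counts and that UNK and </s> are added to the vocabulary before Laplace smoothing; both are conventions you may reasonably choose differently.

```python
from collections import Counter


def unigram_models(sentences, vocab):
    # Returns (unsmoothed, laplace) dicts mapping word -> P(word).
    counts = Counter(t for sent in sentences for t in sent if t != "<s>")
    total = sum(counts.values())
    types = set(vocab) | {"UNK", "</s>"}  # vocabulary used for smoothing
    V = len(types)
    unsmoothed = {w: counts[w] / total for w in types}
    laplace = {w: (counts[w] + 1) / (total + V) for w in types}
    return unsmoothed, laplace
```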

b) Write a function to compute bigram unsmoothed and smoothed models. Print out the bigram probabilities computed by each model for the Toy dataset.
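
A corresponding sketch for part (b) is shown below. The rows (contexts) and columns (continuations) mirror the bigram table in the output format further down; treating a zero-count context as probability 0.0 in the unsmoothed model is an assumption.

```python
from collections import Counter


def bigram_models(sentences, vocab):
    # Returns (unsmoothed, laplace) dicts mapping (w1, w2) -> P(w2 | w1).
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigram_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    continuations = set(vocab) | {"UNK", "</s>"}  # possible second words
    contexts = set(vocab) | {"UNK", "<s>"}        # possible first words
    V = len(continuations)
    unsmoothed, laplace = {}, {}
    for w1 in contexts:
        for w2 in continuations:
            c, n = bigram_counts[(w1, w2)], context_counts[w1]
            unsmoothed[(w1, w2)] = c / n if n else 0.0
            laplace[(w1, w2)] = (c + 1) / (n + V)
    return unsmoothed, laplace
```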

c) Write a function to compute sentence probabilities under a language model. Print out the probabilities of sentences in Toy dataset using the smoothed unigram and bigram models.
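
For part (c), the probability of a sentence is the product of the per-token model probabilities; working in log space avoids numerical underflow on longer sentences. A sketch assuming the smoothed models from the functions above:

```python
import math


def sentence_logprob_unigram(sent, unigram):
    # Sum of log P(w) over the sentence, ignoring the <s> symbol.
    return sum(math.log(unigram[t]) for t in sent if t != "<s>")


def sentence_logprob_bigram(sent, bigram):
    # Sum of log P(w2 | w1) over consecutive token pairs.
    return sum(math.log(bigram[(w1, w2)]) for w1, w2 in zip(sent, sent[1:]))
```

Applying math.exp to the returned value gives the probability itself for the sentence-probability table.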

d) Write a function to return the perplexity of a test corpus given a particular language model. Print out the perplexities computed for sampletest.txt using a smoothed unigram model and a smoothed bigram model.
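
For part (d), a common definition of perplexity is PP = exp(-(1/N) * sum of log probabilities), where N is the number of tokens scored; whether <s> and </s> count towards N is a convention you should fix and state in your own solution. A minimal sketch:

```python
import math


def perplexity(sentences, sentence_logprob, tokens_scored):
    # sentence_logprob: returns the log probability of one sentence
    # tokens_scored:    returns how many tokens that sentence contributes to N
    total_logprob = sum(sentence_logprob(s) for s in sentences)
    N = sum(tokens_scored(s) for s in sentences)
    return math.exp(-total_logprob / N)
```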

Run on large corpus

Now use the Actual dataset. Train smoothed unigram and bigram models on train.txt, then print out the perplexity under each model for the following (a small driver sketch follows the list):

a) train.txt i.e. the same corpus you used to train the model.

b) test.txt
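
Putting the pieces together for the Actual dataset, a driver along the following lines would print the required perplexities; the file names come from the assignment, everything else (function names, the decision to score every token except <s>) is illustrative.

```python
vocab = read_vocab("train.vocab.txt")
train = read_corpus("train.txt", vocab)
test = read_corpus("test.txt", vocab)

_, uni = unigram_models(train, vocab)  # keep only the smoothed models
_, bi = bigram_models(train, vocab)

for name, corpus in [("train.txt", train), ("test.txt", test)]:
    pp_uni = perplexity(corpus, lambda s: sentence_logprob_unigram(s, uni),
                        lambda s: len(s) - 1)  # every token except <s>
    pp_bi = perplexity(corpus, lambda s: sentence_logprob_bigram(s, bi),
                       lambda s: len(s) - 1)   # one bigram per token after <s>
    print(f"PERPLEXITY of {name}")
    print(f"unigram: {pp_uni}")
    print(f"bigram: {pp_bi}\n")
```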

Code

The code should run without any arguments and should read the data files from the same directory; absolute paths must not be used. It should print values in the following format:

---------------- Toy dataset ---------------

=== UNIGRAM MODEL ===
- Unsmoothed -
a:0.0   b:0.0 ...
- Smoothed -
a:0.0   b:0.0 ...

=== BIGRAM MODEL === 
- Unsmoothed -
a 	b	 c    UNK 	</s> 
a 	0.0  ... 
b	... 
c 	... 
UNK	... 
<s>	...

- Smoothed -
a 	b	 c    UNK 	</s> 
a 	0.0  ... 
b	... 
c 	... 
UNK	... 
<s>	...

== SENTENCE PROBABILITIES == 
sent 		            uprob   biprob 
<s> a b c </s> 	        0.0 	0.0
 <s> a b b c c </s>     ...     ...
 
== TEST PERPLEXITY == 
unigram: 0.0 
bigram: 0.0

---------------- Actual dataset ----------------
PERPLEXITY of train.txt 
unigram: 0.0 
bigram: 0.0

PERPLEXITY of test.txt 
unigram: 0.0 
bigram: 0.0