
GauravBh1010tt / Dl Text

License: MIT
Text pre-processing library for deep learning (Keras, TensorFlow).

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Dl Text

Russian news corpus
Russian mass media stemmed texts corpus / a corpus of lemmatized (morphologically normalized) texts from Russian mass media
Stars: ✭ 76 (-36.13%)
Mutual labels:  nlp-machine-learning
Monkeylearn
⛔️ ARCHIVED ⛔️ 🐒 R package for text analysis with Monkeylearn 🐒
Stars: ✭ 95 (-20.17%)
Mutual labels:  nlp-machine-learning
Lemminflect
A python module for English lemmatization and inflection.
Stars: ✭ 105 (-11.76%)
Mutual labels:  nlp-machine-learning
Text classification
Text Classification Algorithms: A Survey
Stars: ✭ 1,276 (+972.27%)
Mutual labels:  nlp-machine-learning
Writeup Frontend
Beat Writer's Block with AI
Stars: ✭ 94 (-21.01%)
Mutual labels:  nlp-machine-learning
Codesearchnet
Datasets, tools, and benchmarks for representation learning of code.
Stars: ✭ 1,378 (+1057.98%)
Mutual labels:  nlp-machine-learning
Nlp Paper
Dialogue and speech topics within natural language processing: a collection of related papers (with reading notes), model reproductions, and data processing (code available in both TensorFlow and PyTorch versions)
Stars: ✭ 67 (-43.7%)
Mutual labels:  nlp-machine-learning
Lingo
package lingo provides the data structures and algorithms required for natural language processing
Stars: ✭ 113 (-5.04%)
Mutual labels:  nlp-machine-learning
Wiki Split
One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.
Stars: ✭ 95 (-20.17%)
Mutual labels:  nlp-machine-learning
Textaugmentation Gpt2
Fine-tuned pre-trained GPT2 for custom topic specific text generation. Such system can be used for Text Augmentation.
Stars: ✭ 104 (-12.61%)
Mutual labels:  nlp-machine-learning
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (-23.53%)
Mutual labels:  nlp-machine-learning
Datascience
It consists of examples, assignments discussed in data science course taken at algorithmica.
Stars: ✭ 92 (-22.69%)
Mutual labels:  nlp-machine-learning
Mrc book
Code accompanying the book "Machine Reading Comprehension: Algorithms and Practice"
Stars: ✭ 102 (-14.29%)
Mutual labels:  nlp-machine-learning
Summarus
Models for automatic abstractive summarization
Stars: ✭ 83 (-30.25%)
Mutual labels:  nlp-machine-learning
Atnre
Adversarial Training for Neural Relation Extraction
Stars: ✭ 108 (-9.24%)
Mutual labels:  nlp-machine-learning
Cracking The Da Vinci Code With Google Interview Problems And Nlp In Python
A guide on how to crack combinatorics puzzles shown in The Da Vinci Code movie using CS fundamentals and NLP
Stars: ✭ 75 (-36.97%)
Mutual labels:  nlp-machine-learning
Question Generation
Given a sentence automatically generate reading comprehension style factual questions from that sentence, such that the sentence contains answers to those questions.
Stars: ✭ 100 (-15.97%)
Mutual labels:  nlp-machine-learning
G Reader
Model for the 2018 Machine Reading Comprehension technology competition; ranked 6th by BLEU-4 and 14th by ROUGE-L among 1,000+ teams worldwide (no ensembling, no pre-trained word embeddings, no dropout)
Stars: ✭ 117 (-1.68%)
Mutual labels:  nlp-machine-learning
Bertqa Attention On Steroids
BertQA - Attention on Steroids
Stars: ✭ 112 (-5.88%)
Mutual labels:  nlp-machine-learning
Repo 2016
R, Python and Mathematica Codes in Machine Learning, Deep Learning, Artificial Intelligence, NLP and Geolocation
Stars: ✭ 103 (-13.45%)
Mutual labels:  nlp-machine-learning

DL-Text: pre-processing modules for deep learning (Keras, TensorFlow).

This repository consists of modules for pre-processing textual data. Examples are also given for training deep models (DNN, CNN, RNN, LSTM). The main functionalities are described below:

Dependencies

The required dependencies are listed in requirements.txt. You can install them manually or with the following command:

$ pip install -r requirements.txt

Prepare the data for NLP problems like sentiment analysis.

1. The data and labels look like this:

raw_data = ['this,,, is$$ a positive ..sentence','this is a ((*negative ,,@sentence',
        'yet another..'' positive$$ sentence','the last one is ...,negative']
labels = [1,0,1,0]

This type of data is common in sentiment-analysis problems. The first step is to clean it:

from dl_text import dl
data = []
for sent in raw_data:
    data.append(dl.clean(sent))
    
print(data)
['this is a positive sentence', 'this is a negative sentence', 
'yet another positive sentence', 'the last one is negative']
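
A rough idea of what dl.clean does (this is an assumption about its behaviour, not the library's actual code) is to lowercase the text, strip special characters, and squeeze spaces:

import re

def clean_sketch(sent):
    # Sketch: lowercase, keep only letters/digits/spaces, collapse repeated spaces.
    # dl.clean may differ in details; this just reproduces the example output above.
    sent = re.sub(r'[^a-z0-9 ]', ' ', sent.lower())
    return re.sub(r'\s+', ' ', sent).strip()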

Once the raw data is cleaned, the next step is to prepare inputs that can be passed to the deep models, using the following function:

data_inp = dl.process_data(sent_l = data, dimx = 10)

The process_data function prepares the data so that it can be used with deep models (a short sketch of the padding behaviour follows the parameter list below). It has the following parameters:

process_data(sent_l,sent_r,wordVec_model,dimx,dimy,vocab_size,embedding_dim)

where,

  • sent_l : data for the training model (if you are using only one channel, as in sentiment analysis, use this parameter)
  • sent_r : data for the second channel (discussed later)
  • wordVec_model : pre-trained word-vector embeddings (either GloVe or word2vec)
  • dimx and dimy : number of words to include per sentence (a sentence with fewer words is padded with 0; extra words are truncated)
  • vocab_size : number of unique words to include in the vocabulary
  • embedding_dim : size of the embeddings in wordVec_model
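
To illustrate the padding and truncation described above, here is a minimal sketch using Keras utilities (illustrative only, not dl.process_data's internals):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tok = Tokenizer()
tok.fit_on_texts(data)                    # `data` is the cleaned sentence list from above
seqs = tok.texts_to_sequences(data)       # words -> integer indices
padded = pad_sequences(seqs, maxlen=10)   # pad with 0 / truncate to dimx = 10 words
print(padded.shape)                       # (4, 10)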

2. Using pre-trained word vector embeddings

from dl_text import dl
import gensim

# for 50-dim glove embeddings use:
wordVec_model = dl.loadGloveModel('path_of_the_embeddings/glove.6B.50d.txt')

# for 300 dim word2vec embeddings use: 
wordVec_model = gensim.models.KeyedVectors.load_word2vec_format("path/GoogleNews-vectors-negative300.bin.gz",
                                                                 binary=True)

data_inp, embedding_matrix = dl.process_data(sent_l = data, wordVec_model = wordVec_model, dimx = 10)
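
The returned embedding_matrix is a (vocab_size, embedding_dim) array whose rows hold the pre-trained vectors for the vocabulary words. A rough way to build such a matrix (a sketch, assuming a word-to-index dict such as tok.word_index from the sketch above and dict-style lookup on wordVec_model):

import numpy as np

word_index = tok.word_index                              # assumed word -> index mapping
embedding_matrix = np.zeros((len(word_index) + 1, 50))   # 50 matches glove.6B.50d
for word, idx in word_index.items():
    if word in wordVec_model:                            # skip out-of-vocabulary words
        embedding_matrix[idx] = wordVec_model[word]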

3. Defining deep models

from dl_text import dl
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Conv1D, Lambda, Flatten, MaxPooling1D, concatenate

def model_dnn(dimx, embedding_matrix):
    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')   
    embed = dl.word2vec_embedding_layer(embedding_matrix)(inpx)
    flat_embed = Flatten()(embed)
    nnet_h = Dense(units=10,activation='sigmoid')(flat_embed)
    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)
    model = Model([inpx],nnet_out)
    model.compile(loss='mse',optimizer='adam')
    return model

def model_cnn(dimx, embedding_matrix):
    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')   
    embed = dl.word2vec_embedding_layer(embedding_matrix)(inpx)
    sent = Conv1D(filters=3,kernel_size=2,activation='relu')(embed)
    pool = MaxPooling1D()(sent)
    flat_embed = Flatten()(pool)
    nnet_h = Dense(units=10,activation='sigmoid')(flat_embed)
    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)
    model = Model([inpx],nnet_out)
    model.compile(loss='mse',optimizer='adam')
    return model
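
The dl.word2vec_embedding_layer call above presumably wraps a Keras Embedding layer initialised with the pre-trained matrix; a rough equivalent (an assumption, not the library's actual implementation):

from keras.layers import Embedding

def word2vec_embedding_layer_sketch(embedding_matrix):
    # Embedding layer whose weights are the pre-trained vectors.
    vocab_size, embedding_dim = embedding_matrix.shape
    return Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                     weights=[embedding_matrix], trainable=False)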

4. Training the models

data = ['this is a positive sentence', 'this is a negative sentence', 'yet another positive sentence', 'the last one is negative']
from keras.utils import to_categorical
labels = to_categorical([1,0,1,0])   # one-hot encode to match the 2-unit output layer

data_inp, embedding_matrix = dl.process_data(sent_l = data, wordVec_model = wordVec_model, dimx = 10)

model = model_dnn(dimx = 10, embedding_matrix = embedding_matrix)
model.fit(data_inp, labels)

model = model_cnn(dimx = 10, embedding_matrix = embedding_matrix)
model.fit(data_inp, labels)
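
Once fitted, the trained model can be used to score the processed inputs (for held-out text, clean and process it with the same settings first):

preds = model.predict(data_inp)
print(preds.shape)   # (4, 2): one score per class for each sentence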

Prepare the data for NLP problems like computing sentence similarity, question answering, etc.

1. Creating two channel models

These types of models use two data streams and can be applied to NLP tasks such as question answering, sentence similarity computation, etc. The data looks like this:

data_l = ['this is a positive sentence','this is a negative sentence', 
          'yet another positive sentence', 'the last one is negative']
          
data_r = ['positive words are good, better, best, etc.', 'negative words are bad, sad, etc.', 
          'feeling good', 'sooo depressed.']
         
labels = [1,0,1,0]

Here, data_l and data_r can be two sentences for computing sentence similarity, question-answer pairs for question answering, etc. Let's define a model for these types of tasks:

def model_cnn2(dimx, dimy, embedding_matrix):
    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')   
    embedx = dl.word2vec_embedding_layer(embedding_matrix)(inpx)
    inpy = Input(shape=(dimy,),dtype='int32',name='inpy')   
    embedy = dl.word2vec_embedding_layer(embedding_matrix)(inpy)
    
    sent_l = Conv1D(filters=3,kernel_size=2,activation='relu')(embedx)
    sent_r = Conv1D(filters=3,kernel_size=2,activation='relu')(embedy)
    pool_l = MaxPooling1D()(sent_l)
    pool_r = MaxPooling1D()(sent_r)
    
    combine = concatenate([pool_l, pool_r])
    flat_embed = Flatten()(combine)
    nnet_h = Dense(units=10,activation='sigmoid')(flat_embed)
    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)
    model = Model([inpx,inpy],nnet_out)
    model.compile(loss='mse',optimizer='adam')
    
    return model

2. Training a two-channel deep model

data_inp_l, data_inp_r, embedding_matrix = dl.process_data(sent_l = data_l, sent_r = data_r, 
                                                           wordVec_model = wordVec_model, dimx = 10, dimy = 10)

model = model_cnn2(dimx = 10, dimy = 10, embedding_matrix = embedding_matrix)
model.fit([data_inp_l, data_inp_r], to_categorical(labels))   # one-hot labels, as before

Hand-crafted features - these can be used for problems like sentence similarity, question answering, etc.

1. Computing lexical and semantic features.

>>> from dl_text import lex_sem_ft

>>> sent1 = 'i like natural language processing'
>>> sent2 = 'i like deep learning'

>>> lex_sem_ft.tokenize(sent1) # tokenizing a sentence
['i', 'like', 'natural', 'language', 'processing']

>>> lex_sem_ft.overlap(sent1,sent2) # number of common words
2

Functions currently present in the lex_sem_ft module are:

  • tokenize(sent): tokenize a given string
  • length(sent) : Number Of Words In A String (Returns Integer)
  • substringCheck(sent1, sent2) : Whether A String Is Subset Of Other (Returns 1 and 0)
  • overlap(sent1, sent2): Number Of Same Words In Two Sentences (Returns Float)
  • overlapSyn(sent1, sent2): Number Of Synonyms In Two Sentences (Returns Float)
  • train_BOW(lst) : Forming Bag Of Words (BOW) (Returns BOW Dictionary)
  • Sum_BOW(sent, dic) : Sum Of BOW Values For A Sentence (Returns Float)
  • train_bigram(lst) : Training Bigram Model (Returns Dictionary of Dictionaries)
  • sum_bigram(sent, model) : Total Sum Of Bigram Probability Of A Sentence (Returns Float)
  • train_trigram(lst): Training Trigram Model (Returns Dictionary of Dictionaries)
  • sum_trigram(sent, model) : Total Sum Of Trigram Probability Of A Sentence (Returns Float)
  • W2V_train(lst1, lst2) : Word2Vec Training (Returns Vector)
  • W2V_Vec(sent1, sent2, vec) : Returns The Difference Between Word2Vec Sum Of All The Words In Two Sentences (Returns Vec)
  • LDA_train(doc) : Trains LDA Model (Returns Model)
  • LDA(doc1, doc2, lda) : Returns Average Of Probability Of Word Present In LDA Model For Input Document (Returns Float)
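
A few more of these functions can be exercised in the same way; the exact argument conventions (for example, whether train_BOW expects raw sentences or token lists) are assumptions here, so treat this as a sketch:

corpus = [sent1, sent2]
syn_overlap = lex_sem_ft.overlapSyn(sent1, sent2)    # synonym overlap between the two sentences
bow = lex_sem_ft.train_BOW(corpus)                   # bag-of-words counts over the corpus
bow_sum = lex_sem_ft.Sum_BOW(sent1, bow)             # sum of BOW values for sent1
bigrams = lex_sem_ft.train_bigram(corpus)            # bigram model
bigram_score = lex_sem_ft.sum_bigram(sent1, bigrams) # total bigram probability of sent1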

2. Computing text readability features.

>>> from dl_text import rd_ft

>>> sent1 = 'i like natural language processing'
>>> rd_ft.CPW(sent1) # average characters per word
6.0
>>> rd_ft.ED('good','great') # edit distance between two words
4.0

Functions currently present in the rd_ft module are:

  • CPW(text) : Average Characters Per Word In A Sentence (Returns Float)
  • WPS(text) : Number Of Words Per Sentence (Returns Integer)
  • SPW(text) : Average Number Of Syllables In Sentence (Returns Float)
  • LWPS(text) : Long Words In A Sentence (Returns Integer)
  • LWR(text) : Fraction Of Long Words In A Sentence (Returns Float)
  • CWPS(text) : Number Of Complex Word Per Sentence (Returns Float)
  • DaleChall(text) : Dale-Chall Readability Index (Returns Float)
  • ED(s1, s2) : Edit Distance Value For Two Strings (Returns Integer)
  • nouns(text) : Get A List Of Nouns From String (Returns List Of Strings)
  • EditDist_Dist(t1,t2) : Average Edit Distance Value For Two Strings And The Average Edit Distance Between The Nouns Present In Them (Returns Float)
  • LCS_Len(a, b) : Longest Common Subsequence (Returns Integer)
  • LCW(t1, t2) : Length Of Longest Common Subsequence (Returns Integer)
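
A few more readability features on the same sentences (a sketch; the outputs depend on the module's internals):

wps = rd_ft.WPS(sent1)                 # number of words per sentence
spw = rd_ft.SPW(sent1)                 # average number of syllables
dale_chall = rd_ft.DaleChall(sent1)    # Dale-Chall readability index
lcs_len = rd_ft.LCS_Len(sent1, sent2)  # length of the longest common subsequence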

Training deep models using textual sentences and hand-crafted features.

1. Preparing the data

from dl_text import dl
from dl_text import lex_sem_ft
from dl_text import rd_ft
import numpy as np

data_l = ['this is a positive sentence','this is a negative sentence', 
          'yet another positive sentence', 'the last one is negative']
data_r = ['positive words are good, better, best, etc.', 'negative words are bad, sad, etc.', 
          'feeling good', 'sooo depressed.']
labels = [1,0,1,0]

wordVec_model = dl.loadGloveModel('path_of_the_embeddings/glove.6B.50d.txt')

all_feat = []
for i,j in zip(data_l, data_r):
    feat1 = lex_sem_ft.overlap(i, j)
    feat2 = lex_sem_ft.W2V_Vec(i, j, wordVec_model)
    feat3 = rd_ft.ED(i, j)
    feat4 = rd_ft.LCW(i, j)
    # one feature row per sentence pair: the scalar features plus the
    # flattened word2vec difference vector returned by W2V_Vec
    all_feat.append([feat1, feat3, feat4] + list(np.ravel(feat2)))
    
data_inp_l, data_inp_r, embedding_matrix = dl.process_data(sent_l = data_l, sent_r = data_r, 
                                                           wordVec_model = wordVec_model, dimx = 10, dimy = 10)

2. Let's define a model for incorporating external features with deep models.

def model_cnn_ft(dimx, dimy, dimft, embedding_matrix):
    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')   
    embedx = dl.word2vec_embedding_layer(embedding_matrix)(inpx)
    inpy = Input(shape=(dimy,),dtype='int32',name='inpy')   
    embedy = dl.word2vec_embedding_layer(embedding_matrix)(inpy)
    inpz = Input(shape=(dimft,),name='inpz')   # real-valued hand-crafted features
    
    sent_l = Conv1D(filters=3,kernel_size=2,activation='relu')(embedx)
    sent_r = Conv1D(filters=3,kernel_size=2,activation='relu')(embedy)
    pool_l = MaxPooling1D()(sent_l)
    pool_r = MaxPooling1D()(sent_r)
    
    flat_l = Flatten()(pool_l)
    flat_r = Flatten()(pool_r)
    combine = concatenate([flat_l, flat_r, inpz])   # join both channels with the features
    nnet_h = Dense(units=10,activation='sigmoid')(combine)
    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)
    model = Model([inpx,inpy,inpz],nnet_out)
    model.compile(loss='mse',optimizer='adam')
    
    return model

3. Training the deep model.

model = model_cnn_ft(dimx = 10, dimy = 10, dimft = len(all_feat[0]), embedding_matrix = embedding_matrix)
model.fit([data_inp_l, data_inp_r, np.array(all_feat)], to_categorical(labels))

Evaluation metrics - MAP, MRR, etc.

The mean average precision (MAP) and mean reciprocal rank (MRR) are computed over the ranked predictions for each query.

In our implementation we assume that the ground truth is arranged with the true labels first, followed by the false labels.
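
As a reference, a minimal sketch of how MAP and MRR can be computed from such 0/1 relevance lists (illustrative, not the library's implementation):

def average_precision(rels):
    # rels: 0/1 relevance of each ranked answer for one query
    hits, precisions = 0, []
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / float(rank))
    return sum(precisions) / max(hits, 1)

def mean_average_precision(pred):
    return sum(average_precision(r) for r in pred) / len(pred)

def mean_reciprocal_rank(pred):
    # reciprocal rank of the first relevant answer, averaged over queries
    recip = []
    for rels in pred:
        ranks = [i for i, rel in enumerate(rels, start=1) if rel]
        recip.append(1.0 / ranks[0] if ranks else 0.0)
    return sum(recip) / len(recip)

For pred = [[0,0,1],[0,0,1]] this gives MAP ≈ 0.33 and MRR ≈ 0.33 (the library reports MRR as a percentage, 33.33, in the example below).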

>>> from dl_text import metrics
>>> pred = [[0,0,1],[0,0,1]] # we have two queries with 3 answers for each; 1 - relevant, 0 - irrelevant

Converting the prediction list to a dictionary:

>>> dict1 = {}
>>> for i,j in enumerate(pred):
        dict1[i] = j
        
>>> metrics.Map(dict1)
0.33
>>> metrics.Mrr(dict1)
33.33

>>> pred = [[0,1,1],[0,1,0]]
>>> for i,j in enumerate(pred):
        dict1[i] = j
>>> metrics.Map(dict1)
0.5416666666666666
>>> metrics.Mrr(dict1)
50.0