IntuitionEngineeringTeam / Chars2vec

Licence: apache-2.0
A character-based word embedding model built on an RNN for handling real-world texts

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Chars2vec

Easy Bert
A Dead Simple BERT API for Python and Java (https://github.com/google-research/bert)
Stars: ✭ 106 (-18.46%)
Mutual labels:  natural-language-processing, language-model, natural-language-understanding
Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+3805.38%)
Mutual labels:  natural-language-processing, language-model, natural-language-understanding
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+56.15%)
Mutual labels:  natural-language-processing, language-model, natural-language-understanding
Spacy Transformers
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy
Stars: ✭ 919 (+606.92%)
Mutual labels:  natural-language-processing, language-model, natural-language-understanding
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (+72.31%)
Mutual labels:  natural-language-processing, embeddings, natural-language-understanding
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+42778.46%)
Mutual labels:  natural-language-processing, language-model, natural-language-understanding
Spark Nlp Models
Models and Pipelines for the Spark NLP library
Stars: ✭ 88 (-32.31%)
Mutual labels:  natural-language-processing, natural-language-understanding
Bert As Service
Mapping a variable-length sentence to a fixed-length vector using BERT model
Stars: ✭ 9,779 (+7422.31%)
Mutual labels:  natural-language-processing, natural-language-understanding
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+972.31%)
Mutual labels:  natural-language-processing, embeddings
Turkish Morphology
A two-level morphological analyzer for Turkish.
Stars: ✭ 121 (-6.92%)
Mutual labels:  natural-language-processing, natural-language-understanding
Intent classifier
Stars: ✭ 67 (-48.46%)
Mutual labels:  natural-language-processing, natural-language-understanding
Spokestack Python
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application.
Stars: ✭ 103 (-20.77%)
Mutual labels:  natural-language-processing, natural-language-understanding
Chatbot
A Russian-language chatbot
Stars: ✭ 106 (-18.46%)
Mutual labels:  natural-language-processing, natural-language-understanding
Greek Bert
A Greek edition of BERT pre-trained language model
Stars: ✭ 84 (-35.38%)
Mutual labels:  natural-language-processing, language-model
Dialogue Understanding
This repository contains PyTorch implementation for the baseline models from the paper Utterance-level Dialogue Understanding: An Empirical Study
Stars: ✭ 77 (-40.77%)
Mutual labels:  natural-language-processing, natural-language-understanding
Chinese nlu by using rasa nlu
Use RASA NLU to build a Chinese Natural Language Understanding System (NLU)
Stars: ✭ 99 (-23.85%)
Mutual labels:  natural-language-processing, natural-language-understanding
Mt Dnn
Multi-Task Deep Neural Networks for Natural Language Understanding
Stars: ✭ 72 (-44.62%)
Mutual labels:  natural-language-processing, natural-language-understanding
Xlnet extension tf
XLNet Extension in TensorFlow
Stars: ✭ 109 (-16.15%)
Mutual labels:  natural-language-processing, natural-language-understanding
Awesome Embedding Models
A curated list of awesome embedding models tutorials, projects and communities.
Stars: ✭ 1,486 (+1043.08%)
Mutual labels:  natural-language-processing, embeddings
Deep Nlp Seminars
Materials for deep NLP course
Stars: ✭ 113 (-13.08%)
Mutual labels:  natural-language-processing, natural-language-understanding

chars2vec

A character-based word embedding model built on an RNN

The chars2vec library can be very useful if you are dealing with texts containing abbreviations, slang, typos, or some other specific textual dataset. The chars2vec language model is based on the symbolic representation of words: the model maps each word to a vector of fixed length. These vector representations are obtained with a custom neural network trained on pairs of similar and dissimilar words. The network includes an LSTM that reads the sequence of characters in each word. The model maps similarly written words to proximal vectors; this approach enables the creation of an embedding in vector space for any sequence of characters. Chars2vec models do not keep a dictionary of embeddings but generate embedding vectors on the fly using the pretrained network.
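
To make the training scheme concrete, here is a minimal sketch of the siamese idea described above, assuming Keras; the layer sizes, the plain Euclidean distance, and the mse loss are illustrative assumptions, not the library's actual architecture:

import tensorflow as tf
from tensorflow import keras

NUM_CHARS = 60  # size of the character vocabulary (illustrative value)
EMB_DIM = 50    # dimension of the resulting word embedding (illustrative value)

# Shared encoder: a sequence of one-hot character vectors -> fixed-length vector
encoder = keras.Sequential([
    keras.Input(shape=(None, NUM_CHARS)),
    keras.layers.LSTM(EMB_DIM),
])

# Siamese pair: both words of a pair are encoded with the same weights
word_a = keras.Input(shape=(None, NUM_CHARS))
word_b = keras.Input(shape=(None, NUM_CHARS))
vec_a, vec_b = encoder(word_a), encoder(word_b)

# The distance between the two encodings is trained to match the
# 0 (similar) / 1 (dissimilar) label of each word pair
distance = keras.layers.Lambda(
    lambda v: tf.norm(v[0] - v[1], axis=1, keepdims=True)
)([vec_a, vec_b])

siamese = keras.Model(inputs=[word_a, word_b], outputs=distance)
siamese.compile(optimizer='adam', loss='mse')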

There are pretrained models of dimensions 50, 100, 150, 200 and 300 for the English language. The library provides a convenient API to train a model for an arbitrary set of characters. Read more about the architecture of chars2vec in the Hacker Noon article "Character-based language model for handling real world texts with spelling errors and human slang".

The model is available for Python 2.7 and 3.0+.

Installation

1. Build and install from source
Download the project source and run in your command line
>> python setup.py install
2. Via pip
Run in your command line
>> pip install chars2vec

Usage

Function chars2vec.load_model(str path) initializes the model from a directory and returns a chars2vec.Chars2Vec object. There are five pretrained English models with dimensions 50, 100, 150, 200 and 300. To load one of these pretrained models:

import chars2vec

# Load an Intuition Engineering pretrained model
# Model names: 'eng_50', 'eng_100', 'eng_150', 'eng_200', 'eng_300'
c2v_model = chars2vec.load_model('eng_50')

Method chars2vec.Chars2Vec.vectorize_words(words) returns a numpy.ndarray of shape (n_words, dim) containing the word embeddings.

words = ['list', 'of', 'words']

# Create word embeddings
word_embeddings = c2v_model.vectorize_words(words)
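
As a quick illustrative check (not taken from the library's documentation; the words below are arbitrary), similarly written words should receive nearby vectors, so a misspelling should land closer to the original word than an unrelated word:

import numpy as np

# Embed a word, a misspelling of it, and an unrelated word
emb = c2v_model.vectorize_words(['natural', 'natueral', 'dog'])

typo_dist = np.linalg.norm(emb[0] - emb[1])       # 'natural' vs 'natueral'
unrelated_dist = np.linalg.norm(emb[0] - emb[2])  # 'natural' vs 'dog'

# Expect typo_dist < unrelated_dist
print(typo_dist, unrelated_dist)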

Training

Function chars2vec.train_model(int emb_dim, X_train, y_train, model_chars) creates and trains a new chars2vec model and returns a chars2vec.Chars2Vec object.

Parameter emb_dim is the dimension of the embedding vectors.

Parameter X_train is a list or numpy.ndarray of word pairs. Parameter y_train is a list or numpy.ndarray of target values that describe the proximity of the words in each pair.

The training set (X_train, y_train) consists of pairs of "similar" and "not similar" words; a pair of "similar" words is labeled with target value 0, and a pair of "not similar" words with target value 1.

Parameter model_chars is the list of characters the model operates on. Characters that are not in the model_chars list will be ignored by the model.

Read more about chars2vec training and the generation of a training dataset in the Hacker Noon article mentioned above.
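
As an illustration only (the library's actual dataset generation procedure is described in that article), one simple way to produce such pairs is to inject random character substitutions into real words; the corrupt helper below is hypothetical:

import random
import string

def corrupt(word, n_edits=1):
    """Return a copy of word with n_edits random character
    substitutions (hypothetical helper)."""
    chars = list(word)
    for _ in range(n_edits):
        i = random.randrange(len(chars))
        chars[i] = random.choice(string.ascii_lowercase)
    return ''.join(chars)

vocab = ['mechanizing', 'discovery', 'protoplasmatic']

# Similar pairs: a word and a corrupted variant of it, target 0
X_similar = [(w, corrupt(w)) for w in vocab]

# Dissimilar pairs: corrupted variants of two different words, target 1
X_dissimilar = [(corrupt(a), corrupt(b)) for a, b in zip(vocab, vocab[1:])]

X_train = X_similar + X_dissimilar
y_train = [0] * len(X_similar) + [1] * len(X_dissimilar)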

Function chars2vec.save_model(c2v_model, str path_to_model) saves the trained model to the given directory.

import chars2vec

dim = 50
path_to_model = 'path/to/model/directory'

X_train = [('mecbanizing', 'mechanizing'),       # similar words, target 0
           ('dicovery', 'dis7overy'),            # similar words, target 0
           ('prot$oplasmatic', 'prtoplasmatic'), # similar words, target 0
           ('copulateng', 'lzateful'),           # dissimilar words, target 1
           ('estry', 'evadin6'),                 # dissimilar words, target 1
           ('cirrfosis', 'afear')                # dissimilar words, target 1
          ]

y_train = [0, 0, 0, 1, 1, 1]

model_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.',
               '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<',
               '=', '>', '?', '@', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
               'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
               'x', 'y', 'z']

# Create and train chars2vec model using given training data
my_c2v_model = chars2vec.train_model(dim, X_train, y_train, model_chars)

# Save your pretrained model
chars2vec.save_model(my_c2v_model, path_to_model)

# Load your pretrained model 
c2v_model = chars2vec.load_model(path_to_model)

Full code examples for model usage and training can be found in the example_usage.py and example_training.py files.
