
bakrianoo / Aravec

AraVec is a pre-trained distributed word representation (word embedding) open-source project that aims to provide the Arabic NLP research community with free-to-use, powerful word embedding models.

Projects that are alternatives to or similar to Aravec

Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+230.54%)
Mutual labels:  jupyter-notebook, word2vec, text-mining, gensim
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-82.01%)
Mutual labels:  word2vec, text-mining, gensim
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (-20.92%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal event logs
Stars: ✭ 169 (-29.29%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (-17.99%)
Mutual labels:  word2vec, text-mining, gensim
Twitter sentiment analysis word2vec convnet
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Network
Stars: ✭ 24 (-89.96%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Word2vec
Training Chinese word vectors with Word2vec. Word2vec was created by a team of researchers led by Tomas Mikolov at Google.
Stars: ✭ 48 (-79.92%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Book deeplearning in pytorch source
Stars: ✭ 236 (-1.26%)
Mutual labels:  jupyter-notebook, word2vec
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (Fasttext, Word2Vec)
Stars: ✭ 146 (-38.91%)
Mutual labels:  word2vec, gensim
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (-35.56%)
Mutual labels:  word2vec, gensim
Deep Math Machine Learning.ai
A blog about machine learning and deep learning algorithms and the math behind them, with machine learning algorithms written from scratch.
Stars: ✭ 173 (-27.62%)
Mutual labels:  jupyter-notebook, word2vec
Turkish Word2vec
Pre-trained Word2Vec Model for Turkish
Stars: ✭ 136 (-43.1%)
Mutual labels:  word2vec, gensim
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (-43.93%)
Mutual labels:  word2vec, gensim
Textfeatures
👷‍♂️ A simple package for extracting useful features from character objects 👷‍♀️
Stars: ✭ 148 (-38.08%)
Mutual labels:  word2vec, text-mining
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (-46.86%)
Mutual labels:  word2vec, gensim
Keywords2vec
Stars: ✭ 121 (-49.37%)
Mutual labels:  jupyter-notebook, text-mining
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+5240.17%)
Mutual labels:  word2vec, gensim
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (-26.78%)
Mutual labels:  jupyter-notebook, word2vec
Nlp profiler
A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (-24.27%)
Mutual labels:  jupyter-notebook, text-mining
Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (-9.21%)
Mutual labels:  jupyter-notebook, text-mining

AraVec 3.0

Advancements in neural networks have led to developments in fields like computer vision, speech recognition and natural language processing (NLP). One of the most influential recent developments in NLP is the use of word embeddings, where words are represented as vectors in a continuous space, capturing many syntactic and semantic relations among them.

AraVec is a pre-trained distributed word representation (word embedding) open-source project that aims to provide the Arabic NLP research community with free-to-use, powerful word embedding models. The first version of AraVec provided six different word embedding models built on top of three different Arabic content domains: Tweets, World Wide Web pages, and Wikipedia Arabic articles. The accompanying paper describes the resources used for building the models, the data cleaning techniques employed, the preprocessing steps carried out, and the details of the word embedding creation techniques used.

The third version of AraVec provides 16 different word embedding models built on top of two different Arabic content domains: Tweets and Wikipedia Arabic articles. The major difference between this version and the previous ones is that we produced two different types of models: unigram and n-gram models. We utilized a set of statistical techniques to generate the most commonly used n-grams of each data domain (a sketch of one common approach follows the list below).

  1. Twitter tweets
  2. Wikipedia Arabic articles

These were built on a total of more than 1,169,075,128 tokens.
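
The project doesn't spell out the exact statistical technique here; a minimal sketch using gensim's Phrases collocation detector (an assumption about tooling, not necessarily AraVec's exact pipeline) could look like this:

from gensim.models.phrases import Phrases, Phraser

# A toy corpus: an iterable of cleaned, tokenized sentences. The real
# pipeline would stream tweets / Wikipedia articles through the same
# normalization used for the unigram models.
sentences = [
    ['عمرو', 'دياب', 'مطرب', 'مصري'],
    ['اغاني', 'عمرو', 'دياب', 'الجديده'],
] * 100  # repeated so the pair counts clear min_count in this toy example

# Learn which adjacent token pairs co-occur often enough to be merged.
# The tiny threshold suits this toy corpus; real corpora need larger values.
bigram = Phraser(Phrases(sentences, min_count=5, threshold=0.01))

# Frequent pairs such as ('عمرو', 'دياب') come out as the single token
# 'عمرو_دياب'; the n-grams models store phrases as such underscore-joined tokens.
print(bigram[['عمرو', 'دياب', 'مطرب']])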

Take a look at how the n-grams models represent tokens: frequent multi-word phrases are stored as single tokens joined by underscores, e.g. the phrase عمرو دياب becomes the token عمرو_دياب.

[n-grams representation example image]


Please view the results page for more queries.


How To Use

These models were built using the gensim Python library. Here's how to load and use one of the models by following these steps:

  1. Install gensim >= 3.4 and nltk >= 3.2 using either pip or conda

pip install gensim nltk

conda install gensim nltk

  2. extract the compressed model files to a directory [ e.g. Twitter-CBOW ]
  3. keep the .npy files; you will load the main model file (the one without the .npy extension), as shown in the code below
  4. run the Python code below to load and use the model
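
For example, assuming the archive was extracted into a Twitter-CBOW directory (illustrative path and filename), loading looks like this:

import gensim

# Point gensim at the main model file; the companion .npy arrays sitting
# next to it are picked up automatically.
model = gensim.models.Word2Vec.load('Twitter-CBOW/full_uni_cbow_100_twitter.mdl')
print(model.wv.most_similar('مصر', topn=5))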


How to integrate AraVec with Spacy.io

Notebook Codes
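
The linked notebook walks through the integration; one common route (an assumption here, not necessarily the notebook's exact approach) is to export the gensim vectors to the plain-text word2vec format and build a spaCy pipeline from them:

import gensim

# Load one of the AraVec models (path is illustrative).
model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')

# Export to the plain-text word2vec format that spaCy can ingest.
model.wv.save_word2vec_format('aravec_twitter.txt', binary=False)

# Then, from the shell (spaCy v3):
#   python -m spacy init vectors ar aravec_twitter.txt ./aravec_spacy
# and load the result with:
#   import spacy
#   nlp = spacy.load('./aravec_spacy')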

Code Samples

# -*- coding: utf8 -*-
import gensim
import re
import numpy as np
from nltk import ngrams

from utilities import * # utilities.py ships with the repo; it provides clean_str() and calc_vec() used below

# ============================   
# ====== N-Grams Models ======

t_model = gensim.models.Word2Vec.load('models/full_grams_cbow_100_twitter.mdl')

# python 3.X
token = clean_str(u'ابو تريكه').replace(" ", "_")
# python 2.7
# token = clean_str(u'ابو تريكه'.decode('utf8', errors='ignore')).replace(" ", "_")

if token in t_model.wv:
    most_similar = t_model.wv.most_similar(token, topn=10)
    for term, score in most_similar:
        term = clean_str(term).replace(" ", "_")
        if term != token:
            print(term, score)

# تريكه 0.752911388874054
# حسام_غالي 0.7516342401504517
# وائل_جمعه 0.7244222164154053
# وليد_سليمان 0.7177559733390808
# ...

# =========================================
# == Get the most similar tokens to a compound query
# most similar to 
# عمرو دياب + الخليج - مصر

pos_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['عمرو دياب', 'الخليج'] if t.strip() != ""]
neg_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['مصر'] if t.strip() != ""]

vec = calc_vec(pos_tokens=pos_tokens, neg_tokens=neg_tokens, n_model=t_model, dim=t_model.vector_size)

most_sims = t_model.wv.similar_by_vector(vec, topn=10)
for term, score in most_sims:
    if term not in pos_tokens+neg_tokens:
        print(term, score)

# راشد_الماجد 0.7094649076461792
# ماجد_المهندس 0.6979793906211853
# عبدالله_رويشد 0.6942606568336487
# ...

# ====================
# ====================




# ============================== 
# ====== Uni-Grams Models ======

t_model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')

# python 3.X
token = clean_str(u'تونس')
# python 2.7
# token = clean_str('تونس'.decode('utf8', errors='ignore'))

most_similar = t_model.wv.most_similar(token, topn=10)
for term, score in most_similar:
    print(term, score)

# ليبيا 0.8864325284957886
# الجزائر 0.8783721327781677
# السودان 0.8573237061500549
# مصر 0.8277812600135803
# ...



# get a word vector
word_vector = t_model.wv[token]
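
The samples above import clean_str() and calc_vec() from the repository's utilities.py, which isn't reproduced on this page. As a rough sketch of what those two helpers need to do (an approximation, not the project's exact code):

import re
import numpy as np

def clean_str(text):
    # Approximate Arabic normalization: strip diacritics (tashkeel) and
    # tatweel, and unify common letter variants.
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # diacritics
    text = text.replace('\u0640', '')                  # tatweel
    text = re.sub(r'[إأآ]', 'ا', text)                 # alef variants
    text = text.replace('ى', 'ي')                      # alef maqsura -> yeh
    return re.sub(r'\s+', ' ', text).strip()

def calc_vec(pos_tokens, neg_tokens, n_model, dim):
    # Add the vectors of the positive tokens, subtract the negative ones,
    # skipping any token missing from the model's vocabulary.
    vec = np.zeros(dim)
    for t in pos_tokens:
        if t in n_model.wv:
            vec += n_model.wv[t]
    for t in neg_tokens:
        if t in n_model.wv:
            vec -= n_model.wv[t]
    return vec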

Download

To take a look at what we can retrieve from the n-grams models using some most-similar queries, please view the results page.

N-Grams Models

Model | No. of Documents | Vocabulary Size | Vector Size | Download
Twitter-CBOW | 66,900,000 | 1,476,715 | 300 | Download
Twitter-CBOW | 66,900,000 | 1,476,715 | 100 | Download
Twitter-SkipGram | 66,900,000 | 1,476,715 | 300 | Download
Twitter-SkipGram | 66,900,000 | 1,476,715 | 100 | Download
Wikipedia-CBOW | 1,800,000 | 662,109 | 300 | Download
Wikipedia-CBOW | 1,800,000 | 662,109 | 100 | Download
Wikipedia-SkipGram | 1,800,000 | 662,109 | 300 | Download
Wikipedia-SkipGram | 1,800,000 | 662,109 | 100 | Download


Unigrams Models

Model | No. of Documents | Vocabulary Size | Vector Size | Download
Twitter-CBOW | 66,900,000 | 1,259,756 | 300 | Download
Twitter-CBOW | 66,900,000 | 1,259,756 | 100 | Download
Twitter-SkipGram | 66,900,000 | 1,259,756 | 300 | Download
Twitter-SkipGram | 66,900,000 | 1,259,756 | 100 | Download
Wikipedia-CBOW | 1,800,000 | 320,636 | 300 | Download
Wikipedia-CBOW | 1,800,000 | 320,636 | 100 | Download
Wikipedia-SkipGram | 1,800,000 | 320,636 | 300 | Download
Wikipedia-SkipGram | 1,800,000 | 320,636 | 100 | Download


Citation

Abu Bakr Soliman, Kareem Eisa, and Samhaa R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in Proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

Read the Full-Text Paper
