Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → mayabot → Fasttext4j

mayabot / Fasttext4j

Licence: other

Implementing Facebook's FastText with java

Programming Languages

68154 projects - #9 most used programming language

9241 projects

Labels

word2vec fasttext

Projects that are alternatives of or similar to Fasttext4j

Embedding模型代码和学习笔记总结

Stars: ✭ 25 (-83.11%)

Mutual labels: word2vec, fasttext

Neural Networks

All about Neural Networks!

Stars: ✭ 34 (-77.03%)

Mutual labels: word2vec, fasttext

Sample code for training Word2Vec and FastText using wiki corpus and their pretrained word embedding..

Stars: ✭ 21 (-85.81%)

Mutual labels: word2vec, fasttext

cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Stars: ✭ 224 (+51.35%)

Mutual labels: word2vec, fasttext

NLP research：基于tensorflow的nlp深度学习项目，支持文本分类/句子匹配/序列标注/文本生成四大任务

Stars: ✭ 141 (-4.73%)

Mutual labels: word2vec, fasttext

Simple-Sentence-Similarity

Exploring the simple sentence similarity measurements using word embeddings

Stars: ✭ 99 (-33.11%)

Mutual labels: word2vec, fasttext

Lmdb Embeddings

Fast word vectors with little memory usage in Python

Stars: ✭ 404 (+172.97%)

Mutual labels: word2vec, fasttext

Embedding As Service

One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques

Stars: ✭ 151 (+2.03%)

Mutual labels: word2vec, fasttext

A fast, efficient universal vector embedding utility package.

Stars: ✭ 1,394 (+841.89%)

Mutual labels: word2vec, fasttext

Documents, papers and codes related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classificatin, Text Generation, Text Similarity, Machine Translation)，etc. All codes are implemented intensorflow 2.0.

Stars: ✭ 1,290 (+771.62%)

Mutual labels: word2vec, fasttext

An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.

Stars: ✭ 196 (+32.43%)

Mutual labels: word2vec, fasttext

FastText for Node.js

Stars: ✭ 127 (-14.19%)

Mutual labels: word2vec, fasttext

Pre-trained word vectors of 30+ languages

Stars: ✭ 2,043 (+1280.41%)

Mutual labels: word2vec, fasttext

🎨 🎨NLP 自然语言处理教程 🎨🎨 https://dataxujing.github.io/NLP-paper/

Stars: ✭ 23 (-84.46%)

Mutual labels: word2vec, fasttext

Topic Modelling for Humans

Stars: ✭ 12,763 (+8523.65%)

Mutual labels: word2vec, fasttext

Persian-Sentiment-Analyzer

Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی )

Stars: ✭ 30 (-79.73%)

Mutual labels: word2vec, fasttext

Finalfusion Rust

finalfusion embeddings in Rust

Stars: ✭ 35 (-76.35%)

Mutual labels: word2vec, fasttext

兜哥出品 <一本开源的NLP入门书籍>

Stars: ✭ 1,677 (+1033.11%)

Mutual labels: word2vec, fasttext

Wordembeddings Elmo Fasttext Word2vec

Using pre trained word embeddings (Fasttext, Word2Vec)

Stars: ✭ 146 (-1.35%)

Mutual labels: word2vec, fasttext

Hierarchical Attention Network

Implementation of Hierarchical Attention Networks in PyTorch

Stars: ✭ 120 (-18.92%)

Mutual labels: word2vec

View All Similar Projects ➔

FastText4j implementing FastText with Kotlin&Java. Fasttext is a library for text representation and classification by facebookresearch.

FastText4j是java&kotlin开发的fasttext算法库。Fasttext 是由facebookresearch开发的一个文本分类和词向量的库。

代码迁移至Mynlp项目 https://github.com/mayabot/mynlp/tree/master/fasttext 。

New code move to Mynlp project https://github.com/mayabot/mynlp/tree/master/fasttext

Features:

Implementing with java(kotlin)
Well-designed API
Compatible with original C++ model file (include quantizer compression model)
Provides train、test etc. api (almost the same performance)
Support for java file formats( can read file use mmap),read big model file with less memory

Features:

100%由kotlin&java实现
良好的API
兼容官方原版的预训练模型
提供所有的包括train、test等api
支持自有模型存储格式，可以使用MMAP快速加载大模型

Installing

Gradle

compile 'com.mayabot.mynlp:fastText4j:3.1.2'

Maven

<dependency>
  <groupId>com.mayabot.mynlp</groupId>
  <artifactId>fastText4j</artifactId>
  <version>3.1.2</version>
</dependency>

API

Train model | 训练模型

1. train Text classification model | 训练文本分类模型

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);
inputArgs.setEpoch(20);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

主要参数说明：

loss 损失函数
- hs 分层softmax.比完全softmax慢一点。分层softmax是完全softmax损失的近似值，它允许有效地训练大量类。还请注意，这种损失函数被认为是针对不平衡的label class，即某些label比其他label更多出现在样本。如果您的数据集每个label的示例数量均衡，则值得尝试使用负采样损失（-loss ns -neg 100）。
- ns NegativeSamplingLoss 负采样
- softmax default for Supervised model
- ova one-vs-all 可用于多分类.“OneVsAll” loss function for multi-label classification, which corresponds to the sum of binary cross-entropy computed independently for each label.
lr 学习率learn rate
dim 向量维度
epoch 迭代次数训练数据格式:

where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string label. This will output two files: model.bin and model.vec. Once the model was trained, you can evaluate it by computing the precision and recall at k ([email protected] and [email protected]) on a test set using:

训练数据是个纯文本文件，每一行一条数据，词之间使用空格分开，每一行必须包含至少一个label标签。默认情况下，是一个带__label__前缀的字符串。

__label__tag1 saints rally to beat 49ers the new orleans saints survived it all hurricane ivan

__label__积极这个商品很好用。

2. word representation learning | 词向量学习

支持cow和Skipgram两种模型

FastText.trainCow(file,inputArgs)
//Or
FastText.trainSkipgram(file,inputArgs)

Test model

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

model.test(new File("data/agnews/ag.test"),1,0,true);

output:

F1-Score : 0.968954 Precision : 0.960683 Recall : 0.977368  __label__2
F1-Score : 0.882043 Precision : 0.882508 Recall : 0.881579  __label__3
F1-Score : 0.890173 Precision : 0.888772 Recall : 0.891579  __label__4
F1-Score : 0.917353 Precision : 0.926463 Recall : 0.908421  __label__1
N	7600
[email protected]	0.915
[email protected]	0.915

Save model | 保存模型文件

FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.saveModel(new File("path/data.model"));

Load model | 加载模型

//load from java format 
FastText model = FastText.Companion.loadModel(new File(""),false);

//load from c++ format
FastText model = FastText.Companion.loadCppModel(new File("path/wiki.bin"))

Quantizer compression | 乘积量化压缩

分类的模型可以压缩模型体积

//load from java format 
FastText qmodel = model.quantize(2, false, false);

Predict | 预测分类

List<ScoreLabelPair> result = model.predict(Arrays.asList("fastText 在 预测 标签 时 使用 了 非线性 激活 函数".split(" ")), 5,0);

Nearest Neighbor Search | 词向量近邻

List<ScoreLabelPair> result = model.nearestNeighbor("中国",5);

Analogies | 类比

By giving three words A, B and C, return the nearest words in terms of semantic distance and their similarity list, under the condition of (A - B + C).

List<ScoreLabelPair> result = fastText.analogies("国王","皇后","男",5);

Parameter | 参数

InputArgs可以设置各种参数，兼容fasttext原版参数。

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)

Resource

Official pre-trained model

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 148

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (12) 🔗