All Projects → mayabot → Fasttext4j

mayabot / Fasttext4j

Licence: other
Implementing Facebook's FastText with java

Programming Languages

java
68154 projects - #9 most used programming language
kotlin
9241 projects

Projects that are alternatives of or similar to Fasttext4j

Embedding
Embedding模型代码和学习笔记总结
Stars: ✭ 25 (-83.11%)
Mutual labels:  word2vec, fasttext
Neural Networks
All about Neural Networks!
Stars: ✭ 34 (-77.03%)
Mutual labels:  word2vec, fasttext
word embedding
Sample code for training Word2Vec and FastText using wiki corpus and their pretrained word embedding..
Stars: ✭ 21 (-85.81%)
Mutual labels:  word2vec, fasttext
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (+51.35%)
Mutual labels:  word2vec, fasttext
Nlp research
NLP research:基于tensorflow的nlp深度学习项目,支持文本分类/句子匹配/序列标注/文本生成 四大任务
Stars: ✭ 141 (-4.73%)
Mutual labels:  word2vec, fasttext
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (-33.11%)
Mutual labels:  word2vec, fasttext
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+172.97%)
Mutual labels:  word2vec, fasttext
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (+2.03%)
Mutual labels:  word2vec, fasttext
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+841.89%)
Mutual labels:  word2vec, fasttext
Nlp Journey
Documents, papers and codes related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classificatin, Text Generation, Text Similarity, Machine Translation),etc. All codes are implemented intensorflow 2.0.
Stars: ✭ 1,290 (+771.62%)
Mutual labels:  word2vec, fasttext
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+32.43%)
Mutual labels:  word2vec, fasttext
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-14.19%)
Mutual labels:  word2vec, fasttext
Wordvectors
Pre-trained word vectors of 30+ languages
Stars: ✭ 2,043 (+1280.41%)
Mutual labels:  word2vec, fasttext
NLP-paper
🎨 🎨NLP 自然语言处理教程 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-84.46%)
Mutual labels:  word2vec, fasttext
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+8523.65%)
Mutual labels:  word2vec, fasttext
Persian-Sentiment-Analyzer
Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی )
Stars: ✭ 30 (-79.73%)
Mutual labels:  word2vec, fasttext
Finalfusion Rust
finalfusion embeddings in Rust
Stars: ✭ 35 (-76.35%)
Mutual labels:  word2vec, fasttext
Nlp
兜哥出品 <一本开源的NLP入门书籍>
Stars: ✭ 1,677 (+1033.11%)
Mutual labels:  word2vec, fasttext
Wordembeddings Elmo Fasttext Word2vec
Using pre trained word embeddings (Fasttext, Word2Vec)
Stars: ✭ 146 (-1.35%)
Mutual labels:  word2vec, fasttext
Hierarchical Attention Network
Implementation of Hierarchical Attention Networks in PyTorch
Stars: ✭ 120 (-18.92%)
Mutual labels:  word2vec

FastText4j implementing FastText with Kotlin&Java. Fasttext is a library for text representation and classification by facebookresearch.

FastText4j是java&kotlin开发的fasttext算法库。Fasttext 是由facebookresearch开发的一个文本分类和词向量的库。

代码迁移至Mynlp项目 https://github.com/mayabot/mynlp/tree/master/fasttext

New code move to Mynlp project https://github.com/mayabot/mynlp/tree/master/fasttext

Features:

  • Implementing with java(kotlin)
  • Well-designed API
  • Compatible with original C++ model file (include quantizer compression model)
  • Provides train、test etc. api (almost the same performance)
  • Support for java file formats( can read file use mmap),read big model file with less memory

Features:

  • 100%由kotlin&java实现
  • 良好的API
  • 兼容官方原版的预训练模型
  • 提供所有的包括train、test等api
  • 支持自有模型存储格式,可以使用MMAP快速加载大模型

Installing

Gradle

compile 'com.mayabot.mynlp:fastText4j:3.1.2'

Maven

<dependency>
  <groupId>com.mayabot.mynlp</groupId>
  <artifactId>fastText4j</artifactId>
  <version>3.1.2</version>
</dependency>

API

Train model | 训练模型

1. train Text classification model | 训练文本分类模型

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);
inputArgs.setEpoch(20);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

主要参数说明:

  • loss 损失函数
    • hs 分层softmax.比完全softmax慢一点。 分层softmax是完全softmax损失的近似值,它允许有效地训练大量类。 还请注意,这种损失函数被认为是针对不平衡的label class,即某些label比其他label更多出现在样本。 如果您的数据集每个label的示例数量均衡,则值得尝试使用负采样损失(-loss ns -neg 100)。
    • ns NegativeSamplingLoss 负采样
    • softmax default for Supervised model
    • ova one-vs-all 可用于多分类.“OneVsAll” loss function for multi-label classification, which corresponds to the sum of binary cross-entropy computed independently for each label.
  • lr 学习率learn rate
  • dim 向量维度
  • epoch 迭代次数 训练数据格式:

where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string label. This will output two files: model.bin and model.vec. Once the model was trained, you can evaluate it by computing the precision and recall at k ([email protected] and [email protected]) on a test set using:

训练数据是个纯文本文件,每一行一条数据,词之间使用空格分开,每一行必须包含至少一个label标签。默认 情况下,是一个带__label__前缀的字符串。

__label__tag1 saints rally to beat 49ers the new orleans saints survived it all hurricane ivan

__label__积极 这个 商品 很 好 用 。

2. word representation learning | 词向量学习

支持cow和Skipgram两种模型

FastText.trainCow(file,inputArgs)
//Or
FastText.trainSkipgram(file,inputArgs)

Test model

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

model.test(new File("data/agnews/ag.test"),1,0,true);

output:

F1-Score : 0.968954 Precision : 0.960683 Recall : 0.977368  __label__2
F1-Score : 0.882043 Precision : 0.882508 Recall : 0.881579  __label__3
F1-Score : 0.890173 Precision : 0.888772 Recall : 0.891579  __label__4
F1-Score : 0.917353 Precision : 0.926463 Recall : 0.908421  __label__1
N	7600
[email protected]	0.915
[email protected]	0.915

Save model | 保存模型文件

FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.saveModel(new File("path/data.model"));

Load model | 加载模型

//load from java format 
FastText model = FastText.Companion.loadModel(new File(""),false);
//load from c++ format
FastText model = FastText.Companion.loadCppModel(new File("path/wiki.bin"))

Quantizer compression | 乘积量化压缩

分类的模型可以压缩模型体积
//load from java format 
FastText qmodel = model.quantize(2, false, false);

Predict | 预测分类

List<ScoreLabelPair> result = model.predict(Arrays.asList("fastText 在 预测 标签 时 使用 了 非线性 激活 函数".split(" ")), 5,0);

Nearest Neighbor Search | 词向量近邻

List<ScoreLabelPair> result = model.nearestNeighbor("中国",5);

Analogies | 类比

By giving three words A, B and C, return the nearest words in terms of semantic distance and their similarity list, under the condition of (A - B + C).

List<ScoreLabelPair> result = fastText.analogies("国王","皇后","男",5);

Parameter | 参数

InputArgs可以设置各种参数,兼容fasttext原版参数。

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)

Resource

Official pre-trained model

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].