
shibing624 / Text2vec

Licence: apache-2.0
text2vec: Chinese text to vector. (A text vectorization tool covering word embeddings, sentence embeddings, and sentence similarity calculation.)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Text2vec

Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+732.26%)
Mutual labels:  similarity, word2vec
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (+34.84%)
Mutual labels:  similarity, word2vec
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-18.06%)
Mutual labels:  word2vec
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (-2.58%)
Mutual labels:  word2vec
Word2vec
Go library for performing computations in word2vec binary models
Stars: ✭ 143 (-7.74%)
Mutual labels:  word2vec
Pytorch word2vec
Use pytorch to implement word2vec
Stars: ✭ 133 (-14.19%)
Mutual labels:  word2vec
Skip Thoughts.torch
Porting of Skip-Thoughts pretrained models from Theano to PyTorch & Torch7
Stars: ✭ 146 (-5.81%)
Mutual labels:  word2vec
Hierarchical Attention Network
Implementation of Hierarchical Attention Networks in PyTorch
Stars: ✭ 120 (-22.58%)
Mutual labels:  word2vec
Skip Gram Pytorch
A complete pytorch implementation of skip-gram
Stars: ✭ 153 (-1.29%)
Mutual labels:  word2vec
Nlp research
NLP research: a TensorFlow-based NLP deep learning project supporting four tasks: text classification, sentence matching, sequence labeling, and text generation
Stars: ✭ 141 (-9.03%)
Mutual labels:  word2vec
Word2vec Spam Filter
Using word vectors to classify spam messages
Stars: ✭ 149 (-3.87%)
Mutual labels:  word2vec
Word2vec
A further wrapper around Word2VEC_java by ansj that also implements common word similarity and sentence similarity calculations.
Stars: ✭ 136 (-12.26%)
Mutual labels:  word2vec
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (-13.55%)
Mutual labels:  word2vec
Fasttext4j
Implementing Facebook's FastText with java
Stars: ✭ 148 (-4.52%)
Mutual labels:  word2vec
Scattertext Pydata
Notebooks for the Seattle PyData 2017 talk on Scattertext
Stars: ✭ 132 (-14.84%)
Mutual labels:  word2vec
Graphwavemachine
A scalable implementation of "Learning Structural Node Embeddings Via Diffusion Wavelets (KDD 2018)".
Stars: ✭ 151 (-2.58%)
Mutual labels:  word2vec
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (-18.06%)
Mutual labels:  word2vec
Ml
sourced.ml is a library and command line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees
Stars: ✭ 136 (-12.26%)
Mutual labels:  word2vec
Wordembeddings Elmo Fasttext Word2vec
Using pre trained word embeddings (Fasttext, Word2Vec)
Stars: ✭ 146 (-5.81%)
Mutual labels:  word2vec
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (-0.65%)
Mutual labels:  word2vec

text2vec

text2vec: Chinese text to vector. (A text vectorization tool covering word embeddings and sentence embeddings.)

Feature

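Text vector representation

  • Character and word level: word2vec vectors for characters and words are obtained from the large-scale, high-quality Chinese word embedding data open-sourced by Tencent AI Lab (the lightweight version of the 8-million-word set; file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe).
  • Sentence level: computed as the average of the word embeddings of all words in the sentence.
  • Document level: can be obtained with doc2vec from the gensim library; it is rarely used, so this project does not implement it.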
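
The sentence-level step can be illustrated with a minimal sketch. It assumes the sentence is already segmented into tokens, and it uses a made-up 3-dimensional lookup table (`TOY_EMBEDDINGS`) in place of the Tencent vectors; the function name `average_embedding` is hypothetical and is not part of the text2vec API.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional embeddings, invented for illustration only.
# Real Tencent vectors have many more dimensions.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "花呗": [0.2, 0.4, 0.0],
    "绑定": [0.0, 0.2, 0.4],
    "银行卡": [0.4, 0.0, 0.2],
}


def average_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


sentence_vector = average_embedding(["花呗", "绑定", "银行卡"], TOY_EMBEDDINGS, dim=3)
print(sentence_vector)
```

Tokens missing from the table are skipped here; how out-of-vocabulary words should be handled is a design choice this sketch does not settle.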

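Text similarity calculation

  • Baseline method: the simplest way to estimate the semantic similarity between two sentences is to average the word embeddings of all words in each sentence and then compute the cosine similarity between the two resulting sentence vectors.
  • Word Mover's Distance (WMD): uses the word embeddings of two texts to measure the minimum distance that the words of one text must travel in semantic space to reach the words of the other text.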
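
Both measures can be sketched in a few lines of plain Python. The cosine function follows the standard definition. The distance function is the relaxed Word Mover's Distance, a cheap lower bound in which every word simply moves to its nearest counterpart; the full WMD requires solving an optimal transport problem and is not shown. Uniform word weights and the function names are assumptions made for this illustration, not the text2vec implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words_a: List[Vector], words_b: List[Vector]) -> float:
    """Relaxed WMD with uniform word weights (a lower bound on the full WMD).

    Each word vector travels to its nearest neighbour in the other text.
    The larger of the two directional costs is returned.
    """
    def one_way(src: List[Vector], dst: List[Vector]) -> float:
        return sum(min(euclidean(s, d) for d in dst) for s in src) / len(src)

    return max(one_way(words_a, words_b), one_way(words_b, words_a))


# Toy word vectors, invented for illustration.
text_a = [[1.0, 0.0], [0.0, 1.0]]
text_b = [[1.0, 0.0], [0.0, 2.0]]
print(cosine_similarity([1.0, 1.0], [1.0, 2.0]))
print(relaxed_wmd(text_a, text_b))
```

Note the opposite orientation of the two measures: a higher cosine similarity means more similar texts, while a lower distance means more similar texts.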

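Similarity between a query and docs

  • rank_bm25 method: uses a variant of the BM25 algorithm to score the similarity between the query and each document, yielding a ranking of the docs.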
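
For orientation, here is a minimal sketch of classic Okapi BM25 scoring in plain Python. It is not the rank_bm25 code and may differ from the variant this project uses. The IDF form with `+ 1.0` inside the logarithm (which keeps scores non-negative), the defaults `k1=1.5` and `b=0.75`, and the hand-made word segmentation of the example sentences are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query_tokens: Sequence[str],
    corpus_tokens: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every document in corpus_tokens against query_tokens with Okapi BM25."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalisation: longer-than-average documents are penalised.
            denom = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append(score)
    return scores


# Hand-segmented example sentences (the segmentation is an assumption).
corpus = [
    ["如何", "更换", "花呗", "绑定", "银行卡"],
    ["花呗", "更改", "绑定", "银行卡"],
    ["我", "什么", "时候", "开通", "了", "花呗"],
]
query = corpus[0]
scores = bm25_scores(query, corpus)
print(scores)
```

Because BM25 matches exact tokens, the result depends heavily on how the text is segmented; a different tokenizer can change the ranking.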

Result

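Text similarity calculation

  • Baseline method

Although the baseline method for text similarity is very simple, taking the cosine similarity between averaged word embeddings performs very well. The experiments lead to the following conclusions:

1. Plain word2vec embeddings perform better than GloVe embeddings.
2. With word2vec, it is unclear whether a stop word list or TF-IDF weighting helps. On the STS dataset they help a little; on SICK they do not. Simply averaging all word2vec embeddings without weighting performs well.
3. With GloVe, a stop word list is essential for good results. TF-IDF weighting does not help.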
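
To make the two options compared in these conclusions concrete, the sketch below filters stop words and weights each word vector by a smoothed IDF before averaging. The smoothing formula, the function names, and the toy data are assumptions chosen for illustration; they are not taken from the experiments summarized here.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Optional, Sequence


def idf_weights(corpus_tokens: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus_tokens)
    doc_freq: Counter = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def weighted_average(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
    weights: Optional[Dict[str, float]] = None,
    stop_words: FrozenSet[str] = frozenset(),
) -> List[float]:
    """Average word vectors, skipping stop words and applying optional weights.

    Repeated tokens contribute once per occurrence, so term frequency is
    reflected implicitly; `weights` supplies the IDF part.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in stop_words or token not in embeddings:
            continue
        w = 1.0 if weights is None else weights.get(token, 1.0)
        for i, x in enumerate(embeddings[token]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return [0.0] * dim
    return [x / weight_sum for x in total]


# Toy 2-dimensional data, invented for illustration.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "the": [5.0, 5.0]}
toy_weights = idf_weights([["a", "b"], ["a"]])
print(weighted_average(["the", "a", "b"], toy_embeddings, 2,
                       weights=toy_weights, stop_words=frozenset({"the"})))
```

Calling `weighted_average` with `weights=None` and an empty stop word set reduces to the plain unweighted mean.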

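The baseline method performs very well

  • Word Mover's Distance

Based on our results, there seems to be little reason to use Word Mover's Distance, because the baseline method above already performs well. Only on the STS-TEST dataset, and only with a stop word list, can Word Mover's Distance compete with the simple baseline.

The performance of Word Mover's Distance is disappointing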

Install

pip3 install text2vec

or

git clone https://github.com/shibing624/text2vec.git
cd text2vec
python3 setup.py install

Usage:

  • download embedding file:

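Choose any one of the following word embedding files:

Lightweight Tencent word embeddings (binary, 111 MB); place at ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin

Full Tencent word embeddings (6.78 GB); place at ~/.text2vec/datasets/Tencent_AILab_ChineseEmbedding.txt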
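
Before running the examples, it can help to confirm that one of the files is in place. The helper below is a hypothetical convenience function, not part of text2vec; it only checks the two paths listed above and makes no claim about how the library itself locates the file.

```python
from pathlib import Path
from typing import Optional

# File names taken from the download instructions above.
CANDIDATE_FILES = (
    "light_Tencent_AILab_ChineseEmbedding.bin",
    "Tencent_AILab_ChineseEmbedding.txt",
)


def find_embedding_file(datasets_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the first candidate embedding file that exists, or None."""
    if datasets_dir is None:
        datasets_dir = Path.home() / ".text2vec" / "datasets"
    for name in CANDIDATE_FILES:
        candidate = datasets_dir / name
        if candidate.is_file():
            return candidate
    return None


found = find_embedding_file()
print(found if found else "No embedding file found in ~/.text2vec/datasets")
```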

  • get text vector

import text2vec

char = '卡'
result = text2vec.encode(char)
print(type(result))
print(char, result)

word = '银行卡'
print(word, text2vec.encode(word))

a = '如何更换花呗绑定银行卡'
emb = text2vec.encode(a)
print(a, emb)

output:

<class 'numpy.ndarray'>
卡 [ 0.06761453 -0.10960816 -0.04829824  0.0156597  -0.09412017 -0.04805465
 -0.03369278 -0.07476041 -0.01600934  0.03106228 -0.03929523 -0.00965548
 -0.03117254 -0.02869355 -0.00639713  0.07005136  0.11992852 -0.07186633
 -0.05484002  0.09301733 -0.10434714  0.00577549  0.00986202  0.0843968
  0.03324864  0.07128087 -0.11527051  0.03340416  0.12104423  0.08858272
  0.03000315  0.00492779 -0.04545676  0.02414824 -0.0826384   0.05801052
  0.088882    0.00510528 -0.06728774 -0.01742942 -0.04770923  0.00521709
  0.11781982 -0.02409987 -0.00354232 -0.03823944 -0.01178038 -0.10880683
  0.01857707 -0.10638241 -0.10065196  0.1056949  -0.10972064  0.01115232
  0.04364643 -0.04822592  0.07247022  0.14709574  0.17352197 -0.05960387
  0.06066831  0.07324826  0.02732228 -0.1108685  -0.03131209 -0.04499397
 -0.03083643  0.07644227  0.10926916 -0.10941514  0.03769413 -0.10192162
  0.08157039  0.10359883 -0.0415872  -0.03513794  0.04804511 -0.07394598
  0.003403    0.04120627  0.04691189 -0.03527349  0.02601535  0.02506382
 -0.01905228 -0.05973076  0.00378947  0.01443153  0.00297571 -0.00657683
  0.08403873  0.05857912 -0.08672293  0.00990506 -0.06921919 -0.02851319
 -0.04588227 -0.06699555  0.0187632  -0.03700593  0.05530968 -0.04083645
  0.09544463 -0.03611298 -0.04136911  0.09089021 -0.03716478 -0.12875827
  0.01721622 -0.08194245  0.03708403  0.04734006 -0.02745273  0.1301027
  0.08772593  0.06858801 -0.08353757 -0.1083589  -0.03845153 -0.03337643
  0.07253522 -0.13127407 -0.11651333  0.02041268  0.02025139  0.01833059
 -0.15489452  0.00692403 -0.02396565 -0.10695435  0.02188756 -0.01458904
  0.01013779 -0.09879749 -0.01108354 -0.00535841 -0.03180149 -0.03848969
 -0.06829872 -0.11322614  0.13497414 -0.07431137 -0.06970305 -0.06039077
  0.01351372 -0.02057552  0.08803453 -0.00273833  0.08643718 -0.02149998
 -0.10168735  0.01917252  0.01934091 -0.07680167 -0.04372253  0.05902927
  0.0758426  -0.1379614   0.00978704  0.05735982  0.18015645 -0.05458089
 -0.01428355  0.11639019  0.15173467  0.067262   -0.09723032  0.0922464
  0.03147848 -0.07542663  0.07087953 -0.03645951 -0.00768409  0.11529247
  0.07308053  0.058521   -0.12904912 -0.04262946 -0.10368602  0.01382875
  0.06438235 -0.00424737 -0.07760412  0.02677475 -0.01109442 -0.02379926
  0.11002368  0.01828688 -0.0141602   0.00041908  0.08470961  0.0381649
  0.0619331  -0.02740148 -0.04377156  0.17106605 -0.02689633 -0.05457557
 -0.12677824 -0.0017025 ]
银行卡 [ 0.0020064  -0.12582362  0.05318305  0.0283359   0.01744255  0.07683774
 -0.05338099  0.00818257 -0.11905241  0.09063647 -0.01366772 -0.01847255
  0.05850454 -0.06208643  0.0307713  -0.06396349  0.03956702 -0.14173642
  0.01994346  0.00745677 -0.02944688  0.0437518   0.01580179  0.10437636
  0.0680668   0.08079242  0.01875649  0.00628908  0.14422947 -0.03093161
  0.02323569  0.06238109  0.00877618  0.05581926 -0.06325411  0.10076351
  0.03685934  0.04649306  0.02610702 -0.08644025 -0.03542202 -0.04404241
  0.10986771 -0.01169109  0.0201507  -0.07085665 -0.21713373 -0.0530113
 -0.05043821 -0.08462109 -0.07109319  0.02657342 -0.03226342 -0.05294865
  0.04772363 -0.06233726  0.08596623  0.16678461  0.05701409  0.02060115
  0.08606747  0.10063774 -0.02885185  0.02087508 -0.1313669  -0.11625469
 -0.03857704 -0.03816661  0.10073588 -0.08352916 -0.02168426  0.03696534
  0.08503008  0.08592335  0.04184807 -0.0035595  -0.01216846 -0.0741415
  0.02103992 -0.06390513 -0.02665631  0.01042432 -0.03313072  0.02231813
 -0.0034604  -0.08202203 -0.02120428  0.01524321 -0.0123321  -0.07683774
  0.06071484  0.05571516 -0.01901732 -0.01585849 -0.03093566  0.00175986
  0.06963967  0.02613965 -0.02027838 -0.03602182  0.0215654  -0.1327468
 -0.02682925 -0.04319679 -0.04858855  0.05294579 -0.04113655 -0.14582972
 -0.00343039  0.13475367  0.06273863  0.10220227 -0.03809872 -0.01009584
  0.05028957  0.09902795  0.04951636 -0.1509628   0.01154674 -0.12737814
  0.04874172 -0.18875733  0.01903876 -0.11057945  0.03252878  0.04331398
 -0.14611772  0.0029323  -0.00279414 -0.01302052  0.05997236 -0.07317081
 -0.06654229 -0.02533785 -0.01752687 -0.01622008  0.04656905 -0.0966278
 -0.0231659   0.05697217 -0.00970399 -0.03527814 -0.11501626 -0.07243834
  0.01447881 -0.11292244  0.07181066  0.11611748  0.07697328  0.0269786
 -0.04752902  0.13418843  0.13433063  0.06412594 -0.01221038  0.03821068
  0.15017886  0.00023273  0.15340893 -0.0379265   0.09783574 -0.01188785
 -0.10489922  0.04799685 -0.01728176 -0.00187991 -0.0500335   0.08492599
  0.04882556  0.0490166   0.00101737 -0.11152513 -0.08207658 -0.00050094
  0.09693913 -0.00232869 -0.03777596 -0.0345621   0.02627709  0.02142057
  0.06307712  0.07205983 -0.0689322   0.08850621  0.03687197 -0.00526052
 -0.02558987  0.0288709  -0.00789554 -0.1611513   0.0549803   0.03240443
 -0.1133293  -0.01580537  0.01606978  0.07134497  0.07844324  0.03663138
 -0.13035     0.09727262]
如何更换花呗绑定银行卡 [ 0.0412493  -0.12568748  0.01919322  0.05268444  0.0358183   0.0199526
 -0.05216572  0.03162935 -0.03498344  0.08230551 -0.00829105  0.08121108
  0.00221392 -0.00790647  0.00598419 -0.01487507  0.03209482 -0.12614128
  0.04561881  0.01181159  0.00836652  0.02594305  0.03038604  0.0664252
  0.04508034  0.06207125 -0.06020657  0.03175591  0.09905406 -0.00688738
  0.06645215  0.03975951  0.02941401  0.03271953 -0.04102795 -0.02124222
  0.05571816 -0.00524229 -0.03995117 -0.02624511  0.02869953  0.01845553
  0.12089871  0.03216907 -0.03624259 -0.05544149 -0.13717413 -0.10208185
  0.01515093 -0.05986634 -0.07403937 -0.01162395 -0.05105473 -0.0061044
 -0.00550084  0.03310549  0.03326062  0.09589361  0.06836328 -0.0232545
  0.05078406  0.15467706 -0.03573247  0.03850095 -0.12189175 -0.02785331
 -0.0493734  -0.02608894  0.0183759  -0.0705118  -0.0133743  -0.01127687
  0.09444313  0.10079495  0.02870584 -0.00436859  0.0310561   0.01119687
  0.04413298 -0.04008033 -0.01733718  0.04628557  0.02387342  0.07942477
 -0.02107191 -0.07042409 -0.07268834  0.01542195 -0.04603191 -0.05946932
  0.04655478  0.00670137 -0.003092   -0.06045286 -0.05705037  0.04378838
  0.07912513 -0.03156929  0.02904846 -0.03524711 -0.00807807 -0.02808475
 -0.02805975 -0.0021736  -0.06073626  0.04663873 -0.02418008 -0.08485784
  0.02031098 -0.00574332  0.0416776   0.01059347 -0.04028419 -0.03224884
  0.01817176  0.054317   -0.03081239 -0.06092433  0.00980488 -0.09460748
  0.07172652 -0.11109248 -0.00218574 -0.03745284  0.02943208 -0.00417768
 -0.10840914  0.01081005 -0.05826999  0.01585915  0.0171427  -0.03394227
 -0.02427577 -0.04739818  0.00153178  0.01586623 -0.0554506  -0.07791157
 -0.02628656  0.03936552 -0.00325188 -0.06084329 -0.1534984  -0.08339966
  0.00506257 -0.03322032  0.00966031  0.03537968  0.03382335  0.01260717
 -0.03350659  0.03046582  0.06236748  0.03318753 -0.04757497  0.02491214
  0.07317892  0.01342066  0.05721349 -0.01949456  0.11451782 -0.03474231
 -0.04525542  0.05784471  0.02967911 -0.00050992 -0.08027112  0.08595316
  0.0693429  -0.02649714  0.02773468 -0.02683689 -0.02491193  0.03494669
  0.0209149   0.01712708 -0.01435536  0.02850274 -0.01083589  0.03300544
  0.03262713  0.02435686 -0.04906328  0.03847725  0.02315824  0.02112937
 -0.05846664  0.01422625 -0.02060057 -0.11510853  0.05378071 -0.01535542
 -0.02704284 -0.01653615  0.03588494  0.07326718  0.06857118 -0.0049523
 -0.07754862  0.02760466]

  • get similarity score between text1 and text2

from text2vec import Similarity

a = '如何更换花呗绑定银行卡'
b = '花呗更改绑定银行卡'

sim = Similarity()
s = sim.get_score(a, b)
print(s)

output:

0.9519710685638405

  • get text similarity score between query and docs

from text2vec import SearchSimilarity

a = '如何更换花呗绑定银行卡'
b = '花呗更改绑定银行卡'
c = '我什么时候开通了花呗'

corpus = [a, b, c]
print(corpus)
search_sim = SearchSimilarity(corpus=corpus)

print(a, 'scores:', search_sim.get_scores(query=a))
print(a, 'rank similarities:', search_sim.get_similarities(query=a))

output:

['如何更换花呗绑定银行卡', '花呗更改绑定银行卡', '我什么时候开通了花呗']
如何更换花呗绑定银行卡 scores: [ 0.9527457  -0.07449248 -0.03204909]
如何更换花呗绑定银行卡 rank similarities: ['如何更换花呗绑定银行卡', '我什么时候开通了花呗', '花呗更改绑定银行卡']
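
The ranking printed above is consistent with sorting the corpus by descending score. The sketch below reproduces that ordering from the printed scores; it illustrates the relationship between the two outputs and is not the library's internal code.

```python
corpus = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡', '我什么时候开通了花呗']
# Scores copied from the output above, in corpus order.
scores = [0.9527457, -0.07449248, -0.03204909]

# Pair each document with its score and sort by score, highest first.
ranked = [doc for _, doc in sorted(zip(scores, corpus), key=lambda pair: pair[0], reverse=True)]
print(ranked)
```

The third sentence is ranked above the second because its score (-0.03204909) is higher than the second's (-0.07449248).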

Reference

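  1. Representing Sentences as Vectors (Part 1): Unsupervised Sentence Representation Learning (sentence embedding)
  2. Representing Sentences as Vectors (Part 2): Unsupervised Sentence Representation Learning (sentence embedding)
  3. A Simple but Tough-to-Beat Baseline for Sentence Embeddings [Sanjeev Arora, Yingyu Liang, and Tengyu Ma, 2017]
  4. A Comparison of Four Methods for Computing Text Similarity [Yves Peirsman]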
  5. Improvements to BM25 and Language Models Examined