1000-7 / xinlp

Licence: other
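A Java implementation of the algorithms in the later chapters of Li Hang's "Statistical Learning Methods" (《统计学习方法》). It starts with the EM algorithm for the boxes-and-balls problem and extends it to GMM training, then implements HMM word segmentation (including HMM parameter training) and CRF word segmentation (using a parameter model trained with CRF++). Finally, BiLSTM+CRF is implemented with TensorFlow, and an XinAnalyzer is wrapped for Lucene.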

Programming Languages

java
python

xinlp

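While studying "Statistical Learning Methods", I implemented almost everything from the EM algorithm (Chapter 9) through CRF (Chapter 11). Given the current deep learning boom, I also implemented Bi-LSTM+CRF word segmentation.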

2019.03.21

Implemented a simple LDA model, updated iteratively with Gibbs sampling.
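As a rough illustration of what iterative Gibbs updates for LDA look like, here is a minimal collapsed Gibbs sampler in Python. It is a sketch, not the project's code: the function name `lda_gibbs`, the symmetric priors `alpha` and `beta`, and the integer word-id input format are all assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer word ids in
    [0, vocab_size). Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalised conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and put the token back.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word
```

Calling `lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3]], num_topics=2, vocab_size=4)` returns the two count tables. Normalising their rows (after adding `alpha` or `beta`) gives estimates of the per-document topic mixture and the per-topic word distribution.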

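EM and GMM

I started with the EM algorithm and used it to implement a Gaussian mixture model (GMM).
A GMM closely resembles k-means. In my own tests, a case such as male versus female heights proved very hard for a GMM to learn.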
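To make the EM loop behind a GMM concrete, here is a minimal one-dimensional sketch in Python. It is not the project's Java code: the function names, the quantile-based initialisation, and the variance floor are assumptions chosen for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    # Initialise the means at evenly spaced quantiles (an arbitrary choice).
    ordered = sorted(data)
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                 for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])

        # M-step: re-estimate weights, means and variances from the
        # responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component

    return weights, means, variances


# Two well-separated synthetic clusters, for demonstration only.
random.seed(0)
sample = ([random.gauss(0, 1) for _ in range(500)]
          + [random.gauss(10, 1) for _ in range(500)])
weights, means, variances = fit_gmm_1d(sample, k=2, iterations=50)
```

With clusters this far apart the fitted means land close to 0 and 10. When the two components overlap heavily, as with the height data mentioned above, the responsibilities stay close to 0.5 for most points and the estimates become much less reliable.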

Related blog posts

https://www.unclewang.info/learn/machine-learning/730/
https://www.unclewang.info/learn/machine-learning/735/

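My own HMM word segmentation

HMM: the boxes-and-balls problem, with all three basic problems (probability computation, learning, prediction) implemented.
The main idea: once trained parameters are available (here, jieba's segmentation parameters), all that remains is to implement the Viterbi algorithm.
The HMM parameters are taken from the Python jieba segmenter.
I also tried learning the parameters with the Baum-Welch algorithm, but the results were terrible.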
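The Viterbi step described above can be sketched as follows in Python. This is an illustration only: it assumes the common B/M/E/S tagging scheme (begin, middle, end, single-character word), and the start, transition and emission probabilities below are invented toy numbers, not jieba's actual parameters.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")
UNSEEN = math.log(1e-6)  # floor for characters a state has never emitted


def to_log(table):
    """Convert a table of probabilities to log probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Toy parameters, invented for this example.
START = to_log({"B": 0.5, "S": 0.5})  # a sentence can only start with B or S
TRANS = {
    "B": to_log({"M": 0.3, "E": 0.7}),
    "M": to_log({"M": 0.4, "E": 0.6}),
    "E": to_log({"B": 0.5, "S": 0.5}),
    "S": to_log({"B": 0.5, "S": 0.5}),
}
EMIT = {
    "B": to_log({"北": 0.8}),
    "M": {},
    "E": to_log({"京": 0.8}),
    "S": to_log({"我": 0.4, "爱": 0.4}),
}


def viterbi(text):
    """Return the most likely B/M/E/S tag string for text."""
    if not text:
        return ""
    score = [{s: START.get(s, NEG_INF) + EMIT[s].get(text[0], UNSEEN)
              for s in STATES}]
    back = [{}]
    for ch in text[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            best = max(STATES, key=lambda p: prev[p] + TRANS[p].get(s, NEG_INF))
            cur[s] = prev[best] + TRANS[best].get(s, NEG_INF) + EMIT[s].get(ch, UNSEEN)
            ptr[s] = best
        score.append(cur)
        back.append(ptr)

    # A sentence must end on E or S; follow the back-pointers from there.
    tags = [max("ES", key=lambda s: score[-1][s])]
    for ptr in reversed(back[1:]):
        tags.append(ptr[tags[-1]])
    return "".join(reversed(tags))


def segment(text):
    """Cut text into words: a word ends wherever the tag is E or S."""
    words, start = [], 0
    for i, tag in enumerate(viterbi(text)):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words


print(segment("我爱北京"))  # ['我', '爱', '北京']
```

Working in log space avoids underflow on long sentences, and leaving impossible transitions (such as B followed by B) out of `TRANS` keeps the decoded tag sequence well-formed without extra checks.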

Related blog posts

https://www.unclewang.info/learn/machine-learning/745/
https://www.unclewang.info/learn/machine-learning/749/

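My own CRF word segmentation

The CRF implementation follows the approach taken by Ansj and HanLP.
The CRF parameters come from training with CRF++, and segmentation uses those trained parameters.
Defining CRF feature functions by hand is too laborious (it is essentially feature engineering), so the parameter-learning methods are not implemented. Segmentation uses the Viterbi algorithm, learning is delegated to CRF++, and the probability computation, which is similar to the HMM case, is not implemented.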

Related blog posts

https://www.unclewang.info/learn/machine-learning/753/

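My own Bi-LSTM+CRF word segmentation

There are two versions:
The "ugly" version is my first, direct implementation. I had never written much Python before, so the naming is careless and the structure is messy. Whenever I hit something I did not know, I searched Baidu and Bing and built whatever was needed at that moment. Still, the code is compact, has no encapsulation at all, and actually reads quite smoothly.
The non-ugly version is modelled on an excellent GitHub project, guillaumegenthial/sequence_tagging. I rewrote my code following the style of that very complete Python project (many parts are copied directly 😊). The code is clear, each file has its own responsibility, and it does justice to Python (an object-oriented, dynamic, interpreted, strongly typed language).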

Related blog posts

https://www.unclewang.info/learn/machine-learning/756/

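My own Lucene-compatible analyzer: XinAnalyzer

While using Lucene I came across SmartChineseAnalyzer, which supports Chinese word segmentation. The results were mediocre, and it turned out to be using HMM segmentation. My reaction was "oh my god", so I decided to write one myself.
2018.12.11 My own HMM segmenter is supported
2018.12.13 CRF segmentation is supported (over TCP), and BiLSTM+CRF segmentation is supported (over HTTP)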

Related blog posts

https://www.unclewang.info/learn/java/760/

Data used

Link: https://pan.baidu.com/s/1toe-0h4k9Ck_yGs-RwMqAA Password: sn7o
