cjymz886 / find-Chinese-medical-words

License: MIT
Topics: new word discovery, unsupervised lexicon generation, medical lexicon generation, out-of-vocabulary word discovery

find-Chinese-medical-words

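This project uses an improved unsupervised method to find the words present in a medical corpus crawled from the web.
The method combines mutual information entropy, forward maximum matching, and search engine lookups, and iterates over them to discover words.
The corpus is not restricted to any domain; this experiment uses text from the medical domain.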

Environment

python2/3
requests
lxml

Method

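step1: Count the frequencies of single characters and two-character strings in the corpus, together with information about the characters that precede and follow them.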

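step2: Score the single-character and two-character statistics with mutual information entropy, and add every string that scores above the threshold K=10.8 to the lexicon. This forms the initial lexicon.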
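To make step1 and step2 concrete, here is a minimal sketch of counting characters and scoring two-character strings. It assumes plain pointwise mutual information as the score, which may differ from the formula this project uses (the formula is given only as an image), so the threshold K=10.8 is not expected to carry over. It also leaves out the information about neighboring characters mentioned in step1. The function names `count_ngrams`, `pmi` and `initial_lexicon` are hypothetical and not taken from medfw.py.

```python
import math
from collections import Counter


def count_ngrams(lines):
    """Step 1 (simplified): count single characters and adjacent pairs."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    return unigrams, bigrams


def pmi(pair, unigrams, bigrams, n_uni, n_bi):
    """Pointwise mutual information of a two-character string, in bits."""
    p_ab = bigrams[pair] / n_bi
    p_a = unigrams[pair[0]] / n_uni
    p_b = unigrams[pair[1]] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def initial_lexicon(lines, k):
    """Step 2 (simplified): keep pairs whose score exceeds the threshold k."""
    unigrams, bigrams = count_ngrams(lines)
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {w for w in bigrams if pmi(w, unigrams, bigrams, n_uni, n_bi) > k}
```

On a toy corpus such as `["心悸", "心悸", "的的的的的的"]`, the pair 心悸 scores higher than 的的 because its characters rarely occur apart, which is the property the threshold is meant to exploit.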

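step3: With the initial lexicon in place, segment the corpus using forward maximum matching, output the resulting strings sorted by frequency, and record how many there are as seg_num.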
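The segmentation in step3 can be sketched as follows. This is an illustration under one assumption that the text does not confirm: the strings ranked by frequency are taken to be the runs of characters left over between lexicon matches. The names `segment` and `rank_candidates` are hypothetical.

```python
from collections import Counter


def segment(text, lexicon):
    """Forward maximum matching.

    Returns (known, unknown_runs): lexicon words found in order, and the
    runs of consecutive characters that matched no lexicon word.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    known, unknown_runs = [], []
    run = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                match = piece
                break
        if match:
            if run:
                unknown_runs.append(run)
                run = ""
            known.append(match)
            i += len(match)
        else:
            run += text[i]
            i += 1
    if run:
        unknown_runs.append(run)
    return known, unknown_runs


def rank_candidates(lines, lexicon):
    """Count leftover runs over the corpus; return (ranked, seg_num)."""
    counts = Counter()
    for line in lines:
        counts.update(segment(line, lexicon)[1])
    ranked = counts.most_common()
    return ranked, len(ranked)
```

With the lexicon `{"患者", "心悸"}`, the sentence `"患者出现心悸症状"` yields the known words 患者 and 心悸 and the leftover runs 出现 and 症状, which then become candidates for the search step.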

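step4: Take the top H=2000 strings by frequency and query each one in a search engine (Baidu). If a string is an entry in Baidu Baike (百度百科), add it to the lexicon as a word. A string is also added if it appears more than R=60 times in the text of the search results page.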
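The acceptance rule in step4 can be written as a small pure function. The sketch below deliberately leaves out the network request and the page parsing: `lookup` is a hypothetical callable that the caller supplies, and `SearchResult`, `accept_candidate` and `search_round` are illustrative names rather than parts of this project.

```python
from typing import Callable, List, NamedTuple


class SearchResult(NamedTuple):
    """What a lookup is assumed to return for one candidate string."""
    is_baike_entry: bool  # True if the string is a Baidu Baike entry
    page_text: str        # visible text of the search results page


def accept_candidate(candidate: str, result: SearchResult, r: int = 60) -> bool:
    """Accept if it is a Baike entry or occurs more than r times in the page."""
    return result.is_baike_entry or result.page_text.count(candidate) > r


def search_round(
    ranked: List[str],
    lookup: Callable[[str], SearchResult],
    h: int = 2000,
    r: int = 60,
) -> List[str]:
    """Check the top h candidates and return those accepted as words."""
    return [c for c in ranked[:h] if accept_candidate(c, lookup(c), r)]
```

Keeping the lookup behind a callable makes the rule testable offline with canned results, and lets the real implementation (for example one built on requests and lxml, which the environment section lists) be swapped in separately.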

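step5: After updating the lexicon, repeat step3 and step4. Iteration ends when searh_num=0. When seg_num falls below the configured Y=5000, run step4 one last time with H set to H=seg_num, then stop. The final lexicon contains the words found by this program.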
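The loop and its two stopping conditions from step5 can be sketched as below. It assumes that the first stop condition means a round in which the search step accepts no new words, which is an interpretation of the text. `rank_fn` and `search_fn` are hypothetical callables standing in for step3 and step4, and `max_rounds` is a safety guard added for this sketch only.

```python
def iterate(lexicon, rank_fn, search_fn, h=2000, y=5000, max_rounds=100):
    """Repeat step3 and step4 until one of the stop conditions holds.

    rank_fn(lexicon)      -> candidate strings, most frequent first (step3)
    search_fn(candidates) -> the candidates accepted as words (step4)
    """
    lexicon = set(lexicon)
    for _ in range(max_rounds):
        ranked = rank_fn(lexicon)
        seg_num = len(ranked)
        # Below y candidates: check all of them once, then stop.
        final_round = seg_num < y
        limit = seg_num if final_round else h
        new_words = search_fn(ranked[:limit])
        lexicon.update(new_words)
        if final_round or not new_words:
            break
    return lexicon
```

Because both steps are passed in as functions, the control flow can be exercised with toy data before wiring in real segmentation and search.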

Flowchart

[image: flowchart]

Algorithm

[images: three figures describing the algorithm]

Run

python medfw.py
The parameters involved can be adjusted to suit your environment.

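Results

The final lexicon is written to ./data/dict.txt; the ./data directory holds the corpus and the intermediate data produced by the program.
In this experiment, about 50M of medical-domain text was used; after 9 iterations, 4967 words were found.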

Sample results

惶惶 org
爷爷 org
曼佗 org
垮垮 org
萧轼 org
艇舰 org
蝰蛇 org
攸琐 org
咔嚓 org
喀嚓 org
铒翠 org
诚挚 org
迪厅 org
不足 iter_0
知情同意书 iter_0
运动 iter_0
状态 iter_0
瘢痕 iter_0
心悸 iter_0
步态 iter_0
祸首 iter_0
照相 iter_0
形成 iter_0
面容 iter_0
先天 iter_0
动作 iter_0
由于 iter_0
价格 iter_0
行为 iter_0
淋病 iter_0
包括 iter_0
栓塞 iter_0
球感 iter_0
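Each sample line pairs a word with a tag. The tags presumably mark where the word came from (`org` for the initial lexicon, `iter_0` for the first iteration), though the source does not say so. Assuming ./data/dict.txt uses the same whitespace-separated layout, a minimal sketch for grouping words by tag could look like this; `group_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_source(lines):
    """Group 'word tag' lines by tag; skip lines not in that shape."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, tag = parts
        groups[tag].append(word)
    return dict(groups)
```

For a file on disk, the lines can be passed in directly, for example `group_by_source(open("./data/dict.txt", encoding="utf-8"))`, provided the file really follows this layout and encoding.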
