lizhichao / Vicword

Licence: apache-2.0
A word segmenter written in pure PHP

VicWord: a word segmenter written in pure PHP

QQ discussion group: 731475644

Installation

composer require lizhichao/word

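Segmentation overview

  • Three segmentation methods are provided
    • getWord: length-first segmentation. The fastest.
    • getShortWord: fine-grained segmentation. Slightly slower than getWord.
    • getAutoWord: automatic segmentation. Gives the best results.
  • The dictionary is customizable: you can add your own words to it. It supports a text format (json) and a binary format (igb); the binary dictionary is smaller and loads faster.
  • dict.igb contains 175,662 words. Contributions of new words to dict.txt are welcome, in the format (word \t idf \t part of speech)
    • To obtain the idf: search for the word on Baidu and compute Math.log(100000001 / number of results). If you have a better method, contributions are welcome.
    • Part of speech: [punctuation, noun, verb, adjective, distinguishing word, pronoun, numeral, measure word, adverb, preposition, conjunction, particle, modal particle, onomatopoeia, interjection]. Use the index into this list; punctuation is 0.
  • Comparison of the three segmentation results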
$fc = new VicWord();
$arr = $fc->getWord('北京大学生喝进口红酒,在北京大学生活区喝进口红酒');
//北京大学|生喝|进口|红酒|,|在|北京大学|生活区|喝|进口|红酒
// $arr is an array; each element has the structure [word, position of the word, part of speech, whether the word is in the dictionary]. Only the words are listed here.

$arr = $fc->getShortWord('北京大学生喝进口红酒,在北京大学生活区喝进口红酒');
//北京|大学|生喝|进口|红酒|,|在|北京|大学|生活|区喝|进口|红酒

$arr = $fc->getAutoWord('北京大学生喝进口红酒,在北京大学生活区喝进口红酒');
//北京|大学生|喝|进口|红酒|,|在|北京大学|生活区|喝|进口|红酒

// For comparison
// QQ's segmenter: http://nlp.qq.com/semantic.cgi#page2
// Baidu's segmenter: http://ai.baidu.com/tech/nlp/lexical
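
The result array described in the comments above holds one [word, position, part of speech, in-dictionary flag] entry per token. The following is a minimal sketch of how such a result could be reduced to the pipe-separated form shown in those comments. `joinWords` is a hypothetical helper, not part of VicWord, and the sample data is hand-written: the positions, part-of-speech values and flags are assumed for illustration and are not real library output.

```php
<?php
// Hypothetical helper, not part of VicWord.
// Expects a segmentation result in which each element is
// [word, position, part of speech, in-dictionary flag]
// and joins the words with the given separator.
function joinWords(array $result, string $separator = '|'): string
{
    // Index 0 of every element holds the word itself.
    return implode($separator, array_column($result, 0));
}

// Hand-written sample in the documented shape; all values besides
// the words are illustrative assumptions.
$sample = [
    ['北京', 0, 1, true],
    ['大学生', 2, 1, true],
    ['喝', 5, 2, true],
];

echo joinWords($sample), PHP_EOL;
```

With a real result, the same call would presumably be `joinWords($arr)`.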

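Segmentation speed

Machine: Alibaba Cloud, Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
getWord: 1.4 million characters per second
getShortWord: 1.38 million characters per second
getAutoWord: 400,000 characters per second
Test text: a 5,000-character passage copied from Baidu Baike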
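
The README does not say how these figures were measured. As a rough sketch of one way to estimate characters per second on your own machine, the snippet below times repeated calls to a segmentation callable. `charsPerSecond` and `$standIn` are hypothetical names; the stand-in segmenter only splits text into single characters so that the example runs without the library, which means its numbers say nothing about VicWord itself. It assumes PHP 7.3+ (for `hrtime`) and the mbstring extension.

```php
<?php
// Hypothetical benchmark helper, not part of VicWord.
// Runs $segment on $text $rounds times and returns characters per second.
function charsPerSecond(callable $segment, string $text, int $rounds = 20): float
{
    // Count characters, not bytes, so multi-byte UTF-8 text is measured fairly.
    $chars = mb_strlen($text, 'UTF-8') * $rounds;

    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $rounds; $i++) {
        $segment($text);
    }
    $elapsed = (hrtime(true) - $start) / 1e9; // seconds

    // Guard against a zero interval on very short inputs.
    return $chars / max($elapsed, 1e-9);
}

// Stand-in segmenter (an assumption for illustration): one token per character.
$standIn = function (string $text): array {
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
};

// A 10-character phrase repeated 500 times gives 5,000 characters.
$text = str_repeat('北京大学生喝进口红酒', 500);

printf("%.0f characters per second\n", charsPerSecond($standIn, $text));
```

To time the library itself, passing `[$fc, 'getWord']` in place of `$standIn` should work, assuming `$fc` is the VicWord instance from the earlier example.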

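Building a dictionary

  • The dictionary supports any UTF-8 characters
  • Dictionary size does not affect segmentation speed

There is only one method: VicDict->add(word, part of speech = null)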

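require __DIR__.'/Lib/VicDict.php';

// Two dictionary formats are currently supported: igb and json. igb requires the igbinary extension; igb files are smaller and load faster.
$path = ''; // path to the dictionary file
$dict = new VicDict($path);

// Add a word to the dictionary: add(word, part of speech). Not language-specific; any UTF-8 characters are accepted.
$dict->add('中国','n');

// Save the dictionary
$dict->save();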
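
For contributions to dict.txt, the README gives the line format (word \t idf \t part of speech), the idf formula Math.log(100000001 / number of results), and the rule that the part of speech is an index into a fixed list. The sketch below puts these together. `idfFromResultCount`, `dictLine` and `POS_LABELS` are hypothetical names, the English labels are translations of the README's list, the four-decimal rounding is an assumption (the README does not state a precision), and the result count in the example is made up.

```php
<?php
// Part-of-speech list in the order given by the README (translated);
// the index in this list is the value written to dict.txt.
const POS_LABELS = [
    'punctuation', 'noun', 'verb', 'adjective', 'distinguishing word',
    'pronoun', 'numeral', 'measure word', 'adverb', 'preposition',
    'conjunction', 'particle', 'modal particle', 'onomatopoeia', 'interjection',
];

// idf as described in the README: natural log of 100000001 / result count.
// PHP's log() is the natural logarithm, like JavaScript's Math.log.
function idfFromResultCount(int $resultCount): float
{
    if ($resultCount < 1) {
        throw new InvalidArgumentException('result count must be positive');
    }
    return log(100000001 / $resultCount);
}

// Builds one tab-separated dict.txt line: word, idf, part-of-speech index.
// The four-decimal precision is an assumption, not taken from the README.
function dictLine(string $word, int $resultCount, string $pos): string
{
    $index = array_search($pos, POS_LABELS, true);
    if ($index === false) {
        throw new InvalidArgumentException("unknown part of speech: $pos");
    }
    return sprintf("%s\t%.4f\t%d", $word, idfFromResultCount($resultCount), $index);
}

// The result count here is invented purely for illustration.
echo dictLine('中国', 1000000, 'noun'), PHP_EOL;
```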

demo


Other software by the author
