znwang25 / fuzzychinese

License: BSD-3-Clause
A small package for fuzzy matching of Chinese words

Programming Languages

python

Projects that are alternatives to or similar to fuzzychinese

Rasa nlu chi
Turn Chinese natural language into structured data (Chinese natural language understanding)
Stars: ✭ 1,166 (+2232%)
Mutual labels:  natural-language, chinese
s3-concat
Concatenate Amazon S3 files remotely using flexible patterns
Stars: ✭ 32 (-36%)
Mutual labels:  text-processing
fuzzywuzzy
Fuzzy string matching for PHP
Stars: ✭ 60 (+20%)
Mutual labels:  fuzzy-matching
linguistic-datasets-portuguese
Linguistic Datasets for Portuguese: a list of flexibly licensed linguistic datasets for the Portuguese language, covering databases, word lists, synonyms, antonyms, thematic dictionaries, thesauri, linked data, semantics, ontology, and knowledge representation
Stars: ✭ 46 (-8%)
Mutual labels:  natural-language
dif
'dif' is a Linux preprocessing front end to gvimdiff/meld/kompare
Stars: ✭ 18 (-64%)
Mutual labels:  text-processing
fountain
Natural Language Data Augmentation Tool for Conversational Systems
Stars: ✭ 113 (+126%)
Mutual labels:  natural-language
corpusexplorer2.0
Corpus linguistics has never been so easy...
Stars: ✭ 16 (-68%)
Mutual labels:  text-processing
teanaps
An open-source Python library for natural language processing and text analysis.
Stars: ✭ 91 (+82%)
Mutual labels:  text-processing
PHP-Chinese
PHP Chinese Conversion (Traditional/Simplified Chinese conversion)
Stars: ✭ 37 (-26%)
Mutual labels:  chinese
deep ethereum
E-book: Ethereum technology and implementation
Stars: ✭ 304 (+508%)
Mutual labels:  chinese
stable-baselines-zh
Chinese translation of the official Stable Baselines documentation
Stars: ✭ 75 (+50%)
Mutual labels:  chinese
predict Lottery ticket
AI prediction for the Double Color Ball and Super Lotto lotteries
Stars: ✭ 341 (+582%)
Mutual labels:  chinese
Yoyo-leaf
Yoyo-leaf is an awesome command-line fuzzy finder.
Stars: ✭ 49 (-2%)
Mutual labels:  fuzzy-matching
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (+84%)
Mutual labels:  chinese
flask-docs-zh
Simplified Chinese translation of the Flask documentation
Stars: ✭ 93 (+86%)
Mutual labels:  chinese
NLP Quickbook
NLP in Python with Deep Learning
Stars: ✭ 516 (+932%)
Mutual labels:  natural-language
estratto
parsing fixed width files content made easy
Stars: ✭ 12 (-76%)
Mutual labels:  text-processing
ChineseNames
🀄 Chinese Name Database (1930-2008)
Stars: ✭ 99 (+98%)
Mutual labels:  chinese
Emotion-recognition-from-tweets
A comprehensive approach on recognizing emotion (sentiment) from a certain tweet. Supervised machine learning.
Stars: ✭ 17 (-66%)
Mutual labels:  text-processing
text2video
Text to Video Generation Problem
Stars: ✭ 28 (-44%)
Mutual labels:  text-processing

fuzzychinese

Fuzzy matching for visually similar Chinese words

A simple tool for fuzzy matching Chinese words, particularly useful for matching proper names and addresses.


Installation

pip install fuzzychinese

Quickstart

First, fit a model on the target list of words you want to match against.

Then use FuzzyChineseMatch.transform(raw_words, n) to find the n most similar target words for each entry in raw_words.

There are three analyzers to choose from when training a model: stroke, radical, and char. You can also change ngram_range to fine-tune the model.
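To make the role of ngram_range more concrete, here is a minimal sketch of n-gram extraction over a stroke sequence. It is not taken from fuzzychinese: the helper name ngrams is hypothetical, and the assumption that ngram_range is an inclusive (min_n, max_n) pair is mine. The stroke string is the one shown for "像" in the Stroke example in this README.

```python
def ngrams(sequence, ngram_range=(3, 3)):
    """Return every contiguous n-gram of `sequence` for n in ngram_range.

    Assumption: ngram_range is an inclusive (min_n, max_n) pair.
    """
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of width n across the sequence.
        for start in range(len(sequence) - n + 1):
            grams.append(sequence[start:start + n])
    return grams


# Stroke sequence for "像", copied from the Stroke example in this README.
strokes_xiang = "㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏"

trigrams = ngrams(strokes_xiang, ngram_range=(3, 3))
print(len(trigrams), trigrams[:3])
```

A wider range such as (2, 3) would add the shorter bigrams to the same feature list, which makes features more numerous but less specific.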

After matching, transform returns the matched words; the similarity scores and the indices of the matches in the target list are available through get_similarity_score() and get_index().

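    import pandas as pd
    from fuzzychinese import FuzzyChineseMatch

    # Target words to match against, and the raw inputs to look up.
    test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市'])
    raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
    assert '汩罗市' != '汨罗市'  # Visually similar, but not the same string!

    # Fit on the target words using stroke-level 3-grams.
    fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
    fcm.fit(test_dict)
    # For each raw word, find its 2 most similar target words.
    top2_similar = fcm.transform(raw_word, n=2)
    # Combine inputs, matches, scores, and target indices into one table.
    res = pd.concat([
        raw_word,
        pd.DataFrame(top2_similar, columns=['top1', 'top2']),
        pd.DataFrame(fcm.get_similarity_score(), columns=['top1_score', 'top2_score']),
        pd.DataFrame(fcm.get_index(), columns=['top1_index', 'top2_index']),
    ], axis=1)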
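
The resulting res (the first column is the input word from raw_word):

| raw_word | top1 | top2 | top1_score | top2_score | top1_index | top2_index |
|---|---|---|---|---|---|---|
| 达茂联合旗 | 达尔罕茂明安联合旗 | 长白朝鲜族自治县 | 0.824751 | 0.287237 | 3 | 0 |
| 长阳县 | 长阳土家族自治县 | 长白朝鲜族自治县 | 0.610285 | 0.475000 | 1 | 0 |
| 汩罗市 | 汨罗市 | 长白朝鲜族自治县 | 1.000000 | 0.152093 | 4 | 0 |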
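
As a rough illustration of what top-n matching involves, the sketch below ranks targets by cosine similarity over character n-gram counts. This is not fuzzychinese's implementation: the helper names (char_ngrams, cosine, top_n) are hypothetical, it uses only the standard library, and it works on whole characters rather than strokes or radicals.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, ngram_range=(1, 2)):
    """Count the character n-grams of `word` (inclusive range, an assumption)."""
    low, high = ngram_range
    return Counter(
        word[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(word) - n + 1)
    )


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def top_n(query, targets, n=2, ngram_range=(1, 2)):
    """Return the n best (target, score, index) triples for `query`."""
    q = char_ngrams(query, ngram_range)
    scored = [
        (cosine(q, char_ngrams(t, ngram_range)), i, t)
        for i, t in enumerate(targets)
    ]
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(t, round(s, 3), i) for s, i, t in scored[:n]]


targets = ['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市']
raw_words = ['达茂联合旗', '长阳县', '汩罗市']

for word in raw_words:
    print(word, top_n(word, targets, n=2))
```

With whole characters, 汩罗市 and 汨罗市 share only 罗, 市 and 罗市, so this sketch scores the pair 0.6, whereas the stroke analyzer in the table above reports 1.000000 for the same pair. That gap is the motivation for decomposing characters before building n-grams.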

Other uses

  • Use Stroke and Radical directly to decompose Chinese characters into strokes or radicals.

    from fuzzychinese import Stroke, Radical

    stroke = Stroke()
    radical = Radical()
    print("像", stroke.get_stroke("像"))
    # 像 ㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏
    print("像", radical.get_radical("像"))
    # 像 人象
    
  • Use FuzzyChineseMatch.compare_two_columns(X, Y) to compare the pair of words in each row and get a similarity score for each pair.

  • See documentation for details.
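
The row-wise comparison described for compare_two_columns can be pictured with the sketch below. It only illustrates the shape of the operation (one score per row pair) and does not reproduce the library's scoring: compare_pairs is a hypothetical helper, and difflib.SequenceMatcher from the standard library is swapped in as a stand-in similarity measure.

```python
from difflib import SequenceMatcher


def compare_pairs(left, right):
    """Return one similarity score in [0, 1] per (left[i], right[i]) pair.

    Stand-in scoring: SequenceMatcher.ratio(), not fuzzychinese's measure.
    """
    if len(left) != len(right):
        raise ValueError("both columns must have the same length")
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(left, right)]


column_x = ['达茂联合旗', '长阳县', '汩罗市']
column_y = ['达尔罕茂明安联合旗', '长阳土家族自治县', '汨罗市']

for a, b, score in zip(column_x, column_y, compare_pairs(column_x, column_y)):
    print(a, b, round(score, 3))
```

Unlike transform, which searches the whole target list, this kind of comparison assumes the two columns are already aligned row by row.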

Credits

Data for Chinese radicals are from 漢語拆字字典 by 開放詞典網.
