
Droidtown / Articutapi

License: MIT
API for Articut Chinese word segmentation (with semantic POS tagging). Word segmentation ("斷詞", also called "分詞") is the foundation of Chinese-language processing. Articut uses no machine learning and no data model; relying only on the grammar rules of modern written Chinese, it achieves an F1-measure above 94% and recall above 96% on SIGHAN 2005.



Articut: Chinese Word Segmentation and POS Tagging Service

Chinese word segmentation computed from grammatical structure, not from statistical methods.

Articut API Website

Document

Articut Demo

Benchmark

Design Goals

Name        ArticutAPI       MP_ArticutAPI     WS_ArticutAPI
Product     Online / Docker  Docker            Docker
Technology  HTTP Request     MultiProcessing   WebSocket
Strength    Easy to use      Batch processing  Real-time processing
Use case    Any              Text analysis     Chatbots

Processing Speed

Name  ArticutAPI  MP_ArticutAPI  WS_ArticutAPI
Time  0.1252 s    0.1206 s       0.0677 s

Bulk Text

Sentences  ArticutAPI  MP_ArticutAPI   WS_ArticutAPI
Method     parse()     bulk_parse(20)  parse()
1K         155 s       8 s             18 s
2K         306 s       14 s            35 s
3K         455 s       17 s            43 s
  • Test platform: 4-core CPU, using 4 processes.
  • MP_ArticutAPI uses the bulk_parse(bulkSize=20) method.
  • WS_ArticutAPI uses the parse() method.

ArticutAPI

Installation

pip3 install ArticutAPI

Documentation

See Docs/index.html for the function reference.

Usage

Articut CWS (Chinese word segmentation)

from ArticutAPI import Articut
from pprint import pprint

articut = Articut()
inputSTR = "會被大家盯上,才證明你有實力。"
result = articut.parse(inputSTR)
pprint(result)

Returned result

{"exec_time": 0.06723856925964355,
 "level": "lv2",
 "msg": "Success!",
 "result_pos": ["<MODAL>會</MODAL><ACTION_lightVerb>被</ACTION_lightVerb><ENTITY_nouny>大家</ENTITY_nouny><ACTION_verb>盯上</ACTION_verb>",
                ",",
                "<MODAL>才</MODAL><ACTION_verb>證明</ACTION_verb><ENTITY_pronoun>你</ENTITY_pronoun><ACTION_verb>有</ACTION_verb><ENTITY_noun>實力</ENTITY_noun>",
                "。"],
 "result_segmentation": "會/被/大家/盯上/,/才/證明/你/有/實力/。/",
 "status": True,
 "version": "v118",
 "word_count_balance": 9985,
 "product": "https://api.droidtown.co/product/",
 "document": "https://api.droidtown.co/document/"
}

List the content words of every POS tag in the segmentation result

As needed, you can extract words whose meaning is complete in itself, such as nouns, verbs, or adjectives.

inputSTR = "你計劃過地球人類補完計劃"
result = articut.parse(inputSTR, level="lv1")
pprint(result["result_pos"])

# List all content words.
contentWordLIST = articut.getContentWordLIST(result)
pprint(contentWordLIST)

# List all verbs.
verbStemLIST = articut.getVerbStemLIST(result)
pprint(verbStemLIST)

# List all nouns.
nounStemLIST = articut.getNounStemLIST(result)
pprint(nounStemLIST)

# List all location words (place names).
locationStemLIST = articut.getLocationStemLIST(result)
pprint(locationStemLIST)

Returned result

#result["result_pos"]
["<ENTITY_pronoun>你</ENTITY_pronoun><ACTION_verb>計劃</ACTION_verb><ASPECT>過</ASPECT><LOCATION>地球</LOCATION><ENTITY_oov>人類</ENTITY_oov><ACTION_verb>補完</ACTION_verb><ENTITY_nounHead>計劃</ENTITY_nounHead>"]

# List all content words.
[[(177, 179, "計劃"), (144, 146, "補完"), (116, 118, "人類"), (47, 49, "計劃")]]

# List all verbs.
[[(144, 146, "補完"), (47, 49, "計劃")]]

# List all nouns.
[[(177, 179, "計劃"), (116, 118, "人類")]]

# List all location words (place names).
[[(91, 93, "地球")]]
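Each get*StemLIST helper returns one list per sentence, whose items are (start, end, word) tuples. When only the words themselves are needed, a small helper can flatten the nesting. Note that stem_words is a hypothetical convenience function for illustration, not an ArticutAPI method:

```python
def stem_words(stemLIST):
    """Flatten the per-sentence (start, end, word) tuples returned by the
    get*StemLIST helpers into one plain list of words."""
    return [word for sentence in stemLIST for (start, end, word) in sentence]

# Applied to the noun result shown above:
nounStemLIST = [[(177, 179, "計劃"), (116, 118, "人類")]]
print(stem_words(nounStemLIST))  # ['計劃', '人類']
```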

List available Articut versions

result = articut.versions()
pprint(result)

Returned result

{"msg": "Success!",
 "status": True,
 "versions": [{"level": ["lv1", "lv2"],
               "release_date": "2019-04-25",
               "version": "latest"},
              {"level": ["lv1", "lv2"],
               "release_date": "2019-04-25",
               "version": "v118"},
              {"level": ["lv1", "lv2"],
               "release_date": "2019-04-24",
               "version": "v117"},...
}

Advanced Usage

Advanced usage 01 >> Articut level: segmentation granularity. The smaller the number, the finer the segmentation (default: lv2).

inputSTR = "小紅帽"
result = articut.parse(inputSTR, level="lv1")
pprint(result)

Returned result (lv1)

Finest-grained segmentation, suited to NLU or machine translation. Every element of the sentence is split out as finely as possible.

{"exec_time": 0.04814624786376953,
 "level": "lv1",
 "msg": "Success!",
 "result_pos": ["<MODIFIER>小</MODIFIER><MODIFIER_color>紅</MODIFIER_color><ENTITY_nounHead>帽</ENTITY_nounHead>"],
 "result_segmentation": "小/紅/帽/",
 "status": True,
 "version": "v118",
 "word_count_balance": 9997,...}

Returned result (lv2)

Phrase-level segmentation, suited to text analysis, feature extraction, keyword extraction, and similar applications. Results are presented in the smallest meaningful units.

{"exec_time": 0.04195523262023926,
 "level": "lv2",
 "msg": "Success!",
 "result_pos": ["<ENTITY_nouny>小紅帽</ENTITY_nouny>"],
 "result_segmentation": "小紅帽/",
 "status": True,
 "version": "v118",
 "word_count_balance": 9997,...}
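At either level, result_segmentation is a slash-delimited string. A small helper (illustrative only, not an ArticutAPI method) turns it into a token list, shown here on the lv1 and lv2 outputs above:

```python
def segmentation_to_tokens(result):
    """Split the slash-delimited result_segmentation string into tokens,
    dropping the empty element left by the trailing slash."""
    return [tok for tok in result["result_segmentation"].split("/") if tok]

lv1_result = {"result_segmentation": "小/紅/帽/"}
lv2_result = {"result_segmentation": "小紅帽/"}
print(segmentation_to_tokens(lv1_result))  # ['小', '紅', '帽']
print(segmentation_to_tokens(lv2_result))  # ['小紅帽']
```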

Advanced usage 02 >> UserDefinedDictFile: a user-defined dictionary.

Articut UserDefined Demo

Articut handles only linguistic knowledge, not encyclopedic knowledge, so we provide a user-defined vocabulary feature. Write your own entries in the dictionary format shown below.

UserDefinedFile.json

{"雷姆":["小老婆"],
 "艾蜜莉亞":["大老婆"],
 "初音未來": ["初音", "只是個軟體"],
 "李敏鎬": ["全民歐巴", "歐巴"]}

runArticut.py

from ArticutAPI import Articut
from pprint import pprint

articut = Articut()
userDefined = "./UserDefinedFile.json"
inputSTR = "我的最愛是小老婆,不是初音未來。"

# With the user-defined dictionary
result = articut.parse(inputSTR, userDefinedDictFILE=userDefined)
pprint(result)

# Without the user-defined dictionary
result = articut.parse(inputSTR)
pprint(result)

Returned result

# With the user-defined dictionary
{"result_pos": ["<ENTITY_pronoun>我</ENTITY_pronoun><FUNC_inner>的</FUNC_inner><ACTION_verb>最愛</ACTION_verb><AUX>是</AUX><UserDefined>小老婆</UserDefined>",
                ",",
                "<FUNC_negation>不</FUNC_negation><AUX>是</AUX><UserDefined>初音未來</UserDefined>",
                "。"],
 "result_segmentation": "我/的/最愛/是/小老婆/,/不/是/初音未來/。/",...}

# Without the user-defined dictionary
{"result_pos": ["<ENTITY_pronoun>我</ENTITY_pronoun><FUNC_inner>的</FUNC_inner><ACTION_verb>最愛</ACTION_verb><AUX>是</AUX><ENTITY_nouny>小老婆</ENTITY_nouny>",
                ",",
                "<FUNC_negation>不</FUNC_negation><AUX>是</AUX><ENTITY_nouny>初音</ENTITY_nouny><TIME_justtime>未來</TIME_justtime>",
                "。"],
 "result_segmentation": "我/的/最愛/是/小老婆/,/不/是/初音/未來/。/",...}

Advanced usage 03 - Querying the tourism open-data database

The government open-data platform hosts "geolocated tourism information published by government agencies and collected by the Tourism Bureau, MOTC". Articut can draw on this data and tag matching spans as <KNOWLEDGE_place>.

Request payload (JSON format)

{
	"username": "[email protected]",
	"api_key": "[email protected]",
	"input_str": "花蓮的原野牧場有一間餐廳",
	"version": "v137",
	"level": "lv1",
	"opendata_place": true
}
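A minimal sketch of posting this payload using only the Python standard library. The exact endpoint URL is an assumption based on the api.droidtown.co domain shown elsewhere in this document (verify it against the official documentation), and the empty username/api_key values are placeholders for your own credentials:

```python
import json
import urllib.request

# Assumed endpoint; verify against the official documentation.
ARTICUT_URL = "https://api.droidtown.co/Articut/API/"

payload = {
    "username": "",                      # placeholder: your account e-mail
    "api_key": "",                       # placeholder: your API key
    "input_str": "花蓮的原野牧場有一間餐廳",
    "version": "v137",
    "level": "lv1",
    "opendata_place": True,              # enable <KNOWLEDGE_place> tagging
}

def call_articut(data, url=ARTICUT_URL):
    """POST the JSON payload and return the decoded response dict."""
    req = urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# result = call_articut(payload)  # requires network access and valid credentials
```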

Response (JSON format)

{
	"exec_time": 0.013453006744384766,
	"level": "lv1",
	"msg": "Success!",
	"result_pos": ["<LOCATION>花蓮</LOCATION><FUNC_inner>的</FUNC_inner><KNOWLEDGE_place>原野牧場</KNOWLEDGE_place><ACTION_verb>有</ACTION_verb><ENTITY_classifier>一間</ENTITY_classifier><ENTITY_noun>餐廳</ENTITY_noun>"],
	"result_segmentation": "花蓮/的/原野牧場/有/一間/餐廳/",
	"status": true,
	"version": "v137",
	"word_count_balance": 99987
}

Advanced usage 04 - TF-IDF-based keyword extraction

  • articut.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    • sentence: the text to extract keywords from
    • topK: the number of TF-IDF keywords to return (default: 20)
    • withWeight: whether to return each keyword's weight (default: False)
    • allowPOS: extract only words with the given POS tags; default empty, i.e. no filtering
  • articut.analyse.TFIDF(idf_path=None) creates a new TFIDF object; idf_path is the path to an IDF corpus

Usage example: https://github.com/Droidtown/ArticutAPI/blob/master/ArticutAPI.py#L624
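Independent of ArticutAPI's internal implementation, the TF-IDF idea behind extract_tags can be sketched locally over already-segmented documents. The function name and the tiny corpus here are purely illustrative:

```python
import math
from collections import Counter

def tfidf_top_k(docs, doc_index, k=3):
    """Score the words of docs[doc_index] by TF-IDF against the whole corpus
    (docs is a list of already-segmented documents) and return the top k."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))      # document frequency
    tf = Counter(docs[doc_index])                                # term frequency
    total = len(docs[doc_index])
    scores = {w: (c / total) * math.log((n + 1) / (df[w] + 1))   # smoothed IDF
              for w, c in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = [
    ["會", "被", "大家", "盯上"],
    ["大家", "證明", "實力"],
    ["證明", "你", "有", "實力"],
]
# Words unique to doc 0 outrank "大家", which also appears in doc 1:
print(tfidf_top_k(docs, 0, k=2))
```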


Advanced usage 05 - TextRank-based keyword extraction

  • articut.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=())
    • sentence: the text to extract keywords from
    • topK: the number of keywords to return (default: 20)
    • withWeight: whether to return each keyword's weight (default: False)
    • allowPOS: extract only words with the given POS tags; default empty, i.e. no filtering
  • articut.analyse.TextRank() creates a new TextRank object

Algorithm paper: TextRank: Bringing Order into Texts

Basic idea:

  1. Segment the text from which keywords are to be extracted
  2. Build an unweighted graph from the co-occurrence relations between words within a fixed window (default: 5, adjustable via the span attribute)
  3. Compute PageRank for the nodes in the graph

Usage example: https://github.com/Droidtown/ArticutAPI/blob/master/ArticutAPI.py#L629
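The three steps above can be sketched as a self-contained toy, independent of ArticutAPI's actual implementation. The function name and parameters are illustrative, and the input is assumed to be an already-segmented word list (step 1, e.g. via articut.parse()):

```python
from collections import defaultdict

def textrank_keywords(words, span=5, topK=3, iters=30, d=0.85):
    """Step 2: build an unweighted co-occurrence graph over a fixed window;
    step 3: run PageRank on it and return the topK highest-ranked words."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1 : i + span]:   # co-occurrence within the window
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    rank = {w: 1.0 for w in graph}
    for _ in range(iters):                  # plain PageRank iteration
        rank = {w: (1 - d) + d * sum(rank[u] / len(graph[u]) for u in graph[w])
                for w in graph}
    return sorted(rank, key=rank.get, reverse=True)[:topK]

# Pre-segmented input words:
words = ["才", "證明", "你", "有", "實力", "證明", "實力"]
print(textrank_keywords(words, span=5, topK=2))
```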


Advanced usage 06 - Querying segmentation results with GraphQL

Watch the video

Using the GraphiQL tool

Requirements

Python 3.6.1
$ pip install graphene
$ pip install starlette
$ pip install jinja2
$ pip install uvicorn

Run ArticutGraphQL.py with the path to an Articut result file, then open http://0.0.0.0:8000/ in your browser.

$ python ArticutGraphQL.py articutResult.json

Usage example 01

GraphiQL Example 01

Usage example 02

GraphiQL Example 02

Using Articut-GraphQL

Install the graphene module

$ pip install graphene

Usage example 01

inputSTR = "地址:宜蘭縣宜蘭市縣政北七路六段55巷1號2樓"
result = articut.parse(inputSTR)
with open("articutResult.json", "w", encoding="utf-8") as resultFile:
    json.dump(result, resultFile, ensure_ascii=False)
	
graphQLResult = articut.graphQL.query(
    filePath="articutResult.json",
    query="""
	{
	  meta {
	    lang
	    description
	  }
	  doc {
	    text
	    tokens {
	      text
	      pos_
	      tag_
	      isStop
	      isEntity
	      isVerb
	      isTime
	      isClause
	      isKnowledge
	    }
	  }
	}""")
pprint(graphQLResult)

Returned result

Articut-GraphQL Example 01

Usage example 02

inputSTR = "劉克襄在本次活動當中,分享了台北中山北路一日遊路線。他表示當初自己領著柯文哲一同探索了雙連市場與中山捷運站的小吃與商圈,還有商圈內的文創商店與日系雜物店鋪,都令柯文哲留下深刻的印象。劉克襄也認為,雙連市場內的魯肉飯、圓仔湯與切仔麵,還有九條通的日式店家、居酒屋等特色,也能讓人感受到台北舊城區不一樣的魅力。"
result = articut.parse(inputSTR)
with open("articutResult.json", "w", encoding="utf-8") as resultFile:
    json.dump(result, resultFile, ensure_ascii=False)
	
graphQLResult = articut.graphQL.query(
    filePath="articutResult.json",
    query="""
	{
	  meta {
	    lang
	    description
	  }
	  doc {
	    text
	    ents {
	      persons {
	        text
	        pos_
	        tag_
	      }
	    }
	  }
	}""")
pprint(graphQLResult)

Returned result

Articut-GraphQL returned result 02
