
Droidtown / Articutapi

License: MIT
API for Articut Chinese word segmentation (with semantic POS tagging). Word segmentation ("斷詞", also called "分詞") is the foundation of Chinese-language processing. Articut uses no machine learning and no data model; relying only on the grammar rules of modern written Chinese, it achieves an F1-measure above 94% and recall above 96% on SIGHAN 2005.



Articut: Chinese Word Segmentation and POS Tagging Service

Chinese word segmentation computed from grammatical structure, not from statistical methods.

Articut API Website

Document

Articut Demo

Benchmark

Design Goals

Name        ArticutAPI       MP_ArticutAPI     WS_ArticutAPI
Product     Online / Docker  Docker            Docker
Technology  HTTP Request     MultiProcessing   WebSocket
Strength    Easy to use      Batch processing  Real-time processing
Use case    Any              Text analysis     Chatbots

Processing Speed

Name  ArticutAPI  MP_ArticutAPI  WS_ArticutAPI
Time  0.1252 s    0.1206 s       0.0677 s

Bulk Text

Sentences  ArticutAPI  MP_ArticutAPI   WS_ArticutAPI
Method     parse()     bulk_parse(20)  parse()
1K         155 s       8 s             18 s
2K         306 s       14 s            35 s
3K         455 s       17 s            43 s
  • Test platform: 4-core CPU, using 4 processes.
  • MP_ArticutAPI uses the bulk_parse(bulkSize=20) method.
  • WS_ArticutAPI uses the parse() method.

ArticutAPI

Installation

pip3 install ArticutAPI

Documentation

See Docs/index.html for the function reference.

Usage

Articut CWS (Chinese word segmentation)

from ArticutAPI import Articut
from pprint import pprint

articut = Articut()
inputSTR = "會被大家盯上,才證明你有實力。"
result = articut.parse(inputSTR)
pprint(result)

Returned result

{"exec_time": 0.06723856925964355,
 "level": "lv2",
 "msg": "Success!",
 "result_pos": ["<MODAL>會</MODAL><ACTION_lightVerb>被</ACTION_lightVerb><ENTITY_nouny>大家</ENTITY_nouny><ACTION_verb>盯上</ACTION_verb>",
                ",",
                "<MODAL>才</MODAL><ACTION_verb>證明</ACTION_verb><ENTITY_pronoun>你</ENTITY_pronoun><ACTION_verb>有</ACTION_verb><ENTITY_noun>實力</ENTITY_noun>",
                "。"],
 "result_segmentation": "會/被/大家/盯上/,/才/證明/你/有/實力/。/",
 "status": True,
 "version": "v118",
 "word_count_balance": 9985,
 "product": "https://api.droidtown.co/product/",
 "document": "https://api.droidtown.co/document/"
}

List the content words of every POS tag in the segmentation result

As needed, you can extract words whose meaning is complete in itself, such as nouns, verbs, or adjectives.

inputSTR = "你計劃過地球人類補完計劃"
result = articut.parse(inputSTR, level="lv1")
pprint(result["result_pos"])

# List all content words.
contentWordLIST = articut.getContentWordLIST(result)
pprint(contentWordLIST)

# List all verbs.
verbStemLIST = articut.getVerbStemLIST(result)
pprint(verbStemLIST)

# List all nouns.
nounStemLIST = articut.getNounStemLIST(result)
pprint(nounStemLIST)

# List all location words (place names).
locationStemLIST = articut.getLocationStemLIST(result)
pprint(locationStemLIST)

Returned result

#result["result_pos"]
["<ENTITY_pronoun>你</ENTITY_pronoun><ACTION_verb>計劃</ACTION_verb><ASPECT>過</ASPECT><LOCATION>地球</LOCATION><ENTITY_oov>人類</ENTITY_oov><ACTION_verb>補完</ACTION_verb><ENTITY_nounHead>計劃</ENTITY_nounHead>"]

# List all content words.
[[(177, 179, "計劃"), (144, 146, "補完"), (116, 118, "人類"), (47, 49, "計劃")]]

# List all verbs.
[[(144, 146, "補完"), (47, 49, "計劃")]]

# List all nouns.
[[(177, 179, "計劃"), (116, 118, "人類")]]

# List all location words (place names).
[[(91, 93, "地球")]]
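Each get*StemLIST helper returns one list per sentence, whose items are (start, end, word) tuples. When only the words themselves are needed, a small helper can flatten the nesting. Note that stem_words is a hypothetical convenience function for illustration, not an ArticutAPI method:

```python
def stem_words(stemLIST):
    """Flatten the per-sentence (start, end, word) tuples returned by the
    get*StemLIST helpers into one plain list of words."""
    return [word for sentence in stemLIST for (start, end, word) in sentence]

# Applied to the noun result shown above:
nounStemLIST = [[(177, 179, "計劃"), (116, 118, "人類")]]
print(stem_words(nounStemLIST))  # ['計劃', '人類']
```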

List available Articut versions

result = articut.versions()
pprint(result)

Returned result

{"msg": "Success!",
 "status": True,
 "versions": [{"level": ["lv1", "lv2"],
               "release_date": "2019-04-25",
               "version": "latest"},
              {"level": ["lv1", "lv2"],
               "release_date": "2019-04-25",
               "version": "v118"},
              {"level": ["lv1", "lv2"],
               "release_date": "2019-04-24",
               "version": "v117"},...
}

Advanced Usage

Advanced usage 01 >> Articut level: segmentation granularity. The smaller the number, the finer the segmentation (default: lv2).

inputSTR = "小紅帽"
result = articut.parse(inputSTR, level="lv1")
pprint(result)

Returned result (lv1)

Finest-grained segmentation, suited to NLU or machine translation. Every element of the sentence is split out as finely as possible.

{"exec_time": 0.04814624786376953,
 "level": "lv1",
 "msg": "Success!",
 "result_pos": ["<MODIFIER>小</MODIFIER><MODIFIER_color>紅</MODIFIER_color><ENTITY_nounHead>帽</ENTITY_nounHead>"],
 "result_segmentation": "小/紅/帽/",
 "status": True,
 "version": "v118",
 "word_count_balance": 9997,...}

Returned result (lv2)

Phrase-level segmentation, suited to text analysis, feature extraction, keyword extraction, and similar applications. Results are presented in the smallest meaningful units.

{"exec_time": 0.04195523262023926,
 "level": "lv2",
 "msg": "Success!",
 "result_pos": ["<ENTITY_nouny>小紅帽</ENTITY_nouny>"],
 "result_segmentation": "小紅帽/",
 "status": True,
 "version": "v118",
 "word_count_balance": 9997,...}
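At either level, result_segmentation is a slash-delimited string. A small helper (illustrative only, not an ArticutAPI method) turns it into a token list, shown here on the lv1 and lv2 outputs above:

```python
def segmentation_to_tokens(result):
    """Split the slash-delimited result_segmentation string into tokens,
    dropping the empty element left by the trailing slash."""
    return [tok for tok in result["result_segmentation"].split("/") if tok]

lv1_result = {"result_segmentation": "小/紅/帽/"}
lv2_result = {"result_segmentation": "小紅帽/"}
print(segmentation_to_tokens(lv1_result))  # ['小', '紅', '帽']
print(segmentation_to_tokens(lv2_result))  # ['小紅帽']
```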

Advanced usage 02 >> UserDefinedDictFile: a user-defined dictionary.

Articut UserDefined Demo

Articut handles only linguistic knowledge, not encyclopedic knowledge, so we provide a user-defined vocabulary feature. Write your own entries in the dictionary format shown below.

UserDefinedFile.json

{"雷姆":["小老婆"],
 "艾蜜莉亞":["大老婆"],
 "初音未來": ["初音", "只是個軟體"],
 "李敏鎬": ["全民歐巴", "歐巴"]}

runArticut.py

from ArticutAPI import Articut
from pprint import pprint

articut = Articut()
userDefined = "./UserDefinedFile.json"
inputSTR = "我的最愛是小老婆,不是初音未來。"

# With the user-defined dictionary
result = articut.parse(inputSTR, userDefinedDictFILE=userDefined)
pprint(result)

# Without the user-defined dictionary
result = articut.parse(inputSTR)
pprint(result)

Returned result

# With the user-defined dictionary
{"result_pos": ["<ENTITY_pronoun>我</ENTITY_pronoun><FUNC_inner>的</FUNC_inner><ACTION_verb>最愛</ACTION_verb><AUX>是</AUX><UserDefined>小老婆</UserDefined>",
                ",",
                "<FUNC_negation>不</FUNC_negation><AUX>是</AUX><UserDefined>初音未來</UserDefined>",
                "。"],
 "result_segmentation": "我/的/最愛/是/小老婆/,/不/是/初音未來/。/",...}

# Without the user-defined dictionary
{"result_pos": ["<ENTITY_pronoun>我</ENTITY_pronoun><FUNC_inner>的</FUNC_inner><ACTION_verb>最愛</ACTION_verb><AUX>是</AUX><ENTITY_nouny>小老婆</ENTITY_nouny>",
                ",",
                "<FUNC_negation>不</FUNC_negation><AUX>是</AUX><ENTITY_nouny>初音</ENTITY_nouny><TIME_justtime>未來</TIME_justtime>",
                "。"],
 "result_segmentation": "我/的/最愛/是/小老婆/,/不/是/初音/未來/。/",...}

Advanced usage 03 - Querying the tourism open-data database

The government open-data platform hosts "geolocated tourism information published by government agencies and collected by the Tourism Bureau, MOTC". Articut can draw on this data and tag matching spans as <KNOWLEDGE_place>.

Request payload (JSON format)

{
	"username": "[email protected]",
	"api_key": "[email protected]",
	"input_str": "花蓮的原野牧場有一間餐廳",
	"version": "v137",
	"level": "lv1",
	"opendata_place": true
}
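A minimal sketch of posting this payload using only the Python standard library. The exact endpoint URL is an assumption based on the api.droidtown.co domain shown elsewhere in this document (verify it against the official documentation), and the empty username/api_key values are placeholders for your own credentials:

```python
import json
import urllib.request

# Assumed endpoint; verify against the official documentation.
ARTICUT_URL = "https://api.droidtown.co/Articut/API/"

payload = {
    "username": "",                      # placeholder: your account e-mail
    "api_key": "",                       # placeholder: your API key
    "input_str": "花蓮的原野牧場有一間餐廳",
    "version": "v137",
    "level": "lv1",
    "opendata_place": True,              # enable <KNOWLEDGE_place> tagging
}

def call_articut(data, url=ARTICUT_URL):
    """POST the JSON payload and return the decoded response dict."""
    req = urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# result = call_articut(payload)  # requires network access and valid credentials
```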

Response (JSON format)

{
	"exec_time": 0.013453006744384766,
	"level": "lv1",
	"msg": "Success!",
	"result_pos": ["<LOCATION>花蓮</LOCATION><FUNC_inner>的</FUNC_inner><KNOWLEDGE_place>原野牧場</KNOWLEDGE_place><ACTION_verb>有</ACTION_verb><ENTITY_classifier>一間</ENTITY_classifier><ENTITY_noun>餐廳</ENTITY_noun>"],
	"result_segmentation": "花蓮/的/原野牧場/有/一間/餐廳/",
	"status": true,
	"version": "v137",
	"word_count_balance": 99987
}

Advanced usage 04 - TF-IDF-based keyword extraction

  • articut.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    • sentence: the text to extract keywords from
    • topK: the number of TF-IDF keywords to return (default: 20)
    • withWeight: whether to return each keyword's weight (default: False)
    • allowPOS: extract only words with the given POS tags; default empty, i.e. no filtering
  • articut.analyse.TFIDF(idf_path=None) creates a new TFIDF object; idf_path is the path to an IDF corpus

Usage example: https://github.com/Droidtown/ArticutAPI/blob/master/ArticutAPI.py#L624
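Independent of ArticutAPI's internal implementation, the TF-IDF idea behind extract_tags can be sketched locally over already-segmented documents. The function name and the tiny corpus here are purely illustrative:

```python
import math
from collections import Counter

def tfidf_top_k(docs, doc_index, k=3):
    """Score the words of docs[doc_index] by TF-IDF against the whole corpus
    (docs is a list of already-segmented documents) and return the top k."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))      # document frequency
    tf = Counter(docs[doc_index])                                # term frequency
    total = len(docs[doc_index])
    scores = {w: (c / total) * math.log((n + 1) / (df[w] + 1))   # smoothed IDF
              for w, c in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = [
    ["會", "被", "大家", "盯上"],
    ["大家", "證明", "實力"],
    ["證明", "你", "有", "實力"],
]
# Words unique to doc 0 outrank "大家", which also appears in doc 1:
print(tfidf_top_k(docs, 0, k=2))
```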


Advanced usage 05 - TextRank-based keyword extraction

  • articut.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=())
    • sentence: the text to extract keywords from
    • topK: the number of keywords to return (default: 20)
    • withWeight: whether to return each keyword's weight (default: False)
    • allowPOS: extract only words with the given POS tags; default empty, i.e. no filtering
  • articut.analyse.TextRank() creates a new TextRank object

Algorithm paper: TextRank: Bringing Order into Texts

Basic idea:

  1. Segment the text from which keywords are to be extracted
  2. Build an unweighted graph from the co-occurrence relations between words within a fixed window (default: 5, adjustable via the span attribute)
  3. Compute PageRank for the nodes in the graph

Usage example: https://github.com/Droidtown/ArticutAPI/blob/master/ArticutAPI.py#L629
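The three steps above can be sketched as a self-contained toy, independent of ArticutAPI's actual implementation. The function name and parameters are illustrative, and the input is assumed to be an already-segmented word list (step 1, e.g. via articut.parse()):

```python
from collections import defaultdict

def textrank_keywords(words, span=5, topK=3, iters=30, d=0.85):
    """Step 2: build an unweighted co-occurrence graph over a fixed window;
    step 3: run PageRank on it and return the topK highest-ranked words."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1 : i + span]:   # co-occurrence within the window
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    rank = {w: 1.0 for w in graph}
    for _ in range(iters):                  # plain PageRank iteration
        rank = {w: (1 - d) + d * sum(rank[u] / len(graph[u]) for u in graph[w])
                for w in graph}
    return sorted(rank, key=rank.get, reverse=True)[:topK]

# Pre-segmented input words:
words = ["才", "證明", "你", "有", "實力", "證明", "實力"]
print(textrank_keywords(words, span=5, topK=2))
```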


Advanced usage 06 - Querying segmentation results with GraphQL

Watch the video

Using the GraphiQL tool

Requirements

Python 3.6.1
$ pip install graphene
$ pip install starlette
$ pip install jinja2
$ pip install uvicorn

Run ArticutGraphQL.py with the path to an Articut result file, then open http://0.0.0.0:8000/ in your browser.

$ python ArticutGraphQL.py articutResult.json

Usage example 01

GraphiQL Example 01

Usage example 02

GraphiQL Example 02

Using Articut-GraphQL

Install the graphene module

$ pip install graphene

Usage example 01

inputSTR = "地址:宜蘭縣宜蘭市縣政北七路六段55巷1號2樓"
result = articut.parse(inputSTR)
with open("articutResult.json", "w", encoding="utf-8") as resultFile:
    json.dump(result, resultFile, ensure_ascii=False)
	
graphQLResult = articut.graphQL.query(
    filePath="articutResult.json",
    query="""
	{
	  meta {
	    lang
	    description
	  }
	  doc {
	    text
	    tokens {
	      text
	      pos_
	      tag_
	      isStop
	      isEntity
	      isVerb
	      isTime
	      isClause
	      isKnowledge
	    }
	  }
	}""")
pprint(graphQLResult)

Returned result

Articut-GraphQL Example 01

Usage example 02

inputSTR = "劉克襄在本次活動當中,分享了台北中山北路一日遊路線。他表示當初自己領著柯文哲一同探索了雙連市場與中山捷運站的小吃與商圈,還有商圈內的文創商店與日系雜物店鋪,都令柯文哲留下深刻的印象。劉克襄也認為,雙連市場內的魯肉飯、圓仔湯與切仔麵,還有九條通的日式店家、居酒屋等特色,也能讓人感受到台北舊城區不一樣的魅力。"
result = articut.parse(inputSTR)
with open("articutResult.json", "w", encoding="utf-8") as resultFile:
    json.dump(result, resultFile, ensure_ascii=False)
	
graphQLResult = articut.graphQL.query(
    filePath="articutResult.json",
    query="""
	{
	  meta {
	    lang
	    description
	  }
	  doc {
	    text
	    ents {
	      persons {
	        text
	        pos_
	        tag_
	      }
	    }
	  }
	}""")
pprint(graphQLResult)

Returned result

Articut-GraphQL returned result 02
