
nocoolsandwich / iamQA

Licence: other
A Chinese Wikipedia QA reading-comprehension system, built from an NER model trained on CCKS2016 data, a reading-comprehension model trained on CMRC2018, and Word2Vec word-vector search, deployed with TorchServe.

Programming Languages

python
139335 projects - #7 most used programming language
HTML
75241 projects

Projects that are alternatives to or similar to iamQA

DrFAQ
DrFAQ is a plug-and-play question answering NLP chatbot that can be generally applied to any organisation's text corpora.
Stars: ✭ 29 (-36.96%)
Mutual labels:  question-answering, bert
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+397.83%)
Mutual labels:  question-answering, bert
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+52.17%)
Mutual labels:  question-answering, bert
cmrc2019
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension (CMRC 2019)
Stars: ✭ 118 (+156.52%)
Mutual labels:  question-answering, bert
Medi-CoQA
Conversational Question Answering on Clinical Text
Stars: ✭ 22 (-52.17%)
Mutual labels:  question-answering, bert
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+7310.87%)
Mutual labels:  question-answering, bert
TriB-QA
We take bragging seriously.
Stars: ✭ 45 (-2.17%)
Mutual labels:  question-answering, bert
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+14369.57%)
Mutual labels:  question-answering, bert
SQUAD2.Q-Augmented-Dataset
Augmented version of SQUAD 2.0 for Questions
Stars: ✭ 31 (-32.61%)
Mutual labels:  question-answering, bert
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+63.04%)
Mutual labels:  question-answering, bert
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models
Stars: ✭ 58 (+26.09%)
Mutual labels:  question-answering, bert
cdQA-ui
⛔ [NOT MAINTAINED] A web interface for cdQA and other question answering systems.
Stars: ✭ 19 (-58.7%)
Mutual labels:  question-answering, bert
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+308.7%)
Mutual labels:  question-answering, bert
mcQA
🔮 Answering multiple choice questions with Language Models.
Stars: ✭ 23 (-50%)
Mutual labels:  question-answering, bert
erc
Emotion recognition in conversation
Stars: ✭ 34 (-26.09%)
Mutual labels:  bert
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-52.17%)
Mutual labels:  bert
syntaxdot
Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
Stars: ✭ 32 (-30.43%)
Mutual labels:  bert
Mengzi
Mengzi Pretrained Models
Stars: ✭ 238 (+417.39%)
Mutual labels:  bert
bert-as-a-service TFX
End-to-end pipeline with TFX to train and deploy a BERT model for sentiment analysis.
Stars: ✭ 32 (-30.43%)
Mutual labels:  bert
KoBERT-nsmc
Naver movie review sentiment classification with KoBERT
Stars: ✭ 57 (+23.91%)
Mutual labels:  bert

iamQA

A Chinese Wikipedia question-answering system. This project deploys its models with TorchServe (not recommended; if you can, just use Flask instead, which is also much easier to debug).
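As a rough illustration of that suggestion, here is a minimal Flask sketch. It is not code from this repository; answer_question is a hypothetical stand-in for the project's NER, retrieval and reader pipeline:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def answer_question(question: str) -> str:
        # Hypothetical placeholder: run NER on the question, fetch the matched
        # entity's document from the knowledge base, then run the reader model.
        return "..."

    @app.route("/qa", methods=["POST"])
    def qa():
        question = request.get_data(as_text=True)
        return jsonify({"answer": answer_question(question)})

    if __name__ == "__main__":
        # debug=True is the point of preferring Flask here: breakpoints just work.
        app.run(port=5000, debug=True)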

Knowledge base: Chinese Wikipedia data

Models: NER (trained on CCKS2016 data), a reading-comprehension model (trained on CMRC2018), plus Word2Vec word-vector search.

For more detail, see the article: an open-source Chinese QA project with WIKI + ALBERT + NER + W2V + TorchServe + a web front end.

Project architecture

[Project architecture diagram]

Module overview

Usage

  1. Download the project
    On Windows, download it directly; on Linux you can use:

    git clone https://github.com/nocoolsandwich/wikiCH_QA.git
  2. Install TorchServe

    See install-torchserve. Note that on Windows the OpenJDK 11 installation works differently; this article may help.

  3. Install requirements.txt

    Using the Douban mirror is faster:

    pip install -U -r requirements.txt -i https://pypi.douban.com/simple/
  4. Download the required files

  • Chinese Wikipedia dump: download link

    On Linux you can use:

    wget https://dumps.wikimedia.org/zhwiki/20201120/zhwiki-20201120-pages-articles-multistream.xml.bz2

    The file is about 2 GB. No need to decompress it; put it in the ChineseWiki-master root directory.

  • ALBERT model for NER

    The model is already trained, about 16 MB in total. Download links:

    drive baiduyun (extraction code: 1234)
    NER_model NER_model

    After downloading, place it under: NER\model

  • ALBERT model for the reader

    The model is already trained, about 35 MB in total. Download links:

    drive baiduyun (extraction code: 1234)
    reader_model reader_model

    After downloading, place it under: reader

  • W2V: download link

    Under Word2vec / Skip-Gram with Negative Sampling (SGNS), take the Word vectors for the Mixed-large (综合) corpus, via Baidu Netdisk or Google Drive.

    Or download via one of these links:

    drive baiduyun
    W2V.file W2V.file

    After downloading and extracting, place sgns.merge.word under: W2V
    In the W2V directory, run to_pickle.py to produce W2V.pickle. This converts the word vectors loaded into gensim into a pickle, so that TorchServe starts much faster later on (a sketch of the conversion follows below). Running to_pickle.py takes quite a while; you can carry on with the next steps in parallel.
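    A minimal sketch of what this conversion amounts to; the actual contents of to_pickle.py may differ, and only the file names are taken from this README:

    import pickle

    from gensim.models import KeyedVectors

    # Loading the text-format SGNS vectors is the slow part being cached here.
    wv = KeyedVectors.load_word2vec_format("sgns.merge.word", binary=False)

    # Re-serialize as a plain pickle so the serving handler can load it quickly.
    with open("W2V.pickle", "wb") as f:
        pickle.dump(wv, f)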

  5. Clean the Wikipedia data

    Run 1wiki_to_txt.py, 2wiki_txt_to_csv.py, 3wiki_csv_to_json.py and 4wiki_json_to_DB.py in that order.

    Output: ChineseWiki-master\DB_output\output.db. Move output.db into the reader directory; a quick way to sanity-check it is sketched below.
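    To sanity-check the result you can open the database from Python. The table name documents below is an assumption, so list the tables first and substitute the real one:

    import sqlite3

    conn = sqlite3.connect("output.db")
    # List the tables first, since the schema is not spelled out in this README.
    print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
    # Then sample an (id, doc) pair; replace `documents` with the real table name.
    print(conn.execute("SELECT id, substr(doc, 1, 80) FROM documents LIMIT 1").fetchone())
    conn.close()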

  6. Package the models and start the TorchServe service

    In the NER directory, run:

    torch-model-archiver --model-name NER --version 1.0 \
    --serialized-file ./Transformer_handler_generalized.py \
    --handler ./Transformer_handler_generalized.py --extra-files \
    "./model/find_NER.py,./model/best_ner.bin,./model/SIM_main.py,./model/CRF_Model.py,./model/BERT_CRF.py,./model/NER_main.py"

    In the reader directory, run:

    torch-model-archiver --model-name reader --version 1.0 \
    --serialized-file ./checkpoint_score_f1-86.233_em-66.853.pth \
    --handler ./Transformer_handler_generalized.py \
    --extra-files "./setup_config.json,./inference.py,./official_tokenization.py,./output.db"

    In the W2V directory, run:

    torch-model-archiver --model-name W2V --version 1.0 --serialized-file ./W2V.pickle --handler ./Transformer_handler_generalized.py

    In the wikiCH_QA directory, run:

    mkdir model_store
    mv NER/NER.mar model_store/
    mv W2V/W2V.mar model_store/
    cp W2V/config.properties config.properties
    mv reader/reader.mar model_store/

    Start the service (a quick smoke test of the endpoints is sketched after this step):

    torchserve --start --ts-config config.properties --model-store model_store \
    --models reader=reader.mar,NER=NER.mar,W2V=W2V.mar
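
    Once the workers are up, you can smoke-test the three models via TorchServe's standard inference API; the plain-text payload here is an assumption about what these handlers accept, not documented behaviour:

    import requests

    # POST /predictions/{model_name} is TorchServe's standard inference route;
    # 8080 is the default inference port (config.properties may override it).
    for model in ("NER", "reader", "W2V"):
        resp = requests.post(
            f"http://127.0.0.1:8080/predictions/{model}",
            data="姚明的身高是多少?".encode("utf-8"),
        )
        print(model, resp.status_code, resp.text[:200])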
  7. Start the web service
    In the drqa-webui-master directory, run:

    gunicorn --timeout 300 index:app

    Visit http://localhost:8000

Project notes

  • The NER module reaches 98% accuracy on CCKS2016 KBQA.
  • The reader module scores EM 66% and F1 86% on CMRC2018.
  • You can swap in your own knowledge base: all it takes is a SQLite database with id and doc fields, where id is the entity name and doc is that entity's document. Keep each document under roughly 512 characters, since BERT's input length is limited. A sketch of building such a database follows.
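
A minimal sketch of building such a replacement knowledge base; the table name documents is an assumption, so use whatever table the reader module actually queries:

    import sqlite3

    conn = sqlite3.connect("my_kb.db")
    # `documents` is an assumed table name; the required columns are id and doc.
    conn.execute("CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, doc TEXT)")
    conn.execute(
        "INSERT OR REPLACE INTO documents (id, doc) VALUES (?, ?)",
        # id is the entity name; keep each doc under ~512 characters for BERT.
        ("姚明", "姚明,前中国职业篮球运动员,身高2.26米。"),
    )
    conn.commit()
    conn.close()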

Demo
