# KoELECTRA-Pipeline

Transformers Pipeline with KoELECTRA

## Available Pipeline
| Subtask   | Model               | Link                                                                                                                   |
| --------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| NSMC      | koelectra-base      | [koelectra-base-finetuned-nsmc](https://huggingface.co/monologg/koelectra-base-finetuned-nsmc)                           |
|           | koelectra-small     | [koelectra-small-finetuned-nsmc](https://huggingface.co/monologg/koelectra-small-finetuned-nsmc)                         |
| Naver-NER | koelectra-base      | [koelectra-base-finetuned-naver-ner](https://huggingface.co/monologg/koelectra-base-finetuned-naver-ner)                 |
|           | koelectra-small     | [koelectra-small-finetuned-naver-ner](https://huggingface.co/monologg/koelectra-small-finetuned-naver-ner)               |
| KorQuad   | koelectra-base-v2   | [koelectra-base-v2-finetuned-korquad](https://huggingface.co/monologg/koelectra-base-v2-finetuned-korquad)               |
|           | koelectra-small-v2  | [koelectra-small-v2-distilled-korquad-384](https://huggingface.co/monologg/koelectra-small-v2-distilled-korquad-384)     |
## Customized NER Pipeline

A single word is sometimes split into multiple wordpieces, and `NerPipeline` reports its results at the piece level. This becomes a problem later when the output has to be restored to word units, as the tokenization sketch below illustrates.

The `NerPipeline` class has been partially modified and reimplemented in `ner_pipeline.py`. An `ignore_special_tokens` argument was added so that the results for the `[CLS]` and `[SEP]` tokens can be ignored, and with `ignore_labels=['O']` the results are shown with the `O` tag excluded.
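A quick way to see the piece-level problem is to tokenize a sample phrase directly. This is a minimal sketch; the printed pieces are illustrative, since the exact split depends on the vocabulary:

```python
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-finetuned-naver-ner")

# One word can map to several wordpieces; continuation pieces carry a "##"
# prefix, so piece-level NER tags must be merged before word-level use.
print(tokenizer.tokenize("잉글랜드 프리미어리그"))
# e.g. ['잉글랜드', '프리미어', '##리그']  (illustrative split)
```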
## Requirements

- torch>=1.4.0
- transformers==3.0.2
## Run reference code

```bash
$ python3 test_nsmc.py
$ python3 test_naver_ner.py
$ python3 test_korquad.py
```
## Example

### 1. NSMC

```python
from transformers import ElectraTokenizer, ElectraForSequenceClassification, pipeline

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-finetuned-nsmc")
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-small-finetuned-nsmc")

nsmc = pipeline(
    "sentiment-analysis",
    tokenizer=tokenizer,
    model=model
)

print(nsmc("이 영화는 미쳤다. 넷플릭스가 일상화된 시대에 극장이 존재해야하는 이유를 증명해준다."))

# Out
[{'label': 'positive', 'score': 0.8729340434074402}]
```
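For contrast, the same pipeline can be fed an obviously negative review. This is a hedged sketch; the sentence and the expected label are illustrative and not taken from the reference scripts:

```python
# A clearly negative review; the exact score will vary by model version.
print(nsmc("배우 연기도 스토리도 모두 엉망이었다."))
# e.g. [{'label': 'negative', 'score': ...}]  (illustrative)
```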
### 2. Naver-NER

```python
from transformers import ElectraTokenizer, ElectraForTokenClassification
from ner_pipeline import NerPipeline
from pprint import pprint

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-finetuned-naver-ner")
model = ElectraForTokenClassification.from_pretrained("monologg/koelectra-small-finetuned-naver-ner")

ner = NerPipeline(model=model,
                  tokenizer=tokenizer,
                  ignore_labels=[],
                  ignore_special_tokens=True)

pprint(ner("2009년 7월 FC서울을 떠나 잉글랜드 프리미어리그 볼턴 원더러스로 이적한 이청용은 크리스탈 팰리스와 독일 분데스리가2 VfL 보훔을 거쳐 지난 3월 K리그로 컴백했다. 팀은 지난 서울이 아닌 울산이었다"))

# Out
[{'entity': 'DAT-B', 'score': 0.9996234178543091, 'word': '2009년'},
 {'entity': 'DAT-I', 'score': 0.93541419506073, 'word': '7월'},
 {'entity': 'ORG-B', 'score': 0.9994615912437439, 'word': 'FC서울을'},
 {'entity': 'O', 'score': 0.999957799911499, 'word': '떠나'},
 {'entity': 'LOC-B', 'score': 0.9983285069465637, 'word': '잉글랜드'},
 {'entity': 'ORG-B', 'score': 0.9989873766899109, 'word': '프리미어리그'},
 {'entity': 'ORG-B', 'score': 0.9315412044525146, 'word': '볼턴'},
 {'entity': 'ORG-I', 'score': 0.9993480443954468, 'word': '원더러스로'},
 {'entity': 'O', 'score': 0.9999217987060547, 'word': '이적한'},
 {'entity': 'PER-B', 'score': 0.9994915127754211, 'word': '이청용은'},
 {'entity': 'ORG-B', 'score': 0.999463677406311, 'word': '크리스탈'},
 {'entity': 'ORG-I', 'score': 0.999179482460022, 'word': '팰리스와'},
 {'entity': 'LOC-B', 'score': 0.9977350234985352, 'word': '독일'},
 {'entity': 'ORG-B', 'score': 0.9813936352729797, 'word': '분데스리가2'},
 {'entity': 'ORG-B', 'score': 0.8733143210411072, 'word': 'VfL'},
 {'entity': 'ORG-I', 'score': 0.9937891960144043, 'word': '보훔을'},
 {'entity': 'O', 'score': 0.9999728202819824, 'word': '거쳐'},
 {'entity': 'DAT-B', 'score': 0.9963461756706238, 'word': '지난'},
 {'entity': 'DAT-I', 'score': 0.9909392595291138, 'word': '3월'},
 {'entity': 'ORG-B', 'score': 0.9995419383049011, 'word': 'K리그로'},
 {'entity': 'O', 'score': 0.9999108910560608, 'word': '컴백했다.'},
 {'entity': 'O', 'score': 0.9993030428886414, 'word': '팀은 지난'},
 {'entity': 'ORG-B', 'score': 0.9915705323219299, 'word': '서울이'},
 {'entity': 'O', 'score': 0.9999194741249084, 'word': '아닌'},
 {'entity': 'ORG-B', 'score': 0.9994401931762695, 'word': '울산이었다'}]
```
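As described above, passing `ignore_labels=['O']` filters the non-entity tokens out of this result. A minimal variant of the pipeline construction (the variable name is illustrative):

```python
# Same model and tokenizer, but entries tagged 'O' (e.g. '떠나', '거쳐')
# are dropped from the output.
ner_entities_only = NerPipeline(model=model,
                                tokenizer=tokenizer,
                                ignore_labels=["O"],
                                ignore_special_tokens=True)
```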
### 3. KorQuad

```python
from transformers import ElectraTokenizer, ElectraForQuestionAnswering, pipeline
from pprint import pprint

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-v2-distilled-korquad-384")
model = ElectraForQuestionAnswering.from_pretrained("monologg/koelectra-small-v2-distilled-korquad-384")

qa = pipeline("question-answering", tokenizer=tokenizer, model=model)

pprint(qa({
    "question": "한국의 대통령은 누구인가?",
    "context": "문재인 대통령은 28일 서울 코엑스에서 열린 ‘데뷰 (Deview) 2019’ 행사에 참석해 젊은 개발자들을 격려하면서 우리 정부의 인공지능 기본구상을 내놓았다.",
}))

# Out
{'answer': '문재인', 'end': 3, 'score': 0.9644287549022144, 'start': 0}
```
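The `start` and `end` values are character offsets into the context string, so the answer span can be recovered by plain slicing. A small continuation of the example above (variable names are illustrative):

```python
context = ("문재인 대통령은 28일 서울 코엑스에서 열린 ‘데뷰 (Deview) 2019’ 행사에 참석해 "
           "젊은 개발자들을 격려하면서 우리 정부의 인공지능 기본구상을 내놓았다.")

result = qa({"question": "한국의 대통령은 누구인가?", "context": context})

# 'start' and 'end' index into the original context, so slicing
# the context recovers the answer text exactly.
print(context[result["start"]:result["end"]])  # 문재인
```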