
seriousran / BERT-embedding

Licence: other
A simple wrapper class for extracting features (embeddings) and comparing them using BERT in TensorFlow

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to BERT-embedding

KLUE
📖 Korean NLU Benchmark
Stars: ✭ 420 (+1650%)
Mutual labels:  korean, bert, korean-nlp
kss
Kss: A Toolkit for Korean sentence segmentation
Stars: ✭ 198 (+725%)
Mutual labels:  korean, korean-nlp
ADL2019
Applied Deep Learning (2019 Spring) @ NTU
Stars: ✭ 20 (-16.67%)
Mutual labels:  bert, contextual-embeddings
detox
Korean Hate Speech Detection Model
Stars: ✭ 38 (+58.33%)
Mutual labels:  korean, korean-nlp
LMMS
Language Modelling Makes Sense - WSD (and more) with Contextual Embeddings
Stars: ✭ 79 (+229.17%)
Mutual labels:  bert, contextual-embeddings
KoEDA
Korean Easy Data Augmentation
Stars: ✭ 62 (+158.33%)
Mutual labels:  korean, korean-nlp
g2pK
g2pK: g2p module for Korean
Stars: ✭ 137 (+470.83%)
Mutual labels:  korean, korean-nlp
hangul-search-js
🇰🇷 Simple Korean text search module
Stars: ✭ 22 (-8.33%)
Mutual labels:  korean, korean-nlp
KoSpacing
Automatic Korean word spacing with R
Stars: ✭ 76 (+216.67%)
Mutual labels:  korean, korean-nlp
Cool-NLPCV
Some Cool NLP and CV Repositories and Solutions (a collection of open-source solutions, datasets, tools, and learning materials for common NLP tasks)
Stars: ✭ 143 (+495.83%)
Mutual labels:  embedding, bert
PyKOMORAN
(Beta) PyKOMORAN wraps KOMORAN in Python using Py4J.
Stars: ✭ 38 (+58.33%)
Mutual labels:  korean, korean-nlp
tensorflow-ml-nlp-tf2
Practice materials for "Natural Language Processing Starting with TensorFlow 2 and Machine Learning (from Logistic Regression to BERT and GPT-3)"
Stars: ✭ 245 (+920.83%)
Mutual labels:  bert, korean-nlp
AnnA Anki neuronal Appendix
Using machine learning on your Anki collection to enhance scheduling via semantic clustering and semantic similarity
Stars: ✭ 39 (+62.5%)
Mutual labels:  embedding, bert
korean-dev-books
📚 Curated list of Korean development/CS books
Stars: ✭ 51 (+112.5%)
Mutual labels:  korean
iOS-Programming-Documents
iOS Programming Documents in Korean
Stars: ✭ 64 (+166.67%)
Mutual labels:  korean
KoreanTextMatcher
ํ•œ๊ธ€ ์Œ์ ˆ ๊ทผ์‚ฌ ๋งค์นญ/์ดˆ์„ฑ ๊ฒ€์ƒ‰ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
Stars: โœญ 39 (+62.5%)
Mutual labels:  korean
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models
Stars: ✭ 81 (+237.5%)
Mutual labels:  embedding
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-29.17%)
Mutual labels:  bert
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+212.5%)
Mutual labels:  bert
bert nli
A Natural Language Inference (NLI) model based on Transformers (BERT and ALBERT)
Stars: ✭ 97 (+304.17%)
Mutual labels:  bert

BERT-embedding

A simple wrapper class for extracting features (embeddings) and comparing them using BERT

How to Use

Installation

git clone https://github.com/seriousmac/BERT-embedding.git
cd BERT-embedding
pip install -r requirements.txt
wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
unzip multi_cased_L-12_H-768_A-12.zip -d bert/

Run a test

python bert_embedding.py

Major functions

  • bert.init() # initialize the model

  • bert.extract(sentence) # extract the full result; the input/output structure is described in detail under "Input and output" below

  • bert.extracts(sentences) # same as extract, but takes a list of strings

  • bert.extract_v1(sentence) # extract only the embedding values

  • bert.extracts_v1(sentences)

  • bert.cal_dif_cls(result1, result2) # compute the distance between two outputs of extract or extracts

  • bert.cal_dif_cls_layer(result1, result2, layer_num) # as above, but restricted to a single layer (see the sketch after this list)
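
None of the examples below exercises cal_dif_cls_layer, so here is a minimal sketch of the two distance functions called side by side. The layer_num value is an assumption: the Google BERT feature extractor labels layers with negative indices (-1 for the last layer), and this sketch assumes the wrapper passes those through unchanged.

from bert_embedding import BERT

bert = BERT()
bert.init()

r1 = bert.extract('자본 유출과 서비스 수지 적자 폭이 커지고 있다.')
r2 = bert.extract('가수 겸 배우 수지가 스타로 꼽혔다.')

print(bert.cal_dif_cls(r1, r2))            # distance over the full CLS output
print(bert.cal_dif_cls_layer(r1, r2, -1))  # -1 assumed to mean the last layer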

Input and output

  • bert.extracts(sentences)
    • input: list of strings
    • output: list of dicts
      • 'features': a list with one entry per token in the input sentence
        • 'token': the token string
        • 'layers': list of layer dicts
          • 'index': the layer number
          • 'values': list of 768 floats = the extracted features (embedding)
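
For example, pulling the [CLS] vector out of a single result could look like the sketch below. It assumes the first feature entry is [CLS] (the Google BERT feature extractor always emits it first) and that layers appear in the order the wrapper requested them; both are assumptions about this wrapper, not documented guarantees.

from bert_embedding import BERT

bert = BERT()
bert.init()

results = bert.extracts(['자본 유출과 서비스 수지 적자 폭이 커지고 있다.'])
cls_feature = results[0]['features'][0]   # assumed: first token is [CLS]
layer = cls_feature['layers'][0]          # assumed: first listed layer
print(cls_feature['token'], layer['index'], len(layer['values']))  # e.g. [CLS] -1 768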

Examples

Example 1 - Extracting the embedding from a single sentence

from bert_embedding import BERT

bert = BERT()
bert.init()

sentence = "[OBS 독특한 연예뉴스 조연수 기자] 가수 겸 배우 수지가 '국민' 타이틀을 거머쥔 스타로 꼽혔다."
result = bert.extract(sentence)

Example 2 - Extracting embeddings from multiple sentences

from bert_embedding import BERT

bert = BERT()
bert.init()

sentences = ['‘세계의 공장’으로 막대한 달러를 쓸어담으며 경제력을 키웠던 중국의 좋은 시절도 오래가지 않을 듯하다.',
   '자본 유출과 서비스 수지 적자 폭이 커지며 경상수지 적자를 향해 빠르게 다가가고 있어서다.',
   "[OBS 독특한 연예뉴스 조연수 기자] 가수 겸 배우 수지가 '국민' 타이틀을 거머쥔 스타로 꼽혔다.",
   "OBS '독특한 연예뉴스'(기획·연출·감수 윤경철, 작가 박은경·김현선)가 '국민 신드롬'을 일으킨 첫사랑의 아이콘 김연아, 수지, 설현의 근황을 살펴봤다."]
results = bert.extracts(sentences)

Example 3 - Finding the closest sentence by distance, using only the CLS token

from bert_embedding import BERT

bert = BERT()  
bert.init()

sentences = ['‘세계의 공장’으로 막대한 달러를 쓸어담으며 경제력을 키웠던 중국의 좋은 시절도 오래가지 않을 듯하다.',
             '자본 유출과 서비스 수지 적자 폭이 커지며 경상수지 적자를 향해 빠르게 다가가고 있어서다.',
             '[OBS 독특한 연예뉴스 조연수 기자] 가수 겸 배우 수지가 국민 타이틀을 거머쥔 스타로 꼽혔다.',
             'OBS 독특한 연예뉴스(기획·연출·감수 윤경철, 작가 박은경·김현선)가 국민 신드롬을 일으킨 첫사랑의 아이콘 김연아, 수지, 설현의 근황을 살펴봤다.',
             '오늘은 날씨가 좋습니다. 맛집을 찾아 가볼까요? 아이들이 좋아하더라구요.',
             '보쌈집에서는 보쌈을 맛있게 하면 그만입니다.ㅋㅋ']

results = bert.extracts(sentences)

distances = []
for i in range(len(results)):
  distance = []
  for j in range(len(results)):
    if i == j:
      distance.append(99999)  # large sentinel so a sentence never matches itself
    else:
      distance.append(bert.cal_dif_cls(results[i], results[j]))
  distances.append(distance)

# for each sentence, print its nearest neighbour by CLS distance
for idx in range(len(sentences)):
  print(sentences[idx])
  print(sentences[distances[idx].index(min(distances[idx]))])
  print()

Output

‘세계의 공장’으로 막대한 달러를 쓸어담으며 경제력을 키웠던 중국의 좋은 시절도 오래가지 않을 듯하다.
자본 유출과 서비스 수지 적자 폭이 커지며 경상수지 적자를 향해 빠르게 다가가고 있어서다.

자본 유출과 서비스 수지 적자 폭이 커지며 경상수지 적자를 향해 빠르게 다가가고 있어서다.
‘세계의 공장’으로 막대한 달러를 쓸어담으며 경제력을 키웠던 중국의 좋은 시절도 오래가지 않을 듯하다.

[OBS 독특한 연예뉴스 조연수 기자] 가수 겸 배우 수지가 국민 타이틀을 거머쥔 스타로 꼽혔다.
OBS 독특한 연예뉴스(기획·연출·감수 윤경철, 작가 박은경·김현선)가 국민 신드롬을 일으킨 첫사랑의 아이콘 김연아, 수지, 설현의 근황을 살펴봤다.

OBS 독특한 연예뉴스(기획·연출·감수 윤경철, 작가 박은경·김현선)가 국민 신드롬을 일으킨 첫사랑의 아이콘 김연아, 수지, 설현의 근황을 살펴봤다.
[OBS 독특한 연예뉴스 조연수 기자] 가수 겸 배우 수지가 국민 타이틀을 거머쥔 스타로 꼽혔다.

오늘은 날씨가 좋습니다. 맛집을 찾아 가볼까요? 아이들이 좋아하더라구요.
보쌈집에서는 보쌈을 맛있게 하면 그만입니다.ㅋㅋ

보쌈집에서는 보쌈을 맛있게 하면 그만입니다.ㅋㅋ
오늘은 날씨가 좋습니다. 맛집을 찾아 가볼까요? 아이들이 좋아하더라구요.

Example 4 - Comparing the embedding of a specific token across sentences

from bert_embedding import BERT
bert = BERT()
bert.init()

sentences = ["๋งˆ์น˜ ํ™”๋ณด ์ปท์„ ๋ฐฉ๋ถˆ์ผ€ ํ•œ ์ด๋ฒˆ ์ด๋ฏธ์ง€๋Š” ํ•ด์™ธ ๋กœ์ผ€์ดฌ์˜ ์‹œ ์ดฌ์˜๋œ ์ปท์œผ๋กœ ํŠนํžˆ, ์˜๋กœ์šฐ ์ปฌ๋Ÿฌ์˜ ๋ ˆํŠธ๋กœํ•œ ํ‹ดํŠธ์„ ๊ธ€๋ผ์Šค๋ฅผ ์ฐฉ์šฉํ•œ ์ฑ„ ์ง€ํ”„์ฐจ๋ฅผ ์šด์ „ํ•˜๋Š” ์ˆ˜์ง€์˜ ๋ชจ์Šต์—์„œ ๊ธฐ์กด์˜ ์ฒญ์ˆœํ•œ ๋ชจ์Šต๊ณผ๋Š” ๋‹ค๋ฅธ ๋„ํšŒ์ ์ธ ๋ถ„์œ„๊ธฐ์™€ ํ•œ์ธต ์„ฑ์ˆ™ํ•ด์ง„ ๋ชจ์Šต์„ ๋ณด์—ฌ์ฃผ๋ฉฐ ๊ทน ์ค‘ ์บ๋ฆญํ„ฐ์— ๋Œ€ํ•œ ๊ธฐ๋Œ€๊ฐ์„ ๋†’์˜€๋‹ค.",
'์ž๋ณธ ์œ ์ถœ๊ณผ ์„œ๋น„์Šค ์ˆ˜์ง€ ์ ์ž ํญ์ด ์ปค์ง€๋ฉฐ ๊ฒฝ์ƒ ์ˆ˜์ง€ ์ ์ž๋ฅผ ํ–ฅํ•ด ๋น ๋ฅด๊ฒŒ ๋‹ค๊ฐ€๊ฐ€๊ณ  ์žˆ์–ด์„œ๋‹ค.',
"[์กฐ์—ฐ์ˆ˜ ๊ธฐ์ž] ๊ฐ€์ˆ˜ ๊ฒธ ๋ฐฐ์šฐ ์ˆ˜์ง€๊ฐ€ ๊ตญ๋ฏผ ํƒ€์ดํ‹€์„ ๊ฑฐ๋จธ์ฅ” ์Šคํƒ€๋กœ ๊ผฝํ˜”๋‹ค."]

results = bert.extracts(sentences)

for i in range(len(results)):
  for j in range(len(results)):
    print(sentences[i])
    print(sentences[j])
    # cal_dif_keyword is assumed to be a method of the BERT wrapper; it
    # compares the embeddings of the token '수지' in the two results.
    print(bert.cal_dif_keyword(results[i], results[j], '수지'))
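
cal_dif_keyword is not listed under Major functions above, so if your copy of the wrapper lacks it, the same comparison can be sketched directly against the output structure. The helper below is hypothetical: it assumes the keyword survives WordPiece tokenization as a single token, and it returns None when the token is missing from either sentence.

import math

def keyword_distance(result1, result2, keyword, layer=0):
  # Hypothetical helper: Euclidean distance between the embeddings of
  # `keyword` in two extract() results; None if the token is absent.
  def find_vector(result):
    for feature in result['features']:
      if feature['token'] == keyword:
        return feature['layers'][layer]['values']
    return None
  v1, v2 = find_vector(result1), find_vector(result2)
  if v1 is None or v2 is None:
    return None
  return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))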

To Do List

  • Define class
  • Easily extract embeddings
  • Compute sentence distance using only the CLS token
  • Compute distance using the embeddings of all tokens in a sentence
  • Compare only a specific token across sentences (e.g., check how the value of '수지' in an economics article (balance of payments) differs from '수지' in an entertainment article (the celebrity Suzy))
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].