nlpaueb / Greek Bert

Licence: MIT
A Greek edition of the BERT pre-trained language model

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Greek Bert

Lazynlp
Library to scrape and clean web pages to create massive datasets.
Stars: ✭ 1,985 (+2263.1%)
Mutual labels:  natural-language-processing, language-model
Trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Stars: ✭ 311 (+270.24%)
Mutual labels:  natural-language-processing, language-model
Bert Sklearn
a sklearn wrapper for Google's BERT model
Stars: ✭ 182 (+116.67%)
Mutual labels:  natural-language-processing, language-model
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+66259.52%)
Mutual labels:  natural-language-processing, language-model
Spacy Transformers
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy
Stars: ✭ 919 (+994.05%)
Mutual labels:  natural-language-processing, language-model
Chars2vec
Character-based word embeddings model based on RNN for handling real world texts
Stars: ✭ 130 (+54.76%)
Mutual labels:  natural-language-processing, language-model
Bluebert
BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).
Stars: ✭ 273 (+225%)
Mutual labels:  natural-language-processing, language-model
Lingo
package lingo provides the data structures and algorithms required for natural language processing
Stars: ✭ 113 (+34.52%)
Mutual labels:  natural-language-processing, language-model
Dl Nlp Readings
My Reading Lists of Deep Learning and Natural Language Processing
Stars: ✭ 656 (+680.95%)
Mutual labels:  natural-language-processing, language-model
Awesome Bert Nlp
A curated list of NLP resources focused on BERT, attention mechanism, Transformer networks, and transfer learning.
Stars: ✭ 567 (+575%)
Mutual labels:  natural-language-processing, language-model
Easy Bert
A Dead Simple BERT API for Python and Java (https://github.com/google-research/bert)
Stars: ✭ 106 (+26.19%)
Mutual labels:  natural-language-processing, language-model
Vietnamese Electra
Electra pre-trained model using Vietnamese corpus
Stars: ✭ 55 (-34.52%)
Mutual labels:  natural-language-processing, language-model
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+141.67%)
Mutual labels:  natural-language-processing, language-model
Tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: ✭ 5,077 (+5944.05%)
Mutual labels:  natural-language-processing, language-model
Spago
Self-contained Machine Learning and Natural Language Processing library in Go
Stars: ✭ 854 (+916.67%)
Mutual labels:  natural-language-processing, language-model
Gpt2
PyTorch Implementation of OpenAI GPT-2
Stars: ✭ 64 (-23.81%)
Mutual labels:  natural-language-processing, language-model
Rutermextract
Term extraction for Russian language
Stars: ✭ 75 (-10.71%)
Mutual labels:  natural-language-processing
Text Dependency Parser
🏄 Dependency parsing, NLP, natural language processing
Stars: ✭ 78 (-7.14%)
Mutual labels:  natural-language-processing
Hunspell
The most popular spellchecking library.
Stars: ✭ 1,196 (+1323.81%)
Mutual labels:  natural-language-processing
Nlp Tutorial
Natural Language Processing Tutorial for Deep Learning Researchers
Stars: ✭ 9,895 (+11679.76%)
Mutual labels:  natural-language-processing

GreekBERT

A Greek edition of Google's BERT pre-trained language model.

Pre-training corpora

The pre-training corpora of bert-base-greek-uncased-v1 include:

  • The Greek part of Wikipedia,
  • The Greek part of the European Parliament Proceedings Parallel Corpus (Europarl),
  • The Greek part of OSCAR, a cleansed version of Common Crawl.

Future releases will also include:

  • The entire corpus of Greek legislation, as published by the National Publication Office,
  • The entire corpus of EU legislation (Greek translation), as published in Eur-Lex.

Pre-training details

  • We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert).
  • We released a model similar to the English bert-base-uncased model (12-layer, 768-hidden, 12-heads, 110M parameters); the released configuration can be inspected as shown in the sketch below.
  • We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
  • We were able to use a single Google Cloud TPU v3-8, provided for free by the TensorFlow Research Cloud (TFRC), while also utilizing GCP research credits. Huge thanks to both Google programs for supporting us!
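
You can verify that the released checkpoint matches this set-up by inspecting its configuration. A minimal sketch, assuming a recent version of the transformers library:

from transformers import AutoConfig

# Fetch the released configuration from the Hugging Face Hub.
config = AutoConfig.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

# Expect 12 layers, hidden size 768, and 12 attention heads,
# matching the bert-base set-up described above.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)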

Requirements

We published bert-base-greek-uncased-v1 as part of Hugging Face's Transformers repository, so you need to install the transformers library through pip, along with either PyTorch or TensorFlow 2. Note that the unicodedata module used in the pre-processing example below ships with Python's standard library and does not need to be installed separately.

pip install transformers
pip install (torch|tensorflow)

Pre-process text (Deaccent - Lower)

In order to use bert-base-greek-uncased-v1, you have to pre-process texts by lowercasing them and removing all Greek diacritics.

import unicodedata

def strip_accents_and_lowercase(s):
    # Decompose each character (NFD), drop the combining marks
    # (category 'Mn', i.e. the diacritics), and lowercase the result.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string)  # αυτη ειναι η ελληνικη εκδοση του bert.

Load Pretrained Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
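
Once loaded, the model can produce contextual embeddings for pre-processed Greek text. A minimal sketch, assuming a recent version of transformers with PyTorch installed (the example sentence and variable names are ours):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

# Encode a lowercased, deaccented sentence (see the pre-processing step above).
inputs = tokenizer("αυτη ειναι η ελληνικη εκδοση του bert.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per (sub-)token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])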

Use Pretrained Model as a Language Model

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

# Load model and tokenizer
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelWithLMHead.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
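
Recent versions of the transformers library also expose a fill-mask pipeline that wraps the tokenize-predict-decode steps above. A minimal sketch of the first example using this convenience API (our suggestion, not part of the original instructions):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/bert-base-greek-uncased-v1")

# Print the top candidate tokens for [MASK] together with their scores.
for prediction in fill_mask("O ποιητής έγραψε ένα [MASK] ."):
    print(prediction["token_str"], prediction["score"])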

Evaluation on downstream tasks

TBA

Author

Ilias Chalkidis on behalf of AUEB's Natural Language Processing Group

| GitHub: @ilias.chalkidis | Twitter: @KiddoThe2B |

About Us

AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

  • question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
  • natural language generation from databases and ontologies, especially Semantic Web ontologies,
  • text classification, including filtering spam and abusive content,
  • information extraction and opinion mining, including legal text analytics and sentiment analysis,
  • natural language processing tools for Greek, for example parsers and named-entity recognizers,
  • machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.
