
CLARIN-PL / embeddings

Licence: MIT license
Embeddings: State-of-the-art text representations for Natural Language Processing tasks. The initial version of the library focuses on the Polish language.




🚧️ The library is currently in an active development state. Some functionalities may be subject to change before the stable release. Users can track our milestones here.

Installation

pip install clarinpl-embeddings

Example

Text classification with the polemo2 dataset and transformer-based embeddings

from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="allegro/herbert-base-cased",
    input_column_name="text",
    target_column_name="target",
    output_path="."
)

print(pipeline.run())

⚠️ For now, the default pipeline hyperparameters may give poor results. They are subject to change in future releases. We encourage users to use Optimized Pipelines to select appropriate hyperparameters.

Conventions

We use many HuggingFace concepts, such as models (https://huggingface.co/models) and datasets (https://huggingface.co/datasets), to make our library as easy to use as possible. We want to enable users to create, customise, test, and execute NLP / NLU / SLU tasks as quickly as possible. Moreover, we provide easy-to-use static embeddings that were trained by CLARIN-PL.

Pipelines

We share predefined pipelines for common NLP tasks, together with corresponding scripts. For Transformer-based pipelines we use PyTorch Lightning trainers with Transformers AutoModels. For static-embedding-based pipelines we use the Flair library under the hood.

REMARK: Transformer-based embeddings are currently not blocked in the Flair pipelines, but this support may be removed in the near future. We encourage using the Lightning-based pipelines for transformers.

Transformer embedding based pipelines (e.g. BERT, RoBERTa, HerBERT):

| Task | Class | Script |
|---|---|---|
| Text classification | LightningClassificationPipeline | evaluate_lightning_document_classification.py |
| Sequence labelling | LightningSequenceLabelingPipeline | evaluate_lightning_sequence_labeling.py |

Static embedding based pipelines (e.g. word2vec, fastText):

| Task | Class | Script |
|---|---|---|
| Text classification | FlairClassificationPipeline | evaluate_document_classification.py |
| Sequence labelling | FlairSequenceLabelingPipeline | evaluate_sequence_labelling.py |
| Sequence pair classification | FlairPairClassificationPipeline | evaluate_document_pair_classification.py |
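
The choice between the two pipeline families can be summarised as a simple lookup. The sketch below only transcribes the tables above into a dictionary; the helper name `pipeline_class_name` and the "transformer" / "static" labels are assumptions made for illustration and are not part of the library.

```python
# Lookup table transcribed from the pipeline tables in this README.
# The keys and the helper below are illustrative, not library API.
PIPELINES = {
    ("transformer", "text classification"): "LightningClassificationPipeline",
    ("transformer", "sequence labelling"): "LightningSequenceLabelingPipeline",
    ("static", "text classification"): "FlairClassificationPipeline",
    ("static", "sequence labelling"): "FlairSequenceLabelingPipeline",
    ("static", "sequence pair classification"): "FlairPairClassificationPipeline",
}


def pipeline_class_name(embedding_kind: str, task: str) -> str:
    """Return the pipeline class name listed for an embedding kind and a task."""
    key = (embedding_kind.lower(), task.lower())
    if key not in PIPELINES:
        raise ValueError(f"No predefined pipeline listed for {key}")
    return PIPELINES[key]


print(pipeline_class_name("static", "text classification"))
```

A combination that is not listed, such as transformer-based pair classification, raises a `ValueError` instead of guessing a class name.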

Writing custom HuggingFace-based pipeline

from pathlib import Path

from embeddings.data.data_loader import HuggingFaceDataLoader
from embeddings.data.dataset import Dataset
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from embeddings.evaluator.text_classification_evaluator import TextClassificationEvaluator
from embeddings.model.flair_model import FlairModel
from embeddings.pipeline.standard_pipeline import StandardPipeline
from embeddings.task.flair_task.text_classification import TextClassification
from embeddings.transformation.flair_transformation.classification_corpus_transformation import (
    ClassificationCorpusTransformation,
)

dataset = Dataset("clarin-pl/polemo2-official")
data_loader = HuggingFaceDataLoader()
transformation = ClassificationCorpusTransformation("text", "target")
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
task = TextClassification(Path("."))
model = FlairModel(embedding, task)
evaluator = TextClassificationEvaluator()

pipeline = StandardPipeline(dataset, data_loader, transformation, model, evaluator)
result = pipeline.run()

Running tasks scripts

All up-to-date examples can be found in the examples directory.

cd examples

Run classification task

An example with non-default arguments.

python evaluate_lightning_document_classification.py \
    --embedding-name-or-path allegro/herbert-base-cased \
    --dataset-name clarin-pl/polemo2-official \
    --input-columns-name text \
    --target-column-name target

Run sequence labeling task

An example with the default language model and dataset.

python evaluate_lightning_sequence_labeling.py

Run pair classification task

An example with a static embedding model.

python evaluate_document_pair_classification.py \
    --embedding-name-or-path clarin-pl/word2vec-kgr10

Compatible datasets

Most datasets in the HuggingFace repository should be compatible with our pipelines. The following datasets have been tested by the authors.

| Dataset name | Task type | input_column_name(s) | target_column_name | Description |
|---|---|---|---|---|
| clarin-pl/kpwr-ner | sequence labeling (named entity recognition) | tokens | ner | KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions etc. |
| clarin-pl/polemo2-official | classification (sentiment analysis) | text | target | A corpus of consumer reviews from 4 domains: medicine, hotels, products and school. |
| clarin-pl/2021-punctuation-restoration | punctuation restoration | text_in | text_out | Dataset contains original texts and ASR output. It is a part of PolEval 2021 Competition. |
| clarin-pl/nkjp-pos | sequence labeling (part-of-speech tagging) | tokens | pos_tags | NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc. |
| clarin-pl/aspectemo | sequence labeling (sentiment classification) | tokens | labels | AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus of Polish customer reviews used in many projects on the use of different methods in sentiment analysis. |
| laugustyniak/political-advertising-pl | sequence labeling (political advertising) | tokens | tags | First publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. |
| laugustyniak/abusive-clauses-pl | classification (abusive-clauses) | text | class | Dataset with Polish abusive clauses examples. |
| allegro/klej-dyk | pair classification (question answering)* | (question, answer) | target | The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs. |
| allegro/klej-psc | pair classification (text summarization)* | (extract_text, summary_text) | label | The Polish Summaries Corpus contains news articles and their summaries. |
| allegro/klej-cdsc-e | pair classification (textual entailment)* | (sentence_A, sentence_B) | entailment_judgment | Polish sentence pairs that are human-annotated for textual entailment. |

*Only the pair classification task is supported for now.
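
To avoid retyping column names, the table above can be kept as a small lookup. This is a minimal sketch: the dictionary is transcribed from the table, while the helper `column_settings` is hypothetical. The returned keys mirror the `input_column_name` and `target_column_name` arguments used in the classification examples; for pair datasets the actual parameter name expected by the pipeline may differ, so treat the result as a starting point only.

```python
from typing import Dict, Tuple, Union

Columns = Union[str, Tuple[str, str]]

# (input column(s), target column) per dataset, transcribed from the table above.
TESTED_DATASETS: Dict[str, Tuple[Columns, str]] = {
    "clarin-pl/kpwr-ner": ("tokens", "ner"),
    "clarin-pl/polemo2-official": ("text", "target"),
    "clarin-pl/2021-punctuation-restoration": ("text_in", "text_out"),
    "clarin-pl/nkjp-pos": ("tokens", "pos_tags"),
    "clarin-pl/aspectemo": ("tokens", "labels"),
    "laugustyniak/political-advertising-pl": ("tokens", "tags"),
    "laugustyniak/abusive-clauses-pl": ("text", "class"),
    "allegro/klej-dyk": (("question", "answer"), "target"),
    "allegro/klej-psc": (("extract_text", "summary_text"), "label"),
    "allegro/klej-cdsc-e": (("sentence_A", "sentence_B"), "entailment_judgment"),
}


def column_settings(dataset_name: str) -> Dict[str, Columns]:
    """Return the column names listed for a tested dataset (hypothetical helper)."""
    if dataset_name not in TESTED_DATASETS:
        raise KeyError(f"{dataset_name} is not in the list of tested datasets")
    input_columns, target_column = TESTED_DATASETS[dataset_name]
    return {"input_column_name": input_columns, "target_column_name": target_column}


print(column_settings("clarin-pl/kpwr-ner"))
```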

Passing task model and task training parameters to predefined flair pipelines

Model and training parameters can be controlled via the task_model_kwargs and task_train_kwargs parameters, which can be populated using the advanced config. A tutorial on how to use configs can be found in the /tutorials directory of the repository. Two types of config are defined in our library: BasicConfig and AdvancedConfig. In short, BasicConfig takes arguments and automatically assigns them to the proper keyword group, while AdvancedConfig takes as input keyword groups that must already be correctly mapped.

The list of available configs can be found below:

Flair:

  • FlairBasicConfig
  • FlairSequenceLabelingBasicConfig
  • FlairTextClassificationBasicConfig
  • FlairSequenceLabelingAdvancedConfig
  • FlairTextClassificationAdvancedConfig

Lightning:

  • LightningBasicConfig
  • LightningAdvancedConfig
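
The difference between the two config types can be illustrated without the library. The sketch below is a simplified, hypothetical model of the idea and not the library's implementation: `AdvancedSketch` expects keyword groups that are already mapped, while `basic_sketch` takes flat arguments and routes each one to a group. The routing table is an assumption that follows the grouping shown in the Lightning advanced config example in this README.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Assumed routing table: which keyword group each flat argument belongs to.
ARGUMENT_GROUPS = {
    "learning_rate": "task_model_kwargs",
    "max_epochs": "task_train_kwargs",
    "accelerator": "task_train_kwargs",
}


@dataclass
class AdvancedSketch:
    """Takes keyword groups that the caller has already mapped."""

    task_model_kwargs: Dict[str, Any] = field(default_factory=dict)
    task_train_kwargs: Dict[str, Any] = field(default_factory=dict)


def basic_sketch(**flat_arguments: Any) -> AdvancedSketch:
    """Take flat arguments and assign each one to its keyword group."""
    groups: Dict[str, Dict[str, Any]] = {
        "task_model_kwargs": {},
        "task_train_kwargs": {},
    }
    for name, value in flat_arguments.items():
        if name not in ARGUMENT_GROUPS:
            raise TypeError(f"Unknown argument: {name}")
        groups[ARGUMENT_GROUPS[name]][name] = value
    return AdvancedSketch(**groups)


config = basic_sketch(learning_rate=0.01, max_epochs=20)
print(config)
```

The basic form is shorter to write, while the advanced form leaves the grouping, and therefore the responsibility for getting it right, with the caller.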

Example with polemo2 dataset

Flair pipeline

from embeddings.pipeline.flair_classification import FlairClassificationPipeline
from embeddings.config.flair_config import FlairTextClassificationAdvancedConfig

config = FlairTextClassificationAdvancedConfig(
    task_model_kwargs={
        "loss_weights": {
            "plus": 2.0,
            "minus": 2.0
        }
    },
    task_train_kwargs={
        "learning_rate": 0.01,
        "max_epochs": 20
    }
)
pipeline = FlairClassificationPipeline(
    dataset_name="clarin-pl/polemo2-official",
    embedding_name="clarin-pl/word2vec-kgr10",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=config
)

print(pipeline.run())

Lightning pipeline

from embeddings.config.lightning_config import LightningBasicConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

config = LightningBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0,
    accelerator="cpu"
)

pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name=["text"],
    target_column_name="target",
    load_dataset_kwargs={
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    },
    output_path=".",
    config=config
)

You can also define an advanced config with populated keyword arguments. In general, these keyword arguments are passed on to the corresponding objects when a specific pipeline is constructed. To find out which arguments can be set in the config kwargs, trace each keyword group to the object that receives it.

from embeddings.config.lightning_config import LightningAdvancedConfig

config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    datamodule_kwargs={
        "downsample_train": 0.01,
        "downsample_val": 0.01,
        "downsample_test": 0.05,
    },
    dataloader_kwargs={"num_workers": 0},
)
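
One generic way to trace which keyword arguments an object accepts is Python's standard `inspect` module. The sketch below uses a hypothetical stand-in class, `ExampleTrainer`, rather than any class from the library; the same `accepted_keywords` helper could be pointed at whichever class ends up receiving a given keyword group.

```python
import inspect
from typing import List


class ExampleTrainer:
    """Hypothetical stand-in for an object that receives task_train_kwargs."""

    def __init__(
        self,
        max_epochs: int = 1,
        accelerator: str = "cpu",
        deterministic: bool = False,
    ) -> None:
        self.max_epochs = max_epochs
        self.accelerator = accelerator
        self.deterministic = deterministic


def accepted_keywords(cls: type) -> List[str]:
    """List the parameter names accepted by a class constructor."""
    return list(inspect.signature(cls).parameters)


print(accepted_keywords(ExampleTrainer))
```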

Static embeddings

Computed vectors are stored in Flair structures.

Document embeddings

from flair.data import Sentence

from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

print(sentence.embedding)

Word embeddings

from flair.data import Sentence

from embeddings.embedding.auto_flair import AutoFlairWordEmbedding

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)

Available embedding models for Polish

Instead of the allegro/herbert-base-cased model, users can pass any model from the HuggingFace Hub that is compatible with Transformers or with our library.

| Embedding | Type | Description |
|---|---|---|
| clarin-pl/herbert-kgr10 | bert | HerBERT Large trained on supplementary data - the KGR10 corpus. |
| clarin-pl/fastText-kgr10 | static, word | FastText trained on the KGR10 corpus. |
| clarin-pl/word2vec-kgr10 | static, word | Word2vec trained on the KGR10 corpus. |
...

Optimized pipelines

Transformers embeddings

| Task | Optimized Pipeline |
|---|---|
| Lightning Text Classification | OptimizedLightingClassificationPipeline |
| Lightning Sequence Labeling | OptimizedLightingSequenceLabelingPipeline |

Static embeddings

| Task | Optimized Pipeline |
|---|---|
| Flair Text Classification | OptimizedFlairClassificationPipeline |
| Flair Pair Text Classification | OptimizedFlairPairClassificationPipeline |
| Flair Sequence Labeling | OptimizedFlairSequenceLabelingPipeline |

Example with Text Classification

Optimized pipelines can be run with the following snippet:

from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path="allegro/herbert-base-cased"
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_params.yaml", log_path="hps_log.pickle")
df, metadata = pipeline.run()

Training model with obtained parameters

After the parameter search, we can train a model with the best parameters found. First, however, we have to set the output_path parameter, which OptimizedLightingClassificationPipeline does not generate automatically.

metadata["output_path"] = "."

Now we can train the pipeline:

from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(**metadata)
results = pipeline.run()
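
The step of adding output_path before unpacking the metadata can be wrapped in a small helper that leaves the original dictionary untouched. This is a sketch under assumptions: `with_output_path` and `stand_in_pipeline` are hypothetical names, the stand-in only mimics a constructor with a required output_path, and the metadata shown is a placeholder with fewer keys than the real search result.

```python
from typing import Any, Dict


def with_output_path(metadata: Dict[str, Any], output_path: str) -> Dict[str, Any]:
    """Return a copy of metadata with output_path set; the input is not modified."""
    completed = dict(metadata)
    completed["output_path"] = output_path
    return completed


def stand_in_pipeline(
    *, embedding_name_or_path: str, dataset_name_or_path: str, output_path: str
) -> str:
    """Hypothetical stand-in for a pipeline constructor that requires output_path."""
    return f"{embedding_name_or_path} on {dataset_name_or_path} -> {output_path}"


# Placeholder metadata; the dictionary returned by the search holds more keys.
metadata = {
    "embedding_name_or_path": "allegro/herbert-base-cased",
    "dataset_name_or_path": "clarin-pl/polemo2-official",
}

print(stand_in_pipeline(**with_output_path(metadata, ".")))
```

Unpacking the metadata without output_path raises a `TypeError` in this stand-in, which is the kind of failure the extra assignment is meant to prevent.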

Selecting the best embedding model

Instead of searching with a single embedding model, we can search over multiple embedding models by passing them as a list to the ConfigSpace.

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path=["allegro/herbert-base-cased", "clarin-pl/roberta-polish-kgr10"]
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_params.yaml", log_path="hps_log.pickle")
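
Conceptually, choosing among several embedding models means scoring each candidate and keeping the best one. The sketch below shows that idea as a plain exhaustive comparison; it is not a description of how the library's search works internally. The function `select_best`, the model names, and the scores are all placeholders invented for illustration and are not measured results.

```python
from typing import Callable, Iterable, Tuple


def select_best(
    candidates: Iterable[str], evaluate: Callable[[str], float]
) -> Tuple[str, float]:
    """Score every candidate and return the (name, score) pair with the top score."""
    scored = [(name, evaluate(name)) for name in candidates]
    if not scored:
        raise ValueError("At least one candidate is required")
    return max(scored, key=lambda pair: pair[1])


# Placeholder names and scores, used only to make the sketch runnable.
PLACEHOLDER_SCORES = {"model-a": 0.5, "model-b": 0.75}

best_name, best_score = select_best(PLACEHOLDER_SCORES, PLACEHOLDER_SCORES.__getitem__)
print(best_name, best_score)
```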