State-of-the-art Text Representations for Natural Language Processing tasks. The initial version of the library focuses on the Polish language.
Installation
pip install clarinpl-embeddings
Example
Text-classification with polemo2 dataset and transformer-based embeddings
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline
pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="allegro/herbert-base-cased",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
)
print(pipeline.run())
⚠️ For now, the default pipeline model hyperparameters may give poor results. They are subject to change in future releases. We encourage users to use Optimized Pipelines to select appropriate hyperparameters.
Conventions
We use many HuggingFace concepts, such as models (https://huggingface.co/models) and datasets (https://huggingface.co/datasets), to make our library as easy to use as possible. We want to enable users to create, customise, test, and execute NLP / NLU / SLU tasks as quickly as possible. Moreover, we provide easy-to-use static embeddings that were trained by CLARIN-PL.
Pipelines
We share predefined pipelines for common NLP tasks with corresponding scripts. For Transformer-based pipelines we use PyTorch Lightning.
REMARK: Transformer-based embeddings are currently still accepted by the Flair pipelines, but this support may be removed in the near future. We encourage using the Lightning-based pipelines for transformers.
Transformer embedding based pipelines (e.g. BERT, RoBERTa, HerBERT):
Task | Class | Script |
---|---|---|
Text classification | LightningClassificationPipeline | evaluate_lightning_document_classification.py |
Sequence labelling | LightningSequenceLabelingPipeline | evaluate_lightning_sequence_labeling.py |
Static embedding based pipelines (e.g. word2vec, fasttext)
Task | Class | Script |
---|---|---|
Text classification | FlairClassificationPipeline | evaluate_document_classification.py |
Sequence labelling | FlairSequenceLabelingPipeline | evaluate_sequence_labelling.py |
Sequence pair classification | FlairPairClassificationPipeline | evaluate_document_pair_classification.py |
Writing custom HuggingFace-based pipeline
from pathlib import Path
from embeddings.data.data_loader import HuggingFaceDataLoader
from embeddings.data.dataset import Dataset
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from embeddings.evaluator.text_classification_evaluator import TextClassificationEvaluator
from embeddings.model.flair_model import FlairModel
from embeddings.pipeline.standard_pipeline import StandardPipeline
from embeddings.task.flair_task.text_classification import TextClassification
from embeddings.transformation.flair_transformation.classification_corpus_transformation import (
    ClassificationCorpusTransformation,
)
dataset = Dataset("clarin-pl/polemo2-official")
data_loader = HuggingFaceDataLoader()
transformation = ClassificationCorpusTransformation("text", "target")
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
task = TextClassification(Path("."))
model = FlairModel(embedding, task)
evaluator = TextClassificationEvaluator()
pipeline = StandardPipeline(dataset, data_loader, transformation, model, evaluator)
result = pipeline.run()
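The snippet above composes five stages: dataset, data loader, transformation, model, and evaluator. As a rough mental model, each stage consumes the output of the previous one. The following is a stdlib-only sketch with hypothetical stand-in functions (`load`, `transform`, `model`, `evaluate`, `run_pipeline`); it is not how `StandardPipeline` is implemented, only an illustration of the data flow under that assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for the pipeline stages; none of these are library classes.
Example = Tuple[str, str]  # (text, label)


def load(dataset_name: str) -> List[Dict[str, str]]:
    """Data loader stand-in: return raw records for a dataset name."""
    return [
        {"text": "great hotel", "target": "plus"},
        {"text": "awful service", "target": "minus"},
    ]


def transform(records: List[Dict[str, str]]) -> List[Example]:
    """Transformation stand-in: select the input and target columns."""
    return [(record["text"], record["target"]) for record in records]


def model(corpus: List[Example]) -> Dict[str, List[str]]:
    """Model stand-in: 'predict' labels with a trivial keyword rule."""
    y_pred = ["plus" if "great" in text else "minus" for text, _ in corpus]
    y_true = [label for _, label in corpus]
    return {"y_true": y_true, "y_pred": y_pred}


def evaluate(outputs: Dict[str, List[str]]) -> Dict[str, float]:
    """Evaluator stand-in: compute accuracy from predictions."""
    pairs = list(zip(outputs["y_true"], outputs["y_pred"]))
    return {"accuracy": sum(t == p for t, p in pairs) / len(pairs)}


def run_pipeline(dataset_name: str, stages: List[Callable[[Any], Any]]) -> Any:
    """Feed the output of each stage into the next one."""
    data: Any = dataset_name
    for stage in stages:
        data = stage(data)
    return data


result = run_pipeline("toy-dataset", [load, transform, model, evaluate])
print(result)
```

Swapping one stage (for example a different transformation) leaves the others untouched, which is the main reason for composing a pipeline this way.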
Running tasks scripts
All up-to-date examples can be found under the examples path.
cd examples
Run classification task
An example with non-default arguments:
python evaluate_lightning_document_classification.py \
    --embedding-name-or-path allegro/herbert-base-cased \
    --dataset-name clarin-pl/polemo2-official \
    --input-columns-name text \
    --target-column-name target
Run sequence labeling task
An example with the default language model and dataset:
python evaluate_lightning_sequence_labeling.py
Run pair classification task
An example with a static embedding model:
python evaluate_document_pair_classification.py \
    --embedding-name-or-path clarin-pl/word2vec-kgr10
Compatible datasets
Most datasets in the HuggingFace repository should be compatible with our pipelines. The datasets listed below have been tested by the authors.
dataset name | task type | input_column_name(s) | target_column_name | description |
---|---|---|---|---|
clarin-pl/kpwr-ner | sequence labeling (named entity recognition) | tokens | ner | KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions etc. |
clarin-pl/polemo2-official | classification (sentiment analysis) | text | target | A corpus of consumer reviews from 4 domains: medicine, hotels, products and school. |
clarin-pl/2021-punctuation-restoration | punctuation restoration | text_in | text_out | Dataset contains original texts and ASR output. It is a part of PolEval 2021 Competition. |
clarin-pl/nkjp-pos | sequence labeling (part-of-speech tagging) | tokens | pos_tags | NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc. |
clarin-pl/aspectemo | sequence labeling (sentiment classification) | tokens | labels | AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus of Polish customer reviews used in many projects on the use of different methods in sentiment analysis. |
laugustyniak/political-advertising-pl | sequence labeling (political advertising) | tokens | tags | First publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. |
laugustyniak/abusive-clauses-pl | classification (abusive-clauses) | text | class | Dataset with Polish abusive clauses examples. |
allegro/klej-dyk | pair classification (question answering)* | (question, answer) | target | The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs. |
allegro/klej-psc | pair classification (text summarization)* | (extract_text, summary_text) | label | The Polish Summaries Corpus contains news articles and their summaries. |
allegro/klej-cdsc-e | pair classification (textual entailment)* | (sentence_A, sentence_B) | entailment_judgment | Polish sentence pairs that are human-annotated for textual entailment. |
*only pair classification task is supported for now
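The column names in the table map directly onto the pipeline arguments used in the examples in this document. The sketch below shows one way to look them up; `TESTED_DATASETS` and `pipeline_kwargs` are hypothetical helpers, not part of the library, and the argument names are assumed to follow the Lightning examples. Pair classification datasets are left out because they take two input columns.

```python
from typing import Dict

# Column names copied from the table above (single-input datasets only).
TESTED_DATASETS: Dict[str, Dict[str, str]] = {
    "clarin-pl/kpwr-ner": {"input": "tokens", "target": "ner"},
    "clarin-pl/polemo2-official": {"input": "text", "target": "target"},
    "clarin-pl/nkjp-pos": {"input": "tokens", "target": "pos_tags"},
    "clarin-pl/aspectemo": {"input": "tokens", "target": "labels"},
}


def pipeline_kwargs(dataset_name: str, output_path: str = ".") -> Dict[str, str]:
    """Build keyword arguments for a pipeline from the table entry.

    Raises KeyError for datasets that are not in the lookup table.
    """
    columns = TESTED_DATASETS[dataset_name]
    return {
        "dataset_name_or_path": dataset_name,
        "input_column_name": columns["input"],
        "target_column_name": columns["target"],
        "output_path": output_path,
    }


kwargs = pipeline_kwargs("clarin-pl/kpwr-ner")
print(kwargs)
```

The resulting dictionary could then be unpacked into a pipeline constructor together with an `embedding_name_or_path`, assuming the constructor accepts these argument names.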
Passing task model and task training parameters to predefined Flair pipelines
Model and training parameters can be controlled via the task_model_kwargs and task_train_kwargs parameters, which can be populated using the advanced config. A tutorial on how to use configs can be found in the /tutorials directory of the repository. Two types of config are defined in our library: BasicConfig and AdvancedConfig. In summary, the BasicConfig takes arguments and automatically assigns them to the proper keyword group, while the AdvancedConfig takes as input keyword groups that should already be correctly mapped.
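The difference between the two config types can be illustrated with a small stdlib-only sketch. `ToyAdvancedConfig`, `toy_basic_config`, and the `ARGUMENT_GROUPS` routing table are hypothetical names invented for this illustration; the library's own configs define their own arguments and mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical routing table: which flat argument belongs to which keyword group.
# Illustrative only; the real mapping lives inside the library's config classes.
ARGUMENT_GROUPS: Dict[str, str] = {
    "learning_rate": "task_model_kwargs",
    "weight_decay": "task_model_kwargs",
    "max_epochs": "task_train_kwargs",
    "accelerator": "task_train_kwargs",
}


@dataclass
class ToyAdvancedConfig:
    """Advanced style: the caller passes keyword groups that are already mapped."""

    task_model_kwargs: Dict[str, Any] = field(default_factory=dict)
    task_train_kwargs: Dict[str, Any] = field(default_factory=dict)


def toy_basic_config(**flat_arguments: Any) -> ToyAdvancedConfig:
    """Basic style: take flat arguments and assign each to its keyword group."""
    config = ToyAdvancedConfig()
    for name, value in flat_arguments.items():
        group = ARGUMENT_GROUPS[name]  # KeyError for an unknown argument
        getattr(config, group)[name] = value
    return config


basic = toy_basic_config(learning_rate=0.01, max_epochs=20)
advanced = ToyAdvancedConfig(
    task_model_kwargs={"learning_rate": 0.01},
    task_train_kwargs={"max_epochs": 20},
)
print(basic == advanced)
```

Both calls end up with the same keyword groups; the basic style trades flexibility for not having to know which group an argument belongs to.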
The list of available configs can be found below:
Flair:
- FlairBasicConfig
- FlairSequenceLabelingBasicConfig
- FlairTextClassificationBasicConfig
- FlairSequenceLabelingAdvancedConfig
- FlairTextClassificationAdvancedConfig
Lightning:
- LightningBasicConfig
- LightningAdvancedConfig
Example with polemo2 dataset
Flair pipeline
from embeddings.pipeline.flair_classification import FlairClassificationPipeline
from embeddings.config.flair_config import FlairTextClassificationAdvancedConfig
config = FlairTextClassificationAdvancedConfig(
    task_model_kwargs={
        "loss_weights": {
            "plus": 2.0,
            "minus": 2.0,
        }
    },
    task_train_kwargs={
        "learning_rate": 0.01,
        "max_epochs": 20,
    },
)
pipeline = FlairClassificationPipeline(
    dataset_name="clarin-pl/polemo2-official",
    embedding_name="clarin-pl/word2vec-kgr10",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=config,
)
print(pipeline.run())
Lightning pipeline
from embeddings.config.lightning_config import LightningBasicConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline
config = LightningBasicConfig(
    learning_rate=0.01,
    max_epochs=1,
    max_seq_length=128,
    finetune_last_n_layers=0,
    accelerator="cpu",
)
pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name=["text"],
    target_column_name="target",
    load_dataset_kwargs={
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    },
    output_path=".",
    config=config,
)
You can also define an advanced config with populated keyword arguments. In general, each keyword group is passed on to the corresponding object when a specific pipeline is constructed. To find the arguments that can be set in a given group, trace where that group is passed and check which arguments the receiving object accepts.
from embeddings.config.lightning_config import LightningAdvancedConfig
config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    datamodule_kwargs={
        "downsample_train": 0.01,
        "downsample_val": 0.01,
        "downsample_test": 0.05,
    },
    dataloader_kwargs={"num_workers": 0},
)
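One way to trace which keyword arguments a group may contain is to inspect the signature of the object that receives it. The sketch below uses the standard `inspect` module on a hypothetical `ToyTrainer` class; which class actually receives each keyword group in the library is not stated here, so substitute the real target once you have identified it.

```python
import inspect
from typing import Any, Dict, List


class ToyTrainer:
    """Hypothetical stand-in for the object that receives task_train_kwargs."""

    def __init__(
        self,
        max_epochs: int = 1,
        devices: str = "auto",
        accelerator: str = "cpu",
        deterministic: bool = False,
    ) -> None:
        self.max_epochs = max_epochs
        self.devices = devices
        self.accelerator = accelerator
        self.deterministic = deterministic


def accepted_keywords(target: type) -> List[str]:
    """Return the keyword arguments accepted by the target's constructor."""
    return list(inspect.signature(target).parameters)


def unknown_keywords(target: type, kwargs: Dict[str, Any]) -> List[str]:
    """Return the keys in kwargs that the target's constructor does not accept."""
    return sorted(set(kwargs) - set(accepted_keywords(target)))


# The second key is deliberately misspelled to show how a typo is caught.
task_train_kwargs = {"max_epochs": 1, "acelerator": "cpu"}
print(accepted_keywords(ToyTrainer))
print(unknown_keywords(ToyTrainer, task_train_kwargs))
```

Checking a keyword group this way before building the config surfaces misspelled keys early instead of at pipeline construction time.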
Static embeddings
Computed vectors are stored in Flair structures.
Document embeddings
from flair.data import Sentence
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])
print(sentence.embedding)
Word embeddings
from flair.data import Sentence
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])
for token in sentence:
    print(token)
    print(token.embedding)
Available embedding models for Polish
Instead of the allegro/herbert-base-cased model, the user can pass any model from the HuggingFace Hub that is compatible with Transformers or with our library.
Embedding | Type | Description |
---|---|---|
clarin-pl/herbert-kgr10 | bert | HerBERT Large trained on supplementary data - the KGR10 corpus. |
clarin-pl/fastText-kgr10 | static, word | FastText trained on the KGR10 corpus. |
clarin-pl/word2vec-kgr10 | static, word | Word2vec trained on the KGR10 corpus. |
... |
Optimized pipelines
Transformers embeddings
Task | Optimized Pipeline |
---|---|
Lightning Text Classification | OptimizedLightingClassificationPipeline |
Lightning Sequence Labeling | OptimizedLightingSequenceLabelingPipeline |
Static embeddings
Task | Optimized Pipeline |
---|---|
Flair Text Classification | OptimizedFlairClassificationPipeline |
Flair Pair Text Classification | OptimizedFlairPairClassificationPipeline |
Flair Sequence Labeling | OptimizedFlairSequenceLabelingPipeline |
Example with Text Classification
Optimized pipelines can be run with the following snippet of code:
from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline
pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path="allegro/herbert-base-cased"
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_prams.yaml", log_path="hps_log.pickle")
df, metadata = pipeline.run()
Training model with obtained parameters
After the parameter search, we can train the model with the best parameters found. First, we have to set the output_path parameter, which is not automatically generated by OptimizedLightingClassificationPipeline.
metadata["output_path"] = "."
Now we are able to train the pipeline:
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline
pipeline = LightningClassificationPipeline(**metadata)
results = pipeline.run()
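The reason `output_path` has to be added by hand is that `**metadata` unpacks the dictionary into keyword arguments, so any required argument missing from it makes the constructor call fail. A stdlib-only sketch of that behaviour follows; `toy_pipeline` is a hypothetical stand-in, and the keys in `metadata` are assumptions chosen to mirror the examples in this document, not the exact contents returned by the optimized pipeline.

```python
from typing import Any, Dict


def toy_pipeline(
    dataset_name_or_path: str,
    embedding_name_or_path: str,
    output_path: str,
) -> Dict[str, Any]:
    """Hypothetical stand-in for a pipeline constructor with a required output_path."""
    return {
        "dataset": dataset_name_or_path,
        "embedding": embedding_name_or_path,
        "output_path": output_path,
    }


# Assumed shape of the search metadata: everything except output_path.
metadata: Dict[str, Any] = {
    "dataset_name_or_path": "clarin-pl/polemo2-official",
    "embedding_name_or_path": "allegro/herbert-base-cased",
}

try:
    toy_pipeline(**metadata)
    missing_detected = False
except TypeError as error:
    # Python reports the missing required keyword argument here.
    missing_detected = True
    print(f"Cannot build the pipeline yet: {error}")

metadata["output_path"] = "."
settings = toy_pipeline(**metadata)
print(settings)
```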
Selecting the best embedding model
Instead of searching with a single embedding model, we can search over multiple embedding models by passing them as a list to the ConfigSpace.
from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline
pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path=["allegro/herbert-base-cased", "clarin-pl/roberta-polish-kgr10"]
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_prams.yaml", log_path="hps_log.pickle")
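Conceptually, passing a list of embeddings turns the embedding name into one more dimension of the search space. The sketch below illustrates this with a plain exhaustive grid search over placeholder names; it does not reproduce the library's actual search strategy, and `toy_objective`, the model names `model-a` and `model-b`, and all scores are made-up placeholders that say nothing about real models.

```python
from itertools import product
from typing import Dict, List, Tuple, Union

# Placeholder search space; names and values are illustrative only.
EMBEDDINGS: List[str] = ["model-a", "model-b"]
LEARNING_RATES: List[float] = [1e-5, 5e-5, 1e-4]

# Arbitrary base scores standing in for a real validation metric.
TOY_BASE_SCORE: Dict[str, float] = {"model-a": 0.6, "model-b": 0.7}


def toy_objective(embedding: str, learning_rate: float) -> float:
    """Stand-in for 'train with these parameters and return a validation score'."""
    return TOY_BASE_SCORE[embedding] - abs(learning_rate - 5e-5) * 1000


def grid_search() -> Tuple[Dict[str, Union[str, float]], float]:
    """Score every (embedding, learning rate) pair and return the best one."""
    candidates = product(EMBEDDINGS, LEARNING_RATES)
    best = max(candidates, key=lambda candidate: toy_objective(*candidate))
    params: Dict[str, Union[str, float]] = {
        "embedding_name_or_path": best[0],
        "learning_rate": best[1],
    }
    return params, toy_objective(*best)


best_params, best_score = grid_search()
print(best_params, best_score)
```

The selected embedding is simply part of the best parameter set, so it can be handed to the final training run together with the other hyperparameters.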