TakeLab / podium

License: BSD-3-Clause
Podium: a framework-agnostic Python NLP library for data loading and preprocessing


Projects that are alternatives of or similar to podium

the-weather-scraper
A Lightweight Weather Scraper
Stars: ✭ 56 (+1.82%)
Mutual labels:  datasets
databrewer
The missing datasets manager. Like Homebrew, but for datasets. A CLI tool to search and discover datasets!
Stars: ✭ 39 (-29.09%)
Mutual labels:  datasets
download audioset
📁 This repo makes it easy to download the raw audio files from AudioSet (32.45 GB, 632 classes).
Stars: ✭ 53 (-3.64%)
Mutual labels:  datasets
tweets-preprocessor
Repo containing the Twitter preprocessor module, developed by the AUTH OSWinds team
Stars: ✭ 26 (-52.73%)
Mutual labels:  preprocessing
pywedge
Makes Interactive Chart Widget, Cleans raw data, Runs baseline models, Interactive hyperparameter tuning & tracking
Stars: ✭ 49 (-10.91%)
Mutual labels:  preprocessing
BrainPrep
Preprocessing pipeline on Brain MR Images through FSL and ANTs, including registration, skull-stripping, bias field correction, enhancement and segmentation.
Stars: ✭ 107 (+94.55%)
Mutual labels:  preprocessing
text-normalizer
Normalize text string
Stars: ✭ 12 (-78.18%)
Mutual labels:  preprocessing
TSForecasting
This repository contains implementations related to experiments on a set of publicly available datasets used in time series forecasting research.
Stars: ✭ 53 (-3.64%)
Mutual labels:  datasets
ck-env
CK repository with components and automation actions to enable portable workflows across diverse platforms including Linux, Windows, MacOS and Android. It includes software detection plugins and meta packages (code, datasets, models, scripts, etc.), with the possibility of multiple versions co-existing in a user or system environment.
Stars: ✭ 67 (+21.82%)
Mutual labels:  datasets
ml4se
A curated list of papers, theses, datasets, and tools related to the application of Machine Learning for Software Engineering
Stars: ✭ 46 (-16.36%)
Mutual labels:  datasets
SER-datasets
A collection of datasets for the purpose of emotion recognition/detection in speech.
Stars: ✭ 74 (+34.55%)
Mutual labels:  datasets
skippa
SciKIt-learn Pipeline in PAndas
Stars: ✭ 33 (-40%)
Mutual labels:  preprocessing
postcss-each
PostCSS plugin to iterate through values
Stars: ✭ 93 (+69.09%)
Mutual labels:  preprocessing
bnk48 photo datasets
BNK48 Photo Datasets
Stars: ✭ 12 (-78.18%)
Mutual labels:  datasets
dplace-data
The data repository for the D-PLACE Project (Database of Places, Language, Culture and Environment)
Stars: ✭ 49 (-10.91%)
Mutual labels:  datasets
panoptic parts
This repository contains code and tools for reading, processing, evaluating on, and visualizing Panoptic Parts datasets. Moreover, it contains code for reproducing our CVPR 2021 paper results.
Stars: ✭ 82 (+49.09%)
Mutual labels:  datasets
dropEst
Pipeline for initial analysis of droplet-based single-cell RNA-seq data
Stars: ✭ 71 (+29.09%)
Mutual labels:  preprocessing
disent
🧶 Modular VAE disentanglement framework for python built with PyTorch Lightning ▸ Including metrics and datasets ▸ With strongly supervised, weakly supervised and unsupervised methods ▸ Easily configured and run with Hydra config ▸ Inspired by disentanglement_lib
Stars: ✭ 41 (-25.45%)
Mutual labels:  datasets
preprocess-conll05
Scripts for preprocessing the CoNLL-2005 SRL dataset.
Stars: ✭ 17 (-69.09%)
Mutual labels:  preprocessing
opendatasets
A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.
Stars: ✭ 161 (+192.73%)
Mutual labels:  datasets

TakeLab Podium

A framework-agnostic Python NLP library for data loading and preprocessing.


What is Podium?

Podium is a framework-agnostic Python natural language processing library which standardizes data loading and preprocessing. Our goal is to accelerate users' development of NLP models, whichever aspect of the library they decide to use.

We want Podium to be lightweight in terms of code and dependencies; flexible, so that it covers most common use-cases and easily adapts to more specific ones; and clearly defined, so that new users can quickly understand the sequence of operations and how to inject their custom functionality.

Check out our documentation for more details. The main source of inspiration for Podium is an old version of torchtext.

Contents

  • Installation
  • Usage
  • Contributing
  • Versioning
  • Authors
  • Citation
  • License

Installation

Installing from pip

You can install Podium using pip:

pip install podium-nlp

Installing from source

To install Podium from source:

git clone git@github.com:TakeLab/podium.git && cd podium
pip install .

For more detailed installation instructions, check the installation page in the documentation.

Usage

Loading datasets

Use some of our pre-defined datasets:

>>> from podium.datasets import SST
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits()
>>> sst_train.finalize_fields() # Trigger vocab construction
>>> print(sst_train)
SST({
    size: 6920,
    fields: [
        Field({
            name: text,
            keep_raw: False,
            is_target: False,
            vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 16284})
        }),
        LabelField({
            name: label,
            keep_raw: False,
            is_target: True,
            vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 2})
        })
    ]
})
>>> print(sst_train[222]) # A short example
Example({
    text: (None, ['A', 'slick', ',', 'engrossing', 'melodrama', '.']),
    label: (None, 'positive')
})
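
Each Example attribute is stored as a (raw, tokenized) pair; since keep_raw is False for both fields here, the raw part is None and only the processed data is kept.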

Load datasets from 🤗 datasets:

>>> from podium.datasets.hf import HFDatasetConverter as HF
>>> import datasets
>>> # Load the huggingface dataset
>>> imdb = datasets.load_dataset('imdb')
>>> print(imdb.keys())
dict_keys(['train', 'test', 'unsupervised'])
>>> # Wrap it so it can be used in Podium (without being loaded in memory!)
>>> imdb_train, imdb_test, imdb_unsupervised = HF.from_dataset_dict(imdb).values()
>>> # We need to trigger Vocab construction
>>> imdb_train.finalize_fields()
>>> print(imdb_train)
HFDatasetConverter({
    dataset_name: imdb,
    size: 25000,
    fields: [
        Field({
            name: 'text',
            keep_raw: False,
            is_target: False,
            vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 280619})
        }),
        LabelField({
            name: 'label',
            keep_raw: False,
            is_target: True
        })
    ]
})
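
Note that label is a LabelField without a Vocab here: the 🤗 dataset already stores labels as integer class ids, so no token-to-index mapping needs to be constructed.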

Load your own dataset from a standardized tabular format (e.g. csv, tsv, jsonl, ...):

>>> from podium.datasets import TabularDataset
>>> from podium import Vocab, Field, LabelField
>>> fields = {'premise':   Field('premise', numericalizer=Vocab()),
...           'hypothesis':Field('hypothesis', numericalizer=Vocab()),
...           'label':     LabelField('label')}
>>> dataset = TabularDataset('my_dataset.csv', format='csv', fields=fields)
>>> dataset.finalize_fields() # Trigger vocab construction
>>> print(dataset)
TabularDataset({
    size: 1,
    fields: [
        Field({
            name: 'premise',
            keep_raw: False,
            is_target: False, 
            vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 15})
        }),
        Field({
            name: 'hypothesis',
            keep_raw: False,
            is_target: False, 
            vocab: Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 6})
        }),
        LabelField({
            name: 'label',
            keep_raw: False,
            is_target: True, 
            vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 1})
        })
    ]
})
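
A minimal my_dataset.csv matching the fields above could look as follows (hypothetical contents, shown purely for illustration; the exact vocab sizes printed above depend on the actual file):

premise,hypothesis,label
A man is reading a book in the park.,A man is outdoors.,entailment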

Check our documentation to see how you can load a dataset from Pandas, the CoNLL format, or define your own Dataset subclass (tutorial coming soon).

Define your preprocessing

We wrap dataset preprocessing in customizable Field classes. Each Field has an optional Vocab instance which automatically handles token-to-index conversion.

>>> from podium import Vocab, Field, LabelField
>>> vocab = Vocab(max_size=5000, min_freq=2)
>>> text = Field(name='text', numericalizer=vocab)
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(vocab)
Vocab({specials: ('<UNK>', '<PAD>'), eager: False, is_finalized: True, size: 5000})

Each Field allows the user full flexibility to modify the data in multiple stages:

  • Prior to tokenization (by using pre-tokenization hooks)
  • During tokenization (by using your own tokenizer)
  • Post tokenization (by using post-tokenization hooks)

You can also completely disregard our preprocessing and define your own by setting your own numericalizer.
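
For instance, the numericalizer can be a callable that maps a token to its index, in the same way tokenizer.convert_tokens_to_ids is used in the BERT snippet further below. A minimal sketch, assuming the callable is applied to each token, where token_index is a hypothetical pre-built token-to-index mapping:

>>> # token_index is a hypothetical pre-built mapping; unknown tokens fall back to 0
>>> token_index = {'<UNK>': 0, 'slick': 1, 'melodrama': 2}
>>> custom_text = Field(name='text',
...                     numericalizer=lambda token: token_index.get(token, 0))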

You could decide to lowercase all the characters and filter out all non-alphanumeric tokens:

>>> def lowercase(raw):
...     return raw.lower()
>>> def filter_alnum(raw, tokenized):
...     filtered_tokens = [token for token in tokenized if
...                        any([char.isalnum() for char in token])]
...     return raw, filtered_tokens
>>> text.add_pretokenize_hook(lowercase)
>>> text.add_posttokenize_hook(filter_alnum)
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> sst_train.finalize_fields()
>>> print(sst_train[222])
Example({
    text: (None, ['a', 'slick', 'engrossing', 'melodrama']),
    label: (None, 'positive')
})

Pre-tokenization hooks accept and can modify only the raw data. Post-tokenization hooks accept and can modify both the raw and tokenized data.
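
Schematically, the two hook types follow the signatures below (identity hooks, shown purely for illustration):

>>> def my_pretokenize_hook(raw):
...     # receives the raw data, returns the (possibly modified) raw data
...     return raw
>>> def my_posttokenize_hook(raw, tokenized):
...     # receives the raw data and the token list, returns both
...     return raw, tokenized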

Use preprocessing from other libraries

A common use-case is to incorporate existing components of pretrained language models, such as BERT. This is very simple to do with our Fields. The following snippet requires the 🤗 transformers library (pip install transformers).

>>> from transformers import BertTokenizer
>>> # Load the tokenizer and fetch pad index
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
>>> # Define a BERT subword Field
>>> subword_field = Field(name="subword",
...                       padding_token=pad_index,
...                       tokenizer=tokenizer.tokenize,
...                       numericalizer=tokenizer.convert_tokens_to_ids)
>>> fields = {'text': subword_field, 'label': label}
>>> sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)
>>> # No need to finalize since we're not using a vocab!
>>> print(sst_train[222])
Example({
    subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),
    label: (None, 'positive')
})
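
Setting padding_token to the tokenizer's pad index ensures that, during batching, sequences are padded with BERT's actual [PAD] id rather than a default value.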

For a more interactive introduction, check out the quickstart notebook on Google Colab.

Full usage examples can be found in our docs under the Examples heading.

Contributing

We welcome contributions! To learn more about making a contribution to Podium, please see our Contribution page and our Roadmap.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See the list of contributors who participated in this project.

Citation

If you are using Podium, please cite the following entry in your work:

@misc{tutek-etal-2021-podium,
  author = {Martin Tutek and Filip Boltužić and Ivan Smoković and Mario Šaško and Silvije Škudar and Domagoj Pluščec and Marin Kačan and Dunja Vesinger and Mate Mijolović and Jan Šnajder},
  title = {Podium: a framework-agnostic NLP preprocessing toolkit},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/TakeLab/podium}},
  commit = {4fed78b8d8366768df10454b8368f416a3305cc4}
}

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.
