
Projects that are alternatives of or similar to textwiser

consistency
Implementation of models in our EMNLP 2019 paper: A Logic-Driven Framework for Consistency of Neural Models
Stars: ✭ 26 (+0%)
Mutual labels:  bert
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (+111.54%)
Mutual labels:  bert
BERT-embedding
A simple wrapper class for extracting features (embeddings) and comparing them using BERT in TensorFlow
Stars: ✭ 24 (-7.69%)
Mutual labels:  bert
BERT-BiLSTM-CRF
A Keras implementation of BERT-BiLSTM-CRF
Stars: ✭ 40 (+53.85%)
Mutual labels:  bert
classifier multi label
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification
Stars: ✭ 127 (+388.46%)
Mutual labels:  bert
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-34.62%)
Mutual labels:  bert
Transformers-Tutorials
This repository contains demos I made with the Transformers library by HuggingFace.
Stars: ✭ 2,828 (+10776.92%)
Mutual labels:  bert
JD2Skills-BERT-XMLC
Code and Dataset for the Bhola et al. (2020) Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework
Stars: ✭ 33 (+26.92%)
Mutual labels:  bert
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+188.46%)
Mutual labels:  bert
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (+115.38%)
Mutual labels:  bert
bert nli
A Natural Language Inference (NLI) model based on Transformers (BERT and ALBERT)
Stars: ✭ 97 (+273.08%)
Mutual labels:  bert
classifier multi label seq2seq attention
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search
Stars: ✭ 26 (+0%)
Mutual labels:  bert
rasa milktea chatbot
Chatbot with a Chinese BERT model, based on the rasa framework (Chinese chatbot combining BERT intent analysis, built on rasa)
Stars: ✭ 97 (+273.08%)
Mutual labels:  bert
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (+46.15%)
Mutual labels:  bert
mirror-bert
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.
Stars: ✭ 56 (+115.38%)
Mutual labels:  bert
anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+226.92%)
Mutual labels:  bert
bert-tensorflow-pytorch-spacy-conversion
Instructions for how to convert a BERT TensorFlow model to work with HuggingFace's pytorch-transformers and spaCy. This walk-through uses DeepPavlov's RuBERT as an example.
Stars: ✭ 26 (+0%)
Mutual labels:  bert
tfbert
Calling pretrained models with TensorFlow 1.x, supporting single-machine multi-GPU training, gradient accumulation, XLA acceleration, and mixed precision. Flexible training, validation, and prediction.
Stars: ✭ 54 (+107.69%)
Mutual labels:  bert
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-7.69%)
Mutual labels:  bert
task-transferability
Data and code for our paper "Exploring and Predicting Transferability across NLP Tasks", to appear at EMNLP 2020.
Stars: ✭ 35 (+34.62%)
Mutual labels:  bert


TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods, while taking advantage of pretrained models provided by state-of-the-art libraries.

The main contributions include:

  • Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from.
  • Fine-Tuning: Designed to support a PyTorch backend and hence retains the ability to fine-tune featurizations for downstream tasks. That means that if you pass the resulting fine-tunable embeddings to a training method, the features will be optimized automatically for your application.
  • Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user; see the pipeline sketch after this list.
  • Grammar of Embeddings: Introduces a novel approach to design embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization.
  • GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU.
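
As a concrete example of the scikit-learn interoperability mentioned above, TextWiser can be dropped into a standard Pipeline and tuned with GridSearchCV. This is a minimal, hedged sketch: the toy documents, labels, classifier, and parameter grid are illustrative assumptions and not part of TextWiser itself.

# Hedged sketch: TextWiser as the featurization step of a scikit-learn pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from textwiser import TextWiser, Embedding

docs = ["good product", "bad product", "great value", "terrible value"]  # illustrative data
labels = [1, 0, 1, 0]                                                    # hypothetical labels

pipe = Pipeline([("featurizer", TextWiser(Embedding.TfIdf(min_df=1))),
                 ("classifier", LogisticRegression())])

# Tune the classifier here; TextWiser parameters could be searched the same way
search = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(docs, labels)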

TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. Here is the video of the paper presentation at AAAI 2021.

Quick Start

# Conceptually, TextWiser is composed of an Embedding, potentially with a pretrained model,
# that can be chained into zero or more Transformations
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Data
documents = ["Some document", "More documents. Including multi-sentence documents."]

# Model: TFIDF `min_df` parameter gets passed to sklearn automatically
emb = TextWiser(Embedding.TfIdf(min_df=1))

# Model: TFIDF followed by NMF + SVD
emb = TextWiser(Embedding.TfIdf(min_df=1), [Transformation.NMF(n_components=30), Transformation.SVD(n_components=10)])

# Model: Word2Vec with no pretraining that learns from the input data
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None), Transformation.Pool(pool_option=PoolOptions.min))

# Model: BERT with the pretrained bert-base-uncased embedding
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert), Transformation.Pool(pool_option=PoolOptions.first))

# Features
vecs = emb.fit_transform(documents)
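
As a quick, hedged sanity check (not part of the original Quick Start): the result is a document-by-feature matrix with one row per input document.

# For the BERT model above, the expected shape is (2, 768), since bert-base-uncased
# produces 768-dimensional vectors and pooling keeps one vector per document.
print(vecs.shape)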

Available Embeddings

  • Bag of Words (BoW): Supported by scikit-learn. Defaults to training from scratch.
  • Term Frequency Inverse Document Frequency (TfIdf): Supported by scikit-learn. Defaults to training from scratch.
  • Document Embeddings (Doc2Vec): Supported by gensim. Defaults to training from scratch.
  • Universal Sentence Encoder (USE): Supported by tensorflow; see requirements. Defaults to large v5.
  • Compound Embedding: Supported by a context-free grammar.
  • Word Embedding (Word2Vec): Supported by these pretrained embeddings. Common pretrained options include crawl, glove, extvec, twitter, and en-news. When the pretrained option is None, trains a new model from the given data. Defaults to en, FastText embeddings trained on news.
  • Word Embedding (Character): Initialized randomly and not pretrained. Useful when trained for a downstream task. Enable fine-tuning to get good embeddings.
  • Word Embedding (BytePair): Supported by these pretrained embeddings. Pretrained options can be specified with the string <lang>_<dim>_<vocab_size>. Default options can be omitted, as in en, en_100, or en__10000. Defaults to en, which is equal to en_100_10000.
  • Word Embedding (ELMo): Supported by these pretrained embeddings from AllenNLP. Defaults to original.
  • Word Embedding (Flair): Supported by these pretrained embeddings. Defaults to news-forward-fast.
  • Word Embedding (BERT): Supported by these pretrained embeddings. Defaults to bert-base-uncased.
  • Word Embedding (OpenAI GPT): Supported by these pretrained embeddings. Defaults to openai-gpt.
  • Word Embedding (OpenAI GPT2): Supported by these pretrained embeddings. Defaults to gpt2-medium.
  • Word Embedding (TransformerXL): Supported by these pretrained embeddings. Defaults to transfo-xl-wt103.
  • Word Embedding (XLNet): Supported by these pretrained embeddings. Defaults to xlnet-large-cased.
  • Word Embedding (XLM): Supported by these pretrained embeddings. Defaults to xlm-mlm-en-2048.
  • Word Embedding (RoBERTa): Supported by these pretrained embeddings. Defaults to roberta-base.
  • Word Embedding (DistilBERT): Supported by these pretrained embeddings. Defaults to distilbert-base-uncased.
  • Word Embedding (CTRL): Supported by these pretrained embeddings. Defaults to ctrl.
  • Word Embedding (ALBERT): Supported by these pretrained embeddings. Defaults to albert-base-v2.
  • Word Embedding (T5): Supported by these pretrained embeddings. Defaults to t5-base.
  • Word Embedding (XLM-RoBERTa): Supported by these pretrained embeddings. Defaults to xlm-roberta-base.
  • Word Embedding (BART): Supported by these pretrained embeddings. Defaults to facebook/bart-base.
  • Word Embedding (ELECTRA): Supported by these pretrained embeddings. Defaults to google/electra-base-generator.
  • Word Embedding (DialoGPT): Supported by these pretrained embeddings. Defaults to microsoft/DialoGPT-small.
  • Word Embedding (Longformer): Supported by these pretrained embeddings. Defaults to allenai/longformer-base-4096.
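
As an illustration of how the pretrained options listed above map onto the API (a hedged sketch; the option strings follow the list, but the exact argument values should be checked against the documentation):

# Hedged sketch: selecting non-default pretrained options
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# GloVe word vectors pooled into document vectors
emb_glove = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='glove'),
                      Transformation.Pool(pool_option=PoolOptions.mean))

# A non-default BERT checkpoint; 'bert-base-cased' is an assumed alternative to the default
emb_bert = TextWiser(Embedding.Word(word_option=WordOptions.bert, pretrained='bert-base-cased'),
                     Transformation.Pool(pool_option=PoolOptions.mean))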

Available Transformations

  • Singular Value Decomposition (SVD): Differentiable.
  • Latent Dirichlet Allocation (LDA): Not differentiable.
  • Non-negative Matrix Factorization (NMF): Not differentiable.
  • Uniform Manifold Approximation and Projection (UMAP): Not differentiable.
  • Pooling Word Vectors: Applies to word embeddings only. Reduces word-level vectors to document-level vectors. Pool options include max, min, mean, first, and last. Defaults to max.

Usage Examples

Examples can be found under the notebooks folder.

Installation

TextWiser requires Python 3.6+ and can be installed from PyPI with pip install textwiser, with pip install textwiser[full] for all optional dependencies, or built from source by following the instructions in our documentation.

Compound Embedding

A unique research contribution of TextWiser lies in its novel approach to creating embeddings from components, called the Compound Embedding.

This method allows forming arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurization. You can see the details in our documentation and in the usage example.
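
Below is a hedged sketch of what a compound schema could look like, concatenating a TfIdf and a BoW featurization and reducing the result with SVD. The schema keys (transform, concat) and option names follow the documented grammar, but treat this as an illustration rather than a verified example.

# Hedged sketch of a compound embedding schema
from textwiser import TextWiser, Embedding

schema = {
    "transform": [
        {"concat": ["tfidf", "bow"]},     # concatenate two featurizations
        ("svd", {"n_components": 2}),     # then reduce the concatenation
    ]
}

emb = TextWiser(Embedding.Compound(schema=schema))
vecs = emb.fit_transform(documents)  # documents as defined in the Quick Start above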

Fine-Tuning for Downstream Tasks

All Word2Vec and transformer-based embeddings, as well as any embedding followed by an SVD transformation, are fine-tunable for downstream tasks. In other words, if you pass the resulting fine-tunable embedding to a PyTorch training method, the features will automatically be trained for your application. You can see the details in our documentation and in the usage example.
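
Below is a hedged sketch of what such fine-tuning could look like. The is_finetuneable and dtype arguments follow the documentation, while the downstream head, labels, and training loop are illustrative assumptions.

# Hedged sketch: fine-tuning a TextWiser featurization inside a PyTorch training loop
import torch
import torch.nn as nn
from textwiser import TextWiser, Embedding, Transformation

documents = ["Some document", "More documents. Including multi-sentence documents."]
labels = torch.tensor([0, 1])  # hypothetical labels for a 2-class downstream task

emb = TextWiser(Embedding.TfIdf(min_df=1), Transformation.SVD(n_components=2),
                is_finetuneable=True, dtype=torch.float32)
emb.fit(documents)

head = nn.Linear(2, 2)  # illustrative downstream classifier
# Assumes TextWiser exposes torch parameters through its PyTorch backend
optimizer = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=1e-3)

for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(head(emb.transform(documents)), labels)
    loss.backward()
    optimizer.step()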

Tokenization

In general, text data should be whitespace-tokenized before being fed into TextWiser. Customized tokenization is also supported, as described in more detail in our documentation.
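
A hedged illustration of such whitespace pre-tokenization (the regular expression here is an assumption; any tokenizer that yields whitespace-separated tokens would do):

import re

raw = ["Hello, world! This is TextWiser."]
# Split off punctuation, then re-join with spaces so TextWiser receives
# whitespace-tokenized text: "Hello , world ! This is TextWiser ."
documents = [" ".join(re.findall(r"\w+|[^\w\s]", doc)) for doc in raw]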

Support

Please submit bug reports, questions and feature requests as Issues.

Citation

If you use TextWiser in a publication, please cite it as:

  @article{textwiser2021,
    author={Kilitcioglu, Doruk and Kadioglu, Serdar},
    title={Representing the Unification of Text Featurization using a Context-Free Grammar},
    url={https://github.com/fidelity/textwiser},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={35},
    number={17},
    year={2021},
    month={May},
    pages={15439-15445}
  }

License

TextWiser is licensed under the Apache License 2.0.

