
Projects that are alternatives of or similar to textwiser

consistency
Implementation of models in our EMNLP 2019 paper: A Logic-Driven Framework for Consistency of Neural Models
Stars: ✭ 26 (+0%)
Mutual labels:  bert
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (+111.54%)
Mutual labels:  bert
BERT-embedding
A simple wrapper class for extracting features (embeddings) and comparing them using BERT in TensorFlow
Stars: ✭ 24 (-7.69%)
Mutual labels:  bert
BERT-BiLSTM-CRF
A Keras implementation of BERT-BiLSTM-CRF
Stars: ✭ 40 (+53.85%)
Mutual labels:  bert
classifier multi label
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification
Stars: ✭ 127 (+388.46%)
Mutual labels:  bert
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-34.62%)
Mutual labels:  bert
Transformers-Tutorials
This repository contains demos I made with the Transformers library by HuggingFace.
Stars: ✭ 2,828 (+10776.92%)
Mutual labels:  bert
JD2Skills-BERT-XMLC
Code and Dataset for the Bhola et al. (2020) Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework
Stars: ✭ 33 (+26.92%)
Mutual labels:  bert
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+188.46%)
Mutual labels:  bert
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (+115.38%)
Mutual labels:  bert
bert nli
A Natural Language Inference (NLI) model based on Transformers (BERT and ALBERT)
Stars: ✭ 97 (+273.08%)
Mutual labels:  bert
classifier multi label seq2seq attention
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search
Stars: ✭ 26 (+0%)
Mutual labels:  bert
rasa milktea chatbot
Chatbot with a Chinese BERT model, based on the rasa framework (Chinese chatbot combining BERT intent analysis, built on rasa)
Stars: ✭ 97 (+273.08%)
Mutual labels:  bert
Text-Summarization
Abstractive and Extractive Text summarization using Transformers.
Stars: ✭ 38 (+46.15%)
Mutual labels:  bert
mirror-bert
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.
Stars: ✭ 56 (+115.38%)
Mutual labels:  bert
anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+226.92%)
Mutual labels:  bert
bert-tensorflow-pytorch-spacy-conversion
Instructions for how to convert a BERT TensorFlow model to work with HuggingFace's pytorch-transformers and spaCy. This walk-through uses DeepPavlov's RuBERT as an example.
Stars: ✭ 26 (+0%)
Mutual labels:  bert
tfbert
Calling pretrained models with TensorFlow 1.x, supporting single-machine multi-GPU training, gradient accumulation, XLA acceleration, and mixed precision. Flexible training, validation, and prediction.
Stars: ✭ 54 (+107.69%)
Mutual labels:  bert
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-7.69%)
Mutual labels:  bert
task-transferability
Data and code for our paper "Exploring and Predicting Transferability across NLP Tasks", to appear at EMNLP 2020.
Stars: ✭ 35 (+34.62%)
Mutual labels:  bert


TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods, while taking advantage of pretrained models provided by state-of-the-art libraries.

The main contributions include:

  • Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from.
  • Fine-Tuning: Designed to support a PyTorch backend and hence retains the ability to fine-tune featurizations for downstream tasks. That means that if you pass the resulting fine-tunable embeddings to a training method, the features will be optimized automatically for your application.
  • Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user; see the pipeline sketch after this list.
  • Grammar of Embeddings: Introduces a novel approach to design embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization.
  • GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU.
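
As a concrete example of the scikit-learn interoperability mentioned above, TextWiser can be dropped into a standard Pipeline and tuned with GridSearchCV. This is a minimal, hedged sketch: the toy documents, labels, classifier, and parameter grid are illustrative assumptions and not part of TextWiser itself.

# Hedged sketch: TextWiser as the featurization step of a scikit-learn pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from textwiser import TextWiser, Embedding

docs = ["good product", "bad product", "great value", "terrible value"]  # illustrative data
labels = [1, 0, 1, 0]                                                    # hypothetical labels

pipe = Pipeline([("featurizer", TextWiser(Embedding.TfIdf(min_df=1))),
                 ("classifier", LogisticRegression())])

# Tune the classifier here; TextWiser parameters could be searched the same way
search = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(docs, labels)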

TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. Here is the video of the paper presentation at AAAI 2021.

Quick Start

# Conceptually, TextWiser is composed of an Embedding, potentially with a pretrained model,
# that can be chained into zero or more Transformations
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Data
documents = ["Some document", "More documents. Including multi-sentence documents."]

# Model: TFIDF `min_df` parameter gets passed to sklearn automatically
emb = TextWiser(Embedding.TfIdf(min_df=1))

# Model: TFIDF followed by NMF + SVD
emb = TextWiser(Embedding.TfIdf(min_df=1), [Transformation.NMF(n_components=30), Transformation.SVD(n_components=10)])

# Model: Word2Vec with no pretraining that learns from the input data
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None), Transformation.Pool(pool_option=PoolOptions.min))

# Model: BERT with the pretrained bert-base-uncased embedding
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert), Transformation.Pool(pool_option=PoolOptions.first))

# Features
vecs = emb.fit_transform(documents)
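
As a quick, hedged sanity check (not part of the original Quick Start): the result is a document-by-feature matrix with one row per input document.

# For the BERT model above, the expected shape is (2, 768), since bert-base-uncased
# produces 768-dimensional vectors and pooling keeps one vector per document.
print(vecs.shape)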

Available Embeddings

  • Bag of Words (BoW): Supported by scikit-learn. Defaults to training from scratch.
  • Term Frequency Inverse Document Frequency (TfIdf): Supported by scikit-learn. Defaults to training from scratch.
  • Document Embeddings (Doc2Vec): Supported by gensim. Defaults to training from scratch.
  • Universal Sentence Encoder (USE): Supported by tensorflow; see requirements. Defaults to large v5.
  • Compound Embedding: Supported by a context-free grammar.
  • Word Embedding (Word2Vec): Supported by these pretrained embeddings. Common pretrained options include crawl, glove, extvec, twitter, and en-news. When the pretrained option is None, trains a new model from the given data. Defaults to en, FastText embeddings trained on news.
  • Word Embedding (Character): Initialized randomly and not pretrained. Useful when trained for a downstream task. Enable fine-tuning to get good embeddings.
  • Word Embedding (BytePair): Supported by these pretrained embeddings. Pretrained options can be specified with the string <lang>_<dim>_<vocab_size>. Default options can be omitted, as in en, en_100, or en__10000. Defaults to en, which is equal to en_100_10000.
  • Word Embedding (ELMo): Supported by these pretrained embeddings from AllenNLP. Defaults to original.
  • Word Embedding (Flair): Supported by these pretrained embeddings. Defaults to news-forward-fast.
  • Word Embedding (BERT): Supported by these pretrained embeddings. Defaults to bert-base-uncased.
  • Word Embedding (OpenAI GPT): Supported by these pretrained embeddings. Defaults to openai-gpt.
  • Word Embedding (OpenAI GPT2): Supported by these pretrained embeddings. Defaults to gpt2-medium.
  • Word Embedding (TransformerXL): Supported by these pretrained embeddings. Defaults to transfo-xl-wt103.
  • Word Embedding (XLNet): Supported by these pretrained embeddings. Defaults to xlnet-large-cased.
  • Word Embedding (XLM): Supported by these pretrained embeddings. Defaults to xlm-mlm-en-2048.
  • Word Embedding (RoBERTa): Supported by these pretrained embeddings. Defaults to roberta-base.
  • Word Embedding (DistilBERT): Supported by these pretrained embeddings. Defaults to distilbert-base-uncased.
  • Word Embedding (CTRL): Supported by these pretrained embeddings. Defaults to ctrl.
  • Word Embedding (ALBERT): Supported by these pretrained embeddings. Defaults to albert-base-v2.
  • Word Embedding (T5): Supported by these pretrained embeddings. Defaults to t5-base.
  • Word Embedding (XLM-RoBERTa): Supported by these pretrained embeddings. Defaults to xlm-roberta-base.
  • Word Embedding (BART): Supported by these pretrained embeddings. Defaults to facebook/bart-base.
  • Word Embedding (ELECTRA): Supported by these pretrained embeddings. Defaults to google/electra-base-generator.
  • Word Embedding (DialoGPT): Supported by these pretrained embeddings. Defaults to microsoft/DialoGPT-small.
  • Word Embedding (Longformer): Supported by these pretrained embeddings. Defaults to allenai/longformer-base-4096.
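
As an illustration of how the pretrained options listed above map onto the API (a hedged sketch; the option strings follow the list, but the exact argument values should be checked against the documentation):

# Hedged sketch: selecting non-default pretrained options
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# GloVe word vectors pooled into document vectors
emb_glove = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='glove'),
                      Transformation.Pool(pool_option=PoolOptions.mean))

# A non-default BERT checkpoint; 'bert-base-cased' is an assumed alternative to the default
emb_bert = TextWiser(Embedding.Word(word_option=WordOptions.bert, pretrained='bert-base-cased'),
                     Transformation.Pool(pool_option=PoolOptions.mean))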

Available Transformations

  • Singular Value Decomposition (SVD): Differentiable.
  • Latent Dirichlet Allocation (LDA): Not differentiable.
  • Non-negative Matrix Factorization (NMF): Not differentiable.
  • Uniform Manifold Approximation and Projection (UMAP): Not differentiable.
  • Pooling Word Vectors: Applies to word embeddings only. Reduces word-level vectors to document-level vectors. Pool options include max, min, mean, first, and last. Defaults to max.

Usage Examples

Examples can be found under the notebooks folder.

Installation

TextWiser requires Python 3.6+ and can be installed from PyPI with pip install textwiser, with pip install textwiser[full] for all optional dependencies, or built from source by following the instructions in our documentation.

Compound Embedding

A unique research contribution of TextWiser lies in its novel approach to creating embeddings from components, called the Compound Embedding.

This method allows forming arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurization. You can see the details in our documentation and in the usage example.
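
Below is a hedged sketch of what a compound schema could look like, concatenating a TfIdf and a BoW featurization and reducing the result with SVD. The schema keys (transform, concat) and option names follow the documented grammar, but treat this as an illustration rather than a verified example.

# Hedged sketch of a compound embedding schema
from textwiser import TextWiser, Embedding

schema = {
    "transform": [
        {"concat": ["tfidf", "bow"]},     # concatenate two featurizations
        ("svd", {"n_components": 2}),     # then reduce the concatenation
    ]
}

emb = TextWiser(Embedding.Compound(schema=schema))
vecs = emb.fit_transform(documents)  # documents as defined in the Quick Start above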

Fine-Tuning for Downstream Tasks

All Word2Vec and transformer-based embeddings, as well as any embedding followed by an SVD transformation, are fine-tunable for downstream tasks. In other words, if you pass the resulting fine-tunable embedding to a PyTorch training method, the features will automatically be trained for your application. You can see the details in our documentation and in the usage example.
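
Below is a hedged sketch of what such fine-tuning could look like. The is_finetuneable and dtype arguments follow the documentation, while the downstream head, labels, and training loop are illustrative assumptions.

# Hedged sketch: fine-tuning a TextWiser featurization inside a PyTorch training loop
import torch
import torch.nn as nn
from textwiser import TextWiser, Embedding, Transformation

documents = ["Some document", "More documents. Including multi-sentence documents."]
labels = torch.tensor([0, 1])  # hypothetical labels for a 2-class downstream task

emb = TextWiser(Embedding.TfIdf(min_df=1), Transformation.SVD(n_components=2),
                is_finetuneable=True, dtype=torch.float32)
emb.fit(documents)

head = nn.Linear(2, 2)  # illustrative downstream classifier
# Assumes TextWiser exposes torch parameters through its PyTorch backend
optimizer = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=1e-3)

for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(head(emb.transform(documents)), labels)
    loss.backward()
    optimizer.step()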

Tokenization

In general, text data should be whitespace-tokenized before being fed into TextWiser. Customized tokenization is also supported, as described in more detail in our documentation.
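
A hedged illustration of such whitespace pre-tokenization (the regular expression here is an assumption; any tokenizer that yields whitespace-separated tokens would do):

import re

raw = ["Hello, world! This is TextWiser."]
# Split off punctuation, then re-join with spaces so TextWiser receives
# whitespace-tokenized text: "Hello , world ! This is TextWiser ."
documents = [" ".join(re.findall(r"\w+|[^\w\s]", doc)) for doc in raw]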

Support

Please submit bug reports, questions and feature requests as Issues.

Citation

If you use TextWiser in a publication, please cite it as:

  @article{textwiser2021,
    author={Kilitcioglu, Doruk and Kadioglu, Serdar},
    title={Representing the Unification of Text Featurization using a Context-Free Grammar},
    url={https://github.com/fidelity/textwiser},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={35},
    number={17},
    year={2021},
    month={May},
    pages={15439-15445}
  }

License

TextWiser is licensed under the Apache License 2.0.

