
OctoberChang / X-Transformer

License: BSD-3-Clause
X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification

Programming Languages

C++
Python
Shell
Makefile

Projects that are alternatives of or similar to X-Transformer

text-classification-transformers
Easy text classification for everyone : Bert based models via Huggingface transformers (KR / EN)
Stars: ✭ 32 (-74.8%)
Mutual labels:  text-classification, transformers
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-76.38%)
Mutual labels:  text-classification, transformers
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-81.1%)
Mutual labels:  text-classification, transformers
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+80.31%)
Mutual labels:  text-classification, transformers
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-82.68%)
Mutual labels:  text-classification, transformers
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+18.9%)
Mutual labels:  text-classification, transformers
small-text
Active Learning for Text Classification in Python
Stars: ✭ 241 (+89.76%)
Mutual labels:  text-classification, transformers
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (-33.07%)
Mutual labels:  text-classification, transformers
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-88.19%)
Mutual labels:  text-classification, transformers
Text and Audio classification with Bert
Text Classification in Turkish Texts with Bert
Stars: ✭ 34 (-73.23%)
Mutual labels:  text-classification, transformers
Simpletransformers
Transformers for Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
Stars: ✭ 2,881 (+2168.5%)
Mutual labels:  text-classification, transformers
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+1882.68%)
Mutual labels:  text-classification, transformers
Ask2Transformers
A Framework for Textual Entailment based Zero Shot text classification
Stars: ✭ 102 (-19.69%)
Mutual labels:  text-classification, transformers
KnowledgeEditor
Code for Editing Factual Knowledge in Language Models
Stars: ✭ 86 (-32.28%)
Mutual labels:  transformers
character-level-cnn
Keras implementation of Character-level CNN for Text Classification
Stars: ✭ 56 (-55.91%)
Mutual labels:  text-classification
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (-27.56%)
Mutual labels:  text-classification
clip-italian
CLIP (Contrastive Language–Image Pre-training) for Italian
Stars: ✭ 113 (-11.02%)
Mutual labels:  transformers
ginza-transformers
Use custom tokenizers in spacy-transformers
Stars: ✭ 15 (-88.19%)
Mutual labels:  transformers
nlp workshop odsc europe20
Extensive tutorials for the Advanced NLP Workshop in Open Data Science Conference Europe 2020. We will leverage machine learning, deep learning and deep transfer learning to learn and solve popular tasks using NLP including NER, Classification, Recommendation \ Information Retrieval, Summarization, Classification, Language Translation, Q&A and T…
Stars: ✭ 127 (+0%)
Mutual labels:  transformers
Basic-UI-for-GPT-J-6B-with-low-vram
A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.
Stars: ✭ 90 (-29.13%)
Mutual labels:  transformers

Taming Pretrained Transformers for XMC problems

This is the README for the experimental code of the following paper:

Taming Pretrained Transformers for eXtreme Multi-label Text Classification

Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon

KDD 2020

Updates (2021-04-27)

The latest implementation of X-Transformer (faster training with stronger performance) is available at PECOS; feel free to try it out!

Installation

Dependencies via Conda Environment

> conda env create -f environment.yml
> source activate pt1.2_xmlc_transformer
> (pt1.2_xmlc_transformer) pip install -e .
> (pt1.2_xmlc_transformer) python setup.py install --force

Notice: the following examples are executed under the (pt1.2_xmlc_transformer) conda virtual environment.

Reproduce Evaluation Results in the Paper

We demonstrate how to reproduce the evaluation results in our paper by downloading the raw datasets and pretrained models.

Download Dataset (Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K)

Change directory into the ./datasets folder, then download and unzip each dataset:

cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../

Each dataset contains the following files (a quick loading check in Python follows the list):

  • label_map.txt: each line is the raw text of the label
  • train_raw_text.txt, test_raw_text.txt: each line is the raw text of the instance
  • X.trn.npz, X.tst.npz: instance's embedding matrix (either sparse TF-IDF or fine-tuned dense embedding)
  • Y.trn.npz, Y.tst.npz: instance-to-label assignment matrix
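For a quick sanity check after downloading, the matrices and text files can be loaded with scipy; a minimal sketch, assuming the file layout listed above (shown for Eurlex-4K):

import scipy.sparse as smat

data_dir = "./datasets/Eurlex-4K"

# sparse instance features and instance-to-label assignments
X_trn = smat.load_npz(f"{data_dir}/X.trn.npz")   # (N_trn, D)
Y_trn = smat.load_npz(f"{data_dir}/Y.trn.npz")   # (N_trn, L)

# raw text: one instance (or label) per line
with open(f"{data_dir}/train_raw_text.txt", encoding="utf-8") as f:
    trn_text = f.read().splitlines()
with open(f"{data_dir}/label_map.txt", encoding="utf-8") as f:
    label_map = f.read().splitlines()

assert X_trn.shape[0] == Y_trn.shape[0] == len(trn_text)
assert Y_trn.shape[1] == len(label_map)
print(X_trn.shape, Y_trn.shape, len(label_map))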

Download Pretrained Models (processed data, Indexing codes, fine-tuned Transformer models)

Change directory into the ./pretrained_models folder, then download and unzip the models for each dataset:

cd ./pretrained_models
bash download-models.sh Eurlex-4K
bash download-models.sh Wiki10-31K
bash download-models.sh AmazonCat-13K
bash download-models.sh Wiki-500K
cd ../

Each folder has the following structure:

  • proc_data: a sub-folder containing: X.{trn|tst}.{model}.128.pkl, C.{label-emb}.npz, L.{label-emb}.npz
  • pifa-tfidf-s0: a sub-folder containing indexer and matcher
  • pifa-neural-s0: a sub-folder containing indexer and matcher
  • text-emb-s0: a sub-folder containing indexer and matcher

Evaluate Linear Models

Given the provided indexing codes (label-to-cluster assignments), train/predict linear models, and evaluate with Precision/Recall@k:

bash eval_linear.sh ${DATASET} ${VERSION}
  • DATASET: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.
  • VERSION: v0 = sparse TF-IDF features; v1 = sparse TF-IDF features concatenated with dense fine-tuned XLNet embeddings.

The evaluation results should be located at ./results_linear/${DATASET}.${VERSION}.txt

Evaluate Fine-tuned X-Transformer Models

Given the provided indexing codes (label-to-cluster assignments) and the fine-tuned Transformer models, train/predict the ranker of the X-Transformer framework, and evaluate with Precision/Recall@k:

bash eval_transformer.sh ${DATASET}
  • DATASET: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.

The evaluation results should be located at ./results_transformer/${DATASET}.final.txt

Running X-Transformer on customized datasets

The X-Transformer framework consists of 9 configurations (3 label embeddings times 3 model types). For simplicity, we show 1 of the 9 here, using LABEL_EMB=pifa-tfidf and MODEL_TYPE=bert.

We will use Eurlex-4K as an example. In the ./datasets/Eurlex-4K folder, we assume the following files are provided (a sketch for preparing them from a custom corpus follows the list):

  • X.trn.npz: the instance TF-IDF feature matrix for the train set. The data type is scipy.sparse.csr_matrix of size (N_trn, D_tfidf), where N_trn is the number of train instances and D_tfidf is the number of features.
  • X.tst.npz: the instance TF-IDF feature matrix for the test set. The data type is scipy.sparse.csr_matrix of size (N_tst, D_tfidf), where N_tst is the number of test instances and D_tfidf is the number of features.
  • Y.trn.npz: the instance-to-label matrix for the train set. The data type is scipy.sparse.csr_matrix of size (N_trn, L), where N_trn is the number of train instances and L is the number of labels.
  • Y.tst.npz: the instance-to-label matrix for the test set. The data type is scipy.sparse.csr_matrix of size (N_tst, L), where N_tst is the number of test instances and L is the number of labels.
  • train_raw_texts.txt: The raw text of the train set.
  • test_raw_texts.txt: The raw text of the test set.
  • label_map.txt: the label's text description.
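If you are preparing these files for your own corpus, the sketch below shows one way to produce them with scipy and scikit-learn; the toy inputs and TF-IDF settings are illustrative assumptions, not the preprocessing used in the paper:

import scipy.sparse as smat
from sklearn.feature_extraction.text import TfidfVectorizer

# toy inputs (replace with your own corpus)
trn_texts   = ["first training document", "second training document"]
tst_texts   = ["a test document"]
trn_labels  = [[0, 2], [1]]          # label ids per training instance
tst_labels  = [[2]]
label_texts = ["label_a", "label_b", "label_c"]

vectorizer = TfidfVectorizer()       # illustrative settings; tune to your corpus
X_trn = vectorizer.fit_transform(trn_texts).tocsr()
X_tst = vectorizer.transform(tst_texts).tocsr()

def to_csr(label_lists, n_rows, n_labels):
    # build a binary instance-to-label matrix from lists of label ids
    rows = [i for i, labs in enumerate(label_lists) for _ in labs]
    cols = [l for labs in label_lists for l in labs]
    return smat.csr_matrix(([1.0] * len(cols), (rows, cols)), shape=(n_rows, n_labels))

smat.save_npz("X.trn.npz", X_trn)
smat.save_npz("X.tst.npz", X_tst)
smat.save_npz("Y.trn.npz", to_csr(trn_labels, len(trn_texts), len(label_texts)))
smat.save_npz("Y.tst.npz", to_csr(tst_labels, len(tst_texts), len(label_texts)))
for path, lines in [("train_raw_texts.txt", trn_texts),
                    ("test_raw_texts.txt", tst_texts),
                    ("label_map.txt", label_texts)]:
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")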

Given those input files, the pipeline can be divided into three stages: Indexer, Matcher, and Ranker.

Indexer

In stage 1, we will do the following

  • (1) construct label embedding
  • (2) perform hierarchical 2-means and output the instance-to-cluster assignment matrix
  • (3) preprocess the input and output for training Transformer models.

TL;DR: we combine and summarize (1), (2), and (3) into two scripts: run_preprocess_label.sh and run_preprocess_feat.sh. See the more detailed explanation below.

(1) To construct label embedding,

OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
mkdir -p ${PROC_DATA_DIR}
python -m xbert.preprocess \
    --do_label_embedding \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -l ${LABEL_EMB} \
    -x ${LABEL_EMB_INST_PATH}
  • DATA_DIR: ./datasets/Eurlex-4K
  • PROC_DATA_DIR: ./save_models/Eurlex-4K/proc_data
  • LABEL_EMB: pifa-tfidf (you can also try text-emb or pifa-neural if you have fine-tuned instance embeddings)
  • LABEL_EMB_INST_PATH: ./datasets/Eurlex-4K/X.trn.npz

This should yield L.${LABEL_EMB}.npz in the PROC_DATA_DIR.
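For intuition, pifa-tfidf builds each label's embedding by aggregating the TF-IDF features of that label's positive training instances (PIFA) and L2-normalizing the result; a rough sketch of that construction, not the exact xbert.preprocess implementation:

import scipy.sparse as smat
from sklearn.preprocessing import normalize

X_trn = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")   # (N_trn, D_tfidf)
Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")   # (N_trn, L)

# PIFA: each label is the L2-normalized sum of its positive instances' features
L_pifa = normalize(Y_trn.T.dot(X_trn), norm="l2", axis=1)  # (L, D_tfidf)
smat.save_npz("L.pifa-tfidf.npz", smat.csr_matrix(L_pifa))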

(2) To perform hierarchical 2-means,

SEED_LIST=( 0 1 2 )
for SEED in "${SEED_LIST[@]}"; do
    LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
    INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
    python -u -m xbert.indexer \
        -i ${PROC_DATA_DIR}/L.${LABEL_EMB}.npz \
        -o ${INDEXER_DIR} --seed ${SEED}
done

This should yield code.npz in the INDEXER_DIR.
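Conceptually, the indexer recursively bisects the label embeddings with 2-means until the clusters are small enough, and code.npz records the resulting label-to-cluster assignment. A simplified illustration of that recursion, using scikit-learn KMeans on toy dense embeddings (not the xbert.indexer implementation):

import numpy as np
from sklearn.cluster import KMeans

def hierarchical_2means(emb, max_leaf_size=100, seed=0):
    """Recursively bisect label embeddings; return a cluster id per label."""
    clusters = []
    def split(idx, depth):
        if len(idx) <= max_leaf_size:
            clusters.append(idx)
            return
        km = KMeans(n_clusters=2, random_state=seed + depth).fit(emb[idx])
        for side in (0, 1):
            split(idx[km.labels_ == side], depth + 1)
    split(np.arange(emb.shape[0]), 0)
    assign = np.empty(emb.shape[0], dtype=int)
    for cid, idx in enumerate(clusters):
        assign[idx] = cid
    return assign

# toy usage: 500 random label embeddings of dimension 16
assign = hierarchical_2means(np.random.RandomState(0).rand(500, 16))
print(assign.shape, assign.max() + 1)   # number of labels, number of clusters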

(3) To preprocess input and output for Transformer models,

SEED=0
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.preprocess \
    --do_proc_label \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -l ${LABEL_EMB_NAME} \
    -c ${INDEXER_DIR}/code.npz

This should yield the instance-to-cluster matrices C.trn.npz and C.tst.npz in the PROC_DATA_DIR.
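These matrices are essentially the instance-to-label matrices projected through the label-to-cluster assignment; roughly, and assuming code.npz is a scipy sparse label-to-cluster matrix of shape (L, K):

import numpy as np
import scipy.sparse as smat

Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")                          # (N_trn, L)
code  = smat.load_npz("save_models/Eurlex-4K/pifa-tfidf-s0/indexer/code.npz")    # (L, K)

# an instance belongs to a cluster if any of its labels falls in that cluster
C_trn = (Y_trn.dot(code) > 0).astype(np.float32)                                 # (N_trn, K)

The second xbert.preprocess call below (--do_proc_feat) then converts the raw texts into model inputs: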

OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
python -u -m xbert.preprocess \
    --do_proc_feat \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -m ${MODEL_TYPE} \
    -n ${MODEL_NAME} \
    --max_xseq_len ${MAX_XSEQ_LEN} \
    |& tee ${PROC_DATA_DIR}/log.${MODEL_TYPE}.${MAX_XSEQ_LEN}.txt
  • MODEL_TYPE: bert (or roberta, xlnet)
  • MODEL_NAME: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
  • MAX_XSEQ_LEN: maximum number of tokens; we set it to 128

This should yield X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl and X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl in the PROC_DATA_DIR.
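Under the hood, this step tokenizes the raw texts with the chosen pretrained tokenizer and truncates/pads them to MAX_XSEQ_LEN. A minimal illustration with the HuggingFace transformers library (the exact on-disk format produced by xbert.preprocess may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased-whole-word-masking")
with open("./datasets/Eurlex-4K/train_raw_texts.txt", encoding="utf-8") as f:
    texts = f.read().splitlines()

enc = tokenizer(texts[:4], max_length=128, truncation=True,
                padding="max_length", return_tensors="pt")
print(enc["input_ids"].shape)   # torch.Size([4, 128])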

Matcher

In stage 2, we will do the following

  • (1) train deep Transformer models to map instances to the induced clusters
  • (2) output the predicted cluster scores and fine-tune instance embeddings

TL;DR: run_transformer_train.sh. See the more detailed explanation below.

(1) Assume we have 8 Nvidia V100 GPUs. To train the models,

MODEL_DIR=${OUTPUT_DIR}/${INDEXER_NAME}/matcher/${MODEL_NAME}
mkdir -p ${MODEL_DIR}
python -m torch.distributed.launch \
    --nproc_per_node 8 xbert/transformer.py \
    -m ${MODEL_TYPE} -n ${MODEL_NAME} --do_train \
    -x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
    -o ${MODEL_DIR} --overwrite_output_dir \
    --per_device_train_batch_size ${PER_DEVICE_TRN_BSZ} \
    --gradient_accumulation_steps ${GRAD_ACCU_STEPS} \
    --max_steps ${MAX_STEPS} \
    --warmup_steps ${WARMUP_STEPS} \
    --learning_rate ${LEARNING_RATE} \
    --logging_steps ${LOGGING_STEPS} \
    |& tee ${MODEL_DIR}/log.txt
  • MODEL_TYPE: bert (or roberta, xlnet)
  • MODEL_NAME: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
  • PER_DEVICE_TRN_BSZ: 16 if using Nvidia V100 (or set to 8 if using Nvidia 2080Ti)
  • GRAD_ACCU_STEPS: 2 if using Nvidia V100 (or set to 4 if using Nvidia 2080Ti)
  • MAX_STEPS: set to 1,000 for Eurlex-4K; adjust depending on your dataset
  • WARMUP_STEPS: set to 100 for Eurlex-4K; adjust depending on your dataset
  • LEARNING_RATE: set to 5e-5 for Eurlex-4K; adjust depending on your dataset
  • LOGGING_STEPS: set to 100
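With these defaults, the effective global batch size is nproc_per_node × PER_DEVICE_TRN_BSZ × GRAD_ACCU_STEPS = 8 × 16 × 2 = 256; if you train on fewer GPUs or with a smaller per-device batch, you can raise GRAD_ACCU_STEPS to keep the effective batch size comparable.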

(2) To generate predictions and instance embeddings,

GPID=0,1,2,3,4,5,6,7
PER_DEVICE_VAL_BSZ=32
CUDA_VISIBLE_DEVICES=${GPID} python -u xbert/transformer.py \
    -m ${MODEL_TYPE} -n ${MODEL_NAME} \
    --do_eval -o ${MODEL_DIR} \
    -x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
    -x_tst ${PROC_DATA_DIR}/X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_tst ${PROC_DATA_DIR}/C.tst.${INDEXER_NAME}.npz \
    --per_device_eval_batch_size ${PER_DEVICE_VAL_BSZ}

This should yield the following output in the MODEL_DIR (a quick sanity check follows the list):

  • C_trn_pred.npz and C_tst_pred.npz: model-predicted cluster scores
  • trn_embeddings.npy and tst_embeddings.npy: fine-tuned instance embeddings
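A quick shape check on these artifacts before moving to the ranker (assuming the cluster scores are saved as scipy sparse matrices and the embeddings as numpy arrays):

import numpy as np
import scipy.sparse as smat

MODEL_DIR = "save_models/Eurlex-4K/pifa-tfidf-s0/matcher/bert-large-cased-whole-word-masking"
C_tst_pred = smat.load_npz(f"{MODEL_DIR}/C_tst_pred.npz")   # (N_tst, K) predicted cluster scores
tst_emb    = np.load(f"{MODEL_DIR}/tst_embeddings.npy")     # (N_tst, hidden_dim) fine-tuned embeddings
assert C_tst_pred.shape[0] == tst_emb.shape[0]
print(C_tst_pred.shape, tst_emb.shape)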

Ranker

In stage 3, we will do the following

  • (1) train linear rankers to map instances and predicted cluster scores to label scores
  • (2) output top-k predicted labels

TL;DR: run_transformer_predict.sh. See the more detailed explanation below.

(1) To train linear rankers,

LABEL_NAME=pifa-tfidf-s0
MODEL_NAME=bert-large-cased-whole-word-masking
OUTPUT_DIR=save_models/${DATASET}/${LABEL_NAME}
INDEXER_DIR=${OUTPUT_DIR}/indexer
MATCHER_DIR=${OUTPUT_DIR}/matcher/${MODEL_NAME}
RANKER_DIR=${OUTPUT_DIR}/ranker/${MODEL_NAME}
mkdir -p ${RANKER_DIR}
python -m xbert.ranker train \
    -x1 ${DATA_DIR}/X.trn.npz \
    -x2 ${MATCHER_DIR}/trn_embeddings.npy \
    -y ${DATA_DIR}/Y.trn.npz \
    -z ${MATCHER_DIR}/C_trn_pred.npz \
    -c ${INDEXER_DIR}/code.npz \
    -o ${RANKER_DIR} -t 0.01 \
    -f 0 --mode ranker

(2) To predict the final top-k labels,

PRED_NPZ_PATH=${RANKER_DIR}/tst.pred.npz
python -m xbert.ranker predict \
    -m ${RANKER_DIR} -o ${PRED_NPZ_PATH} \
    -x1 ${DATA_DIR}/X.tst.npz \
    -x2 ${MATCHER_DIR}/tst_embeddings.npy \
    -y ${DATA_DIR}/Y.tst.npz \
    -z ${MATCHER_DIR}/C_tst_pred.npz \
    -f 0 -t noop

This should yield the predicted top-k labels in tst.pred.npz, at the path specified by PRED_NPZ_PATH.
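Precision@k can then be computed directly from tst.pred.npz and the ground truth Y.tst.npz; a minimal sketch, assuming both are scipy sparse matrices of shape (N_tst, L) and higher scores mean more confident labels:

import numpy as np
import scipy.sparse as smat

Y_true = smat.load_npz("./datasets/Eurlex-4K/Y.tst.npz").tocsr()
Y_pred = smat.load_npz("save_models/Eurlex-4K/pifa-tfidf-s0/ranker/"
                       "bert-large-cased-whole-word-masking/tst.pred.npz").tocsr()

def precision_at_k(Y_true, Y_pred, k=5):
    hits = 0.0
    for i in range(Y_pred.shape[0]):
        row = Y_pred.getrow(i)
        topk = row.indices[np.argsort(-row.data)[:k]]   # indices of the k highest-scoring labels
        hits += Y_true[i, topk].sum()
    return hits / (k * Y_pred.shape[0])

for k in (1, 3, 5):
    print(f"P@{k} = {precision_at_k(Y_true, Y_pred, k):.4f}")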

Acknowledgements

Some portions of this repo are borrowed from the following repos:
