nyu-mll / msgs

License: CC-BY-4.0
This is the repository for a paper on testing inductive bias with scaled-down RoBERTa models.

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations

The repository contains data and links to model implementations and training/test code for the paper Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations. The paper investigates how increases in pretraining data alter the inductive biases of RoBERTa when it generalizes on downstream tasks. We pretrain models on 4 successively larger datasets, then test them on a synthetic dataset, the Mixed Signals Generalization Set (MSGS).

Sections

Pretraining RoBERTa

We pretrain RoBERTa on smaller datasets (1M, 10M, 100M, and 1B tokens). The pretraining data reproduces that of BERT: we combine English Wikipedia and a reproduction of BookCorpus built from Smashwords texts, in a ratio of approximately 3:1 (a rough subsampling sketch appears after the table below). For each pretraining data size we release the 3 models with the lowest validation perplexities out of 25 runs (10 runs in the case of 1B tokens):

model                   data  model size  max steps  batch size  val. ppl.
roberta-base-1B-1       1B    BASE        100K       512         3.93
roberta-base-1B-2       1B    BASE        31K        1024        4.25
roberta-base-1B-3       1B    BASE        31K        4096        3.84
roberta-base-100M-1     100M  BASE        100K       512         4.99
roberta-base-100M-2     100M  BASE        31K        1024        4.61
roberta-base-100M-3     100M  BASE        31K        512         5.02
roberta-base-10M-1      10M   BASE        10K        1024        11.31
roberta-base-10M-2      10M   BASE        10K        512         10.78
roberta-base-10M-3      10M   BASE        31K        512         11.58
roberta-med-small-1M-1  1M    MED-SMALL   100K       512         153.38
roberta-med-small-1M-2  1M    MED-SMALL   10K        512         134.18
roberta-med-small-1M-3  1M    MED-SMALL   31K        512         139.39
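
Below is a rough, hypothetical sketch of the kind of subsampling used to build such a corpus mix: whole lines are drawn at random from each source until a token budget is reached, at roughly 3 parts Wikipedia to 1 part BookCorpus. The file names and the whitespace token count are placeholders; this is not the repo's actual preprocessing pipeline.

import random

def sample_lines(path, token_budget, seed=0):
    """Randomly keep whole lines from `path` until roughly `token_budget` tokens are collected."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()              # fine for a sketch; stream for very large files
    random.Random(seed).shuffle(lines)
    kept, total = [], 0
    for line in lines:
        n = len(line.split())              # crude whitespace token count
        if total + n > token_budget:
            break
        kept.append(line)
        total += n
    return kept

TARGET = 10_000_000                        # e.g. the 10M-token setting
corpus = (sample_lines("wikipedia.txt", int(0.75 * TARGET)) +     # ~3 parts Wikipedia
          sample_lines("bookcorpus.txt", int(0.25 * TARGET)))     # ~1 part BookCorpus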

Details on model sizes (see the paper for a full discussion of how we tune this hyperparameter):

Model size  L   AH  HS   FFN   P
BASE        12  12  768  3072  125M
MED-SMALL   6   8   512  2048  45M

(L = number of layers; AH = number of attention heads; HS = hidden size; FFN = feedforward network dimension; P = number of parameters.)
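
As a rough sanity check on the parameter counts above, the estimate below assumes the standard RoBERTa layout with the usual vocabulary size (50,265 BPE types) and 514 position embeddings; the embedding LayerNorm and the tied LM head are ignored, so treat the totals as approximate rather than exact.

def approx_params(layers, hidden, ffn, vocab=50265, positions=514):
    embeddings = (vocab + positions) * hidden
    attention = 4 * (hidden * hidden + hidden)        # Q, K, V, output projections (+ biases)
    feedforward = 2 * hidden * ffn + ffn + hidden     # two linear layers (+ biases)
    per_layer = attention + feedforward + 4 * hidden  # + two LayerNorms
    return embeddings + layers * per_layer

print(f"BASE:      ~{approx_params(12, 768, 3072) / 1e6:.0f}M")   # ~124M (reported: 125M)
print(f"MED-SMALL: ~{approx_params(6, 512, 2048) / 1e6:.0f}M")    # ~45M  (reported: 45M)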

For other hyperparameters, we select:

  • Peak learning rate: 5e-4
  • Warmup steps: 6% of the maximum number of training steps
  • Dropout: 0.1
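
Since warmup is specified as a percentage, the concrete WARMUP_UPDATES value changes with the maximum number of steps. A quick calculation for the step counts used in the table above:

# Warmup covers 6% of the maximum number of training steps.
for max_steps in (10_000, 31_000, 100_000):
    print(f"max steps {max_steps:>7,} -> warmup steps {max_steps * 6 // 100:,}")
# e.g. 100,000 max steps -> 6,000 warmup steps (WARMUP_UPDATES in the script below)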

Our pretraining code is based on Fairseq. As an example, to reproduce the pretraining of roberta-med-small-1M-1, use the following commands:

export PYTHONPATH=./fairseq   # make the fairseq package importable

TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
DATA_DIR=PATH/TO/YOUR/DATA/

TOTAL_UPDATES=100000    # Total number of training steps
WARMUP_UPDATES=6000     # Warmup the learning rate over this many updates
PEAK_LR=0.0005        # Peak learning rate, adjust as needed
UPDATE_FREQ=8          # Increase the batch size 8x
SAVE_DIR=miniberta_1M_reproduce_checkpoints

python fairseq/fairseq_cli/train.py --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_med_small --sample-break-mode complete \
    --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR \
    --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --save-dir $SAVE_DIR \
    --skip-invalid-size-inputs-valid-test \
    --patience 100 \
    --no-epoch-checkpoints

The commands above are suitable if you are running the job with 4 GPUs. If you are using more or fewer GPUs, make sure that UPDATE_FREQ × (number of GPUs) = 32, so that the effective batch size stays at 16 × 32 = 512 sequences.
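
To make that constraint concrete: the effective batch size is sequences per GPU × gradient-accumulation steps (UPDATE_FREQ) × number of GPUs. A small helper (not part of the repo) for picking UPDATE_FREQ:

MAX_SENTENCES = 16            # sequences per GPU per step, as in the script above

def update_freq(num_gpus, target_batch=512):
    """Gradient-accumulation factor needed to reach the target effective batch size."""
    assert target_batch % (MAX_SENTENCES * num_gpus) == 0, "pick a batch-friendly GPU count"
    return target_batch // (MAX_SENTENCES * num_gpus)

for gpus in (1, 2, 4, 8):
    print(f"{gpus} GPU(s): UPDATE_FREQ = {update_freq(gpus)}")
# 4 GPUs -> UPDATE_FREQ = 8, i.e. UPDATE_FREQ * #GPUs = 32 and 16 * 32 = 512 sequences per update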

Analysis of Pretrained Models

We have also analyzed our pretrained models with edge probing (Tenney et al., 2019). For details on the edge-probing setup, see our post on the CILVR blog: The MiniBERTas: Testing what RoBERTa learns with varying amounts of pretraining.
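
For readers unfamiliar with the technique, the sketch below shows the general shape of an edge-probing setup: a small classifier trained on span representations taken from a frozen encoder. It is only an illustration of the idea, not the configuration used in our analysis; the label set, pooling, and span indices are placeholders.

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

model_name = "nyu-mll/roberta-base-100M-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.eval()                                # keep the encoder frozen
for p in encoder.parameters():
    p.requires_grad = False

num_labels = 5                                # hypothetical probing label set
probe = nn.Sequential(
    nn.Linear(encoder.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, num_labels),
)

def span_representation(sentence, start, end):
    """Mean-pool the frozen encoder's hidden states over token positions [start, end)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden[0, start:end].mean(dim=0)

logits = probe(span_representation("The cat sat on the mat.", 1, 3))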

Testing Inductive Biases with MSGS

The MSGS dataset includes data for 29 binary classification tasks that test models' inductive biases. It contains 20 ambiguous tasks, obtained by pairing each of 4 linguistic features with each of 5 surface features, plus one unambiguous control task for each of the 9 features.

(Figure: MSGS)

More details are provided on the data page.

Model and Finetuning Tutorials

Use Pretrained Models

The RoBERTa models pretrained on smaller datasets ("MiniBERTas") are available through Hugging Face.

You can load models as in the following example:

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("nyu-mll/roberta-base-100M-1")
model = AutoModel.from_pretrained("nyu-mll/roberta-base-100M-1")
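
Continuing from the snippet above, a minimal usage example with the standard Transformers API: tokenize a sentence and extract its contextual embeddings.

import torch

inputs = tokenizer("RoBERTa prefers linguistic generalizations.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)   # (1, sequence_length, 768) for the BASE models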

Finetuning on MSGS

You can use our fork of transformers and run ./examples/run_msgs.py on the msgs branch. An example is presented below:

(Note: the data page does not currently include training sets that contain inoculation data. You can create these inoculated datasets yourself, or access this page to get the original inoculated data from the paper.)
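
If you do want to build inoculated training sets yourself, one simple approach (a hedged sketch, not the script used for the paper) is to swap a small fraction of the ambiguous training examples for disambiguating ones, matching the inoculation rates used in the paper. The variable names here are hypothetical.

import random

def inoculate(ambiguous_examples, disambiguating_examples, rate, seed=0):
    """Swap a `rate` fraction of the ambiguous training data for disambiguating examples."""
    rng = random.Random(seed)
    n_swap = round(rate * len(ambiguous_examples))
    kept = rng.sample(ambiguous_examples, len(ambiguous_examples) - n_swap)
    added = rng.sample(disambiguating_examples, n_swap)
    mixed = kept + added
    rng.shuffle(mixed)
    return mixed

# e.g. the "003" setting (0.3% inoculation):
# train_003 = inoculate(ambiguous_train, control_train, rate=0.003)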

python ./examples/run_msgs.py \
    --model_type roberta \
    --model_name_or_path #path/to/your/model \
    --task_name #task_name \
    --do_train \
    --do_eval \
    --data_dir #path/to/data \
    --max_seq_length #max_seq_len \
    --per_gpu_eval_batch_size #batch_size \
    --per_gpu_train_batch_size #batch_size \
    --learning_rate #lr \
    --num_train_epochs #num_epochs  \
    --weight_decay #weight_decay \
    --warmup_steps #warmup_steps \
    --logging_steps #logging_steps \
    --save_steps #save_steps \
    --output_dir #path/to/save/finetuned/models

define model_name

  • Check out our model page. You can either use the model names listed there (as in the loading example above) or download all the files and pass the local directory instead.

define task_name

  • control task: [feature]_control.
  • mixed binary classification task: [linguistic_feature]_[surface_feature]_[inoculate%]

[inoculate%]: choose one of namb, 001, 003, and 01. namb means no inoculation (fully ambiguous training data); 001, 003, and 01 correspond to the increasing inoculation rates used in the paper (0.1%, 0.3%, and 1%).

[linguistic_feature] & [surface_feature]: replace the intended feature types with their corresponding names as listed below:

feature type                    corresponding name
Absolute position (surface)     absolute_token_position
Length (surface)                length
Lexical content (surface)       lexical_content_the
Relative position (surface)     relative_token_position
Orthography (surface)           title_case
Morphology (linguistic)         irregular_form
Syn. category (linguistic)      syntactic_category
Syn. construction (linguistic)  control_rasing
Syn. position (linguistic)      main_verb
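
A small helper (not part of the repo) for assembling --task_name values from the naming scheme above; the feature names are copied verbatim from the table.

LINGUISTIC = ["irregular_form", "syntactic_category", "control_rasing", "main_verb"]
SURFACE = ["absolute_token_position", "length", "lexical_content_the",
           "relative_token_position", "title_case"]

def control_task(feature):
    return f"{feature}_control"

def mixed_task(linguistic, surface, inoculation="namb"):
    assert linguistic in LINGUISTIC and surface in SURFACE
    assert inoculation in {"namb", "001", "003", "01"}
    return f"{linguistic}_{surface}_{inoculation}"

print(control_task("main_verb"))                              # main_verb_control
print(mixed_task("main_verb", "lexical_content_the", "001"))  # main_verb_lexical_content_the_001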

Citing

@inproceedings{warstadt-etal-2020-learning,
    title = "Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)",
    author = "Warstadt, Alex and
              Zhang, Yian and
              Li, Haau-Sing and
              Liu, Haokun and
              Bowman, Samuel R.",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}