LexGLUE: A Benchmark Dataset for Legal Language Understanding in English ⚖️ 🏆 🧑‍🎓 👩‍⚖️

📣 🚨 Important Notice related to the EUR-LEX dataset (Fixed) 🐛 👈

There was a major bug in the Hugging Face data loader for the EUR-LEX task, which affected the label list used by the training script. In the original experiments for the reported leaderboard we used custom data loaders, and we later built and released the Hugging Face dataset and data loader without noticing this “stealthy” bug. In other words, the leaderboard results are reliable.

The 🐛 has already been fixed, so you can continue developing models seamlessly. Make sure to update the HF Datasets library and clear the cache, in case there are cached versions of the dataset:

pip install --upgrade datasets
rm -rf ~/.cache/huggingface/datasets/lex_glue
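
The same cleanup can be done from Python, which may be handy inside a notebook. This is a minimal sketch, assuming the cache lives at the default location used in the command above; `clear_cached_dataset` is a hypothetical helper name, not part of the datasets library.

```python
import shutil
from pathlib import Path
from typing import Optional


def clear_cached_dataset(name: str, cache_root: Optional[Path] = None) -> bool:
    """Delete the cached copy of a dataset; return True if something was removed.

    `cache_root` defaults to ~/.cache/huggingface/datasets, which is an
    assumption: your setup may use a different cache directory.
    """
    root = cache_root or Path.home() / ".cache" / "huggingface" / "datasets"
    target = root / name
    if target.is_dir():
        shutil.rmtree(target)  # removes every cached configuration of the dataset
        return True
    return False
```

Calling `clear_cached_dataset("lex_glue")` before the next `load_dataset` call should then trigger a fresh download.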

Thanks to @JamesLYC88 for digging up the 🐛, and sorry for the inconvenience! 🤗

Dataset Summary

Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce LexGLUE, a benchmark dataset to evaluate the performance of NLP methods in legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria largely from SuperGLUE.

We anticipate that more datasets, tasks, and languages will be added in later versions of LexGLUE. As more legal NLP datasets become available, we also plan to favor datasets checked thoroughly for validity (scores reflecting real-life performance), annotation quality, statistical power, and social bias (Bowman and Dahl, 2021).

As in GLUE and SuperGLUE (Wang et al., 2019), one of our goals is to push towards generic (or foundation) models that can cope with multiple NLP tasks, in our case legal NLP tasks, possibly with limited task-specific fine-tuning. Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legal NLP. Having these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways, discussed below, to make it easier for newcomers and generic models to address all tasks. We provide Python APIs integrated with Hugging Face (Wolf et al., 2020; Lhoest et al., 2021) to easily import all the datasets, experiment with them, and evaluate model performance.

By unifying and facilitating access to a set of law-related datasets and tasks, we hope to attract not only more NLP experts, but also more interdisciplinary researchers (e.g., law doctoral students willing to take NLP courses). More broadly, we hope LexGLUE will speed up the adoption and transparent evaluation of new legal NLP methods and approaches in the commercial sector too. Indeed, there have been many commercial press releases in the legal-tech industry, but almost no independent evaluation of the performance claimed for various machine learning and NLP-based offerings. A standard publicly available benchmark would also allay concerns of undue influence in predictive models, including the use of metadata which the relevant law expressly disregards.

If you participate, use the LexGLUE benchmark, or our experimentation library, please cite:

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. 2022. In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland.

@inproceedings{chalkidis-etal-2022-lexglue,
    title = "{L}ex{GLUE}: A Benchmark Dataset for Legal Language Understanding in {E}nglish",
    author = "Chalkidis, Ilias  and
      Jana, Abhik  and
      Hartung, Dirk  and
      Bommarito, Michael  and
      Androutsopoulos, Ion  and
      Katz, Daniel  and
      Aletras, Nikolaos",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.297",
    pages = "4310--4330",
}

Supported Tasks

Dataset	Source	Sub-domain	Task Type	Train/Dev/Test Instances	Classes
ECtHR (Task A)	Chalkidis et al. (2019)	ECHR	Multi-label classification	9,000/1,000/1,000	10+1
ECtHR (Task B)	Chalkidis et al. (2021a)	ECHR	Multi-label classification	9,000/1,000/1,000	10+1
SCOTUS	Spaeth et al. (2020)	US Law	Multi-class classification	5,000/1,400/1,400	14
EUR-LEX	Chalkidis et al. (2021b)	EU Law	Multi-label classification	55,000/5,000/5,000	100
LEDGAR	Tuggener et al. (2020)	Contracts	Multi-class classification	60,000/10,000/10,000	100
UNFAIR-ToS	Lippi et al. (2019)	Contracts	Multi-label classification	5,532/2,275/1,607	8+1
CaseHOLD	Zheng et al. (2021)	US Law	Multiple choice QA	45,000/3,900/3,900	n/a

ECtHR (Task A)

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the articles of the ECHR that were violated (if any).
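
Because a case may violate several articles or none at all, the targets are naturally represented as multi-hot vectors. The following is a minimal sketch, assuming labels arrive as lists of integer class indices over the 10 article classes listed in the task table; the function name and the sample label lists are illustrative and not taken from the dataset.

```python
from typing import List

NUM_ARTICLES = 10  # assumed number of article classes, per the task table


def to_multi_hot(label_ids: List[int], num_classes: int = NUM_ARTICLES) -> List[int]:
    """Turn a list of class indices into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for idx in label_ids:
        if not 0 <= idx < num_classes:
            raise ValueError(f"label index {idx} out of range")
        vector[idx] = 1
    return vector


# Illustrative cases: two violated articles, one violated article, and none.
for labels in ([2, 5], [0], []):
    print(labels, "->", to_multi_hot(labels))
```

A case with no violated article maps to the all-zero vector, which is what distinguishes this setup from single-label classification.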

ECtHR (Task B)

SCOTUS

The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases which have not been sufficiently well resolved by lower courts. This is a single-label multi-class classification task, where given a document (court opinion), the task is to predict the relevant issue area. The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute).

EUR-LEX

European Union (EU) legislation is published on the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The current version of EuroVoc contains more than 7k concepts referring to various activities of the EU and its Member States (e.g., economics, health-care, trade). Given a document, the task is to predict its EuroVoc labels (concepts).

LEDGAR

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

UNFAIR-ToS

The UNFAIR-ToS dataset contains 50 Terms of Service (ToS) from online platforms (e.g., YouTube, eBay, Facebook). The dataset has been annotated at the sentence level with 8 types of unfair contractual terms (sentences), meaning terms that potentially violate user rights according to European consumer law.

CaseHOLD

The CaseHOLD (Case Holdings on Legal Decisions) dataset includes multiple choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings that accompany referenced decisions relevant for the present case. The input consists of an excerpt (or prompt) from a court decision, containing a reference to a particular case, while the holding statement is masked out. The model must identify the correct (masked) holding statement from a selection of five choices.

Leaderboard

Averaged LexGLUE Scores

We report the arithmetic, harmonic, and geometric mean across tasks following Shavrina and Malykh (2021). We acknowledge that the use of scores aggregated over tasks has been criticized in general NLU benchmarks (e.g., GLUE), as models are trained with different numbers of samples, task complexity, and evaluation metrics per task. We believe that using a standard common metric (F1) across tasks and averaging with the harmonic mean alleviates this issue.
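
To make the three averages concrete, the sketch below recomputes them with Python's standard `statistics` module. It uses BERT's per-task μ-F1 scores as they appear in the medium-sized models table further down; the variable names are illustrative, and this is not necessarily how the reported numbers were produced.

```python
from statistics import geometric_mean, harmonic_mean, mean

# BERT's per-task μ-F1 scores, copied from the medium-sized models table.
bert_micro_f1 = {
    "ECtHR A": 71.2, "ECtHR B": 79.7, "SCOTUS": 68.3, "EUR-LEX": 71.4,
    "LEDGAR": 87.6, "UNFAIR-ToS": 95.6, "CaseHOLD": 70.8,
}

scores = list(bert_micro_f1.values())
arithmetic = mean(scores)
harmonic = harmonic_mean(scores)
geometric = geometric_mean(scores)

print(f"arithmetic={arithmetic:.1f} harmonic={harmonic:.1f} geometric={geometric:.1f}")
```

For positive scores the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the harmonic mean is the most sensitive of the three to a weak result on any single task.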

Averaging	Arithmetic	Harmonic	Geometric
Model	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1
BERT	77.8 / 69.5	76.7 / 68.2	77.2 / 68.8
RoBERTa	77.8 / 68.7	76.8 / 67.5	77.3 / 68.1
RoBERTa (Large)	79.4 / 70.8	78.4 / 69.1	78.9 / 70.0
DeBERTa	78.3 / 69.7	77.4 / 68.5	77.8 / 69.1
Longformer	78.5 / 70.5	77.5 / 69.5	78.0 / 70.0
BigBird	78.2 / 69.6	77.2 / 68.5	77.7 / 69.0
Legal-BERT	79.8 / 72.0	78.9 / 70.8	79.3 / 71.4
CaseLaw-BERT	79.4 / 70.9	78.5 / 69.7	78.9 / 70.3
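
The μ-F1 and m-F1 columns denote micro- and macro-averaged F1. The sketch below shows how the two differ on a toy multi-label example: micro-F1 pools the counts over all classes before scoring, while macro-F1 scores each class separately and then averages. The data and function names are made up for illustration, scoring an empty class as 0.0 is a convention chosen here, and the evaluation scripts may compute the metrics differently.

```python
from typing import Sequence, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined here as 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Sequence[Sequence[int]],
                   y_pred: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for multi-hot label matrices."""
    num_classes = len(y_true[0])
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for gold, pred in zip(y_true, y_pred):
        for c in range(num_classes):
            if gold[c] and pred[c]:
                tp[c] += 1
            elif pred[c]:
                fp[c] += 1
            elif gold[c]:
                fn[c] += 1
    micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts, then score once
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(num_classes)) / num_classes
    return micro, macro


# Toy data: class 0 is frequent and always right, class 2 is rare and missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.889 macro=0.667
```

The gap between the two numbers comes entirely from the rare class, which is why macro-F1 is the more telling score on tasks with skewed label distributions.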

Task-wise LexGLUE scores

Large-sized (👴) Models [1]

Dataset	ECtHR A	ECtHR B	SCOTUS	EUR-LEX	LEDGAR	UNFAIR-ToS	CaseHOLD
Model	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1
RoBERTa	73.8 / 67.6	79.8 / 71.6	75.5 / 66.3	67.9 / 50.3	88.6 / 83.6	95.8 / 81.6	74.4

[1] Results reported by Chalkidis et al. (2021). All large-sized transformer-based models follow the same specifications (L=24, H=1024, A=16).

Medium-sized (👨) Models [2]

Dataset	ECtHR A	ECtHR B	SCOTUS	EUR-LEX	LEDGAR	UNFAIR-ToS	CaseHOLD
Model	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1
TFIDF+SVM	62.6 / 48.9	73.0 / 63.8	74.0 / 64.4	63.4 / 47.9	87.0 / 81.4	94.7 / 75.0	22.4
BERT	71.2 / 63.6	79.7 / 73.4	68.3 / 58.3	71.4 / 57.2	87.6 / 81.8	95.6 / 81.3	70.8
RoBERTa	69.2 / 59.0	77.3 / 68.9	71.6 / 62.0	71.9 / 57.9	87.9 / 82.3	95.2 / 79.2	71.4
DeBERTa	70.0 / 60.8	78.8 / 71.0	71.1 / 62.7	72.1 / 57.4	88.2 / 83.1	95.5 / 80.3	72.6
Longformer	69.9 / 64.7	79.4 / 71.7	72.9 / 64.0	71.6 / 57.7	88.2 / 83.0	95.5 / 80.9	71.9
BigBird	70.0 / 62.9	78.8 / 70.9	72.8 / 62.0	71.5 / 56.8	87.8 / 82.6	95.7 / 81.3	70.8
Legal-BERT	70.0 / 64.0	80.4 / 74.7	76.4 / 66.5	72.1 / 57.4	88.2 / 83.0	96.0 / 83.0	75.3
CaseLaw-BERT	69.8 / 62.9	78.8 / 70.3	76.6 / 65.9	70.7 / 56.6	88.3 / 83.0	96.0 / 82.3	75.4

[2] Results reported by Chalkidis et al. (2021). All medium-sized transformer-based models follow the same specifications (L=12, H=768, A=12).

Small-sized (👶) Models [3]

Dataset	ECtHR A	ECtHR B	SCOTUS	EUR-LEX	LEDGAR	UNFAIR-ToS	CaseHOLD
Model	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1	μ-F1 / m-F1
BERT-Tiny	n/a	n/a	62.8 / 40.9	65.5 / 27.5	83.9 / 74.7	94.3 / 11.1	68.3
Mini-LM (v2)	n/a	n/a	60.8 / 45.5	62.2 / 35.6	86.7 / 79.6	93.9 / 13.2	71.3
Distil-BERT	n/a	n/a	67.0 / 55.9	66.0 / 51.5	87.5 / 81.5	97.1 / 79.4	68.6
Legal-BERT	n/a	n/a	75.6 / 68.5	73.4 / 54.4	87.8 / 81.4	97.1 / 76.3	74.7

[3] Results reported by Atreya Shankar (@atreyasha) 🤗 🥳. More details (e.g., validation scores, log files) are provided here. The small-sized models' specifications are:

BERT-Tiny (L=2, H=128, A=2) by Turc et al. (2020)
Mini-LM (v2) (L=12, H=384, A=12) by Wang et al. (2020)
Distil-BERT (L=6, H=768, A=12) by Sanh et al. (2019)
Legal-BERT (L=6, H=512, A=8) by Chalkidis et al. (2020)

Frequently Asked Questions (FAQ)

Where are the datasets?

We provide access to LexGLUE on Hugging Face Datasets (Lhoest et al., 2021) at https://huggingface.co/datasets/lex_glue.

For example, to load the SCOTUS dataset (Spaeth et al., 2020), first install the datasets Python library and then make the following call:

from datasets import load_dataset 
dataset = load_dataset("lex_glue", "scotus")

How to run experiments?

To make it even easier to reproduce the results for the models already examined, and to evaluate future models, we release our code in this repository. In folder /experiments, there are Python scripts, relying on the Hugging Face Transformers library, to run and evaluate any Transformer-based model (e.g., BERT, RoBERTa, LegalBERT, and their hierarchical variants, as well as Longformer and BigBird). We also provide bash scripts in folder /scripts to replicate the experiments for each dataset with 5 random seeds, as we did for the results reported in the original leaderboard.

Make sure that all required packages are installed:

torch>=1.9.0
transformers>=4.9.0
scikit-learn>=0.24.1
tqdm>=4.61.1
numpy>=1.20.1
datasets>=1.12.1
nltk>=3.5
scipy>=1.6.3

For example, to replicate the results for RoBERTa (Liu et al., 2019) on UNFAIR-ToS (Lippi et al., 2019), you have to configure the relevant bash script (run_unfair_tos.sh):

> nano run_unfair_tos.sh
GPU_NUMBER=1
MODEL_NAME='roberta-base'
LOWER_CASE='False'
BATCH_SIZE=8
ACCUMULATION_STEPS=1
TASK='unfair_tos'

and then run it:

> sh run_unfair_tos.sh

Note: The bash scripts use two HF arguments (--fp16, --fp16_full_eval), which only work when the machine (server or cluster) has available and correctly configured NVIDIA GPUs, and torch is correctly configured to use them.

If you don't have such resources, just delete these two arguments from the scripts to train models with standard fp32 precision. If you do, make sure to correctly install the NVIDIA CUDA drivers and to install torch so that it detects the GPUs (see https://pytorch.org/get-started/locally/ for the appropriate steps).
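
A quick way to decide whether to keep the two flags is to ask torch whether it can see a CUDA device. This is a minimal sketch: `cuda_available` and `precision_flags` are hypothetical helper names, and the check only tells you that torch detects a GPU, not that mixed precision will train well on it.

```python
from typing import List


def cuda_available() -> bool:
    """Best-effort check; returns False when torch is missing or sees no GPU."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def precision_flags(use_gpu: bool) -> List[str]:
    """Flags to keep on the training command for the chosen precision."""
    return ["--fp16", "--fp16_full_eval"] if use_gpu else []


flags = precision_flags(cuda_available())
print("extra flags:", " ".join(flags) if flags else "(none, training in fp32)")
```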

I don't have the resources to run all these Muppets. What can I do?

You can use Google Colab with GPU acceleration for free online (https://colab.research.google.com).

Set up a new notebook (https://colab.research.google.com) and git clone the project.
Navigate to Edit → Notebook Settings and select GPU from the Hardware Accelerator drop-down. You will probably be assigned an NVIDIA Tesla K80 12GB.
You will also have to decrease the batch size and increase the accumulation steps for hierarchical models.
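
When the batch size is reduced to fit in memory, the accumulation steps can be raised so that their product, the effective batch size, stays the same. The helper below is a hypothetical sketch of that arithmetic for a single GPU; it assumes BATCH_SIZE and ACCUMULATION_STEPS in the bash scripts have the usual per-step and gradient-accumulation meaning.

```python
import math


def accumulation_steps(target_batch: int, per_step_batch: int) -> int:
    """Smallest number of accumulation steps reaching at least target_batch."""
    if target_batch < 1 or per_step_batch < 1:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_batch / per_step_batch)


# Keep an effective batch size of 8 while shrinking the per-step batch.
for per_step in (8, 4, 2, 1):
    steps = accumulation_steps(8, per_step)
    print(f"BATCH_SIZE={per_step} ACCUMULATION_STEPS={steps} "
          f"effective={per_step * steps}")
```

If the per-step batch does not divide the target evenly, the helper rounds up, so the effective batch size ends up slightly larger than the target.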

More broadly, this is an interesting open problem (Efficient NLP). Please consider using lighter (smaller/faster) pre-trained models, like:

The smaller Legal-BERT of Chalkidis et al. (2020),
Smaller BERT models of Turc et al. (2020),
Mini-LM of Wang et al. (2020),

or non-transformer-based neural models, like:

LSTM-based (Hochreiter and Schmidhuber, 1997) models, like the Hierarchical Attention Network (HAN) of Yang et al. (2016),
Graph-based models, like the Graph Attention Network (GAT) of Veličković et al. (2017)

or even non-neural models, like:

Bag-of-Words (BoW) models using TF-IDF representations (e.g., SVM, Random Forest),
The eXtreme Gradient Boosting (XGBoost) of Chen and Guestrin (2016),

and report back the results. We are curious!

How to participate?

We are currently still lacking some technical infrastructure, e.g., an integrated submission environment comprising automated evaluation and an automatically updated leaderboard. We plan to develop the necessary publicly available web infrastructure to extend the public infrastructure of LexGLUE in the near future.

In the meantime, we ask participants to re-use and expand our code to submit new results, if possible, and open a new discussion (submission) in our repository (https://github.com/coastalcph/lex-glue/discussions/new?category=new-results) presenting their results, providing the auto-generated result logs and the relevant publication (or pre-print), if available, accompanied by a pull request including the code amendments that are needed to reproduce their experiments. Upon reviewing your results, we'll update the public leaderboard accordingly.

I want to re-load fine-tuned HierBERT models. How can I do this?

You can re-load fine-tuned HierBERT models following our example python script "Re-load HierBERT models".

I still have open questions...

Please post your question in the Discussions section or contact the corresponding author via e-mail.

Credits

Thanks to @JamesLYC88 and @danigoju for digging up 🐛s!
