
awslabs / gap-text2sql

License: Apache-2.0
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to gap-text2sql

Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+2821.69%)
Mutual labels:  nlu, pretrained-models, language-model
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (-21.69%)
Mutual labels:  text-generation, language-model
gpt-j-api
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend
Stars: ✭ 248 (+198.8%)
Mutual labels:  text-generation, language-model
Kogpt2 Finetuning
🔥 Korean GPT-2, KoGPT2 fine-tuning, trained on Korean lyrics data 🔥
Stars: ✭ 124 (+49.4%)
Mutual labels:  text-generation, language-model
Electra
Pre-trained Chinese ELECTRA model, based on adversarial learning
Stars: ✭ 132 (+59.04%)
Mutual labels:  pretrained-models, language-model
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+2277.11%)
Mutual labels:  pretrained-models, language-model
Gpt2 Ml
GPT2 for Multiple Languages, including pretrained models; multilingual GPT-2 support with a 1.5B-parameter Chinese pretrained model
Stars: ✭ 1,066 (+1184.34%)
Mutual labels:  text-generation, pretrained-models
PCPM
Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice.
Stars: ✭ 21 (-74.7%)
Mutual labels:  pretrained-models, language-model
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+144.58%)
Mutual labels:  text-generation, language-model
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+6867.47%)
Mutual labels:  nlu, pretrained-models
Delta
DELTA is a deep learning based natural language and speech processing platform.
Stars: ✭ 1,479 (+1681.93%)
Mutual labels:  nlu, text-generation
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+67059.04%)
Mutual labels:  pretrained-models, language-model
Azureml Bert
End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
Stars: ✭ 342 (+312.05%)
Mutual labels:  pretrained-models, language-model
Optimus
Optimus: the first large-scale pre-trained VAE language model
Stars: ✭ 180 (+116.87%)
Mutual labels:  pretrained-models, language-model
open clip
An open source implementation of CLIP.
Stars: ✭ 1,534 (+1748.19%)
Mutual labels:  pretrained-models, language-model
Gpt2 French
GPT-2 French demo | Démo française de GPT-2
Stars: ✭ 47 (-43.37%)
Mutual labels:  text-generation, language-model
sede
Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data
Stars: ✭ 83 (+0%)
Mutual labels:  semantic-parsing, text2sql
Crslab
CRSLab is an open-source toolkit for building Conversational Recommender System (CRS).
Stars: ✭ 183 (+120.48%)
Mutual labels:  text-generation, pretrained-models
Awesome Pretrained Chinese Nlp Models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models
Stars: ✭ 195 (+134.94%)
Mutual labels:  nlu, pretrained-models
r2sql
🌶️ R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)
Stars: ✭ 60 (-27.71%)
Mutual labels:  semantic-parsing, text2sql

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Code and model from our AAAI 2021 paper.

Updates

[2021/02/05] Added support for running the model on your own databases and queries. Check out the notebook.

Abstract

Recently, there has been significant interest in learning contextual representations for various NLP tasks by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as the Masked Language Model (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
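
To make the idea concrete, here is a rough Python sketch (using the Hugging Face BART tokenizer) of what a joint utterance-schema input looks like: the question and a flattened schema are serialized into a single sequence for a BART-style encoder. The separators and schema ordering below are illustrative assumptions, not the exact format used by this repository's preprocessing.

# Illustrative sketch only: serialize an utterance together with its table
# schema into one sequence for a BART-style encoder. The real GAP
# preprocessing may use different separators and schema ordering.
from transformers import BartTokenizer

utterance = "How many singers do we have?"
schema = {"singer": ["singer_id", "name", "country", "age"]}

# Flatten the schema: each table name followed by its column names.
schema_text = " ".join(
    f"{table} {' '.join(columns)}" for table, columns in schema.items()
)
serialized = f"{utterance} </s> {schema_text}"

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
encoded = tokenizer(serialized, return_tensors="pt")
print(encoded["input_ids"].shape)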

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
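
An optional sanity check, assuming the commands above completed, confirms that the pinned PyTorch build is importable and the NLTK data is in place:

# Environment sanity check: verifies the PyTorch install and the NLTK
# downloads from the setup commands above.
import torch
import nltk

print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
nltk.data.find("corpora/stopwords")    # raises LookupError if the download failed
nltk.data.find("tokenizers/punkt")
print("nltk data OK")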

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider
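
Optionally, a quick look at the unpacked data (run from the rat-sql-gap directory) confirms the download; each entry in train_spider.json carries a database id, a natural-language question, and its gold SQL query.

# Peek at one Spider training example to confirm the unzip worked.
import json

with open("spider/train_spider.json") as f:
    examples = json.load(f)

print(len(examples), "training examples")
ex = examples[0]
print(ex["db_id"], "|", ex["question"], "->", ex["query"])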

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database
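
A small optional check, run from the same directory, that everything the preprocessing step expects is now in place:

# Verify the data/spider-bart layout created by the commands above.
import os

expected = ["tables.json", "train_spider.json", "train_others.json", "dev.json", "database"]
for name in expected:
    path = os.path.join("data/spider-bart", name)
    print(path, "OK" if os.path.exists(path) else "MISSING")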

Download the Stanford CoreNLP library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford CoreNLP server

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
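
To confirm the server came up before preprocessing, a minimal probe against port 8999 is enough; the annotator list here is chosen only for this check:

# Probe the CoreNLP server started above and print the tokens it returns.
import requests

props = '{"annotators": "tokenize,ssplit", "outputFormat": "json"}'
resp = requests.post(
    "http://localhost:8999/",
    params={"properties": props},
    data="Show all singers from France".encode("utf-8"),
)
resp.raise_for_status()
print([tok["word"] for tok in resp.json()["sentences"][0]["tokens"]])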

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, if you don't have awscli, you can download the gap-finetuned-checkpoint and pretrained-checkpoint directly:

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
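
A quick integrity check, assuming both downloads finished, is to load each file with torch.load; nothing about the internal key layout of the checkpoints is assumed here:

# Confirm both checkpoint files deserialize; run from the rat-sql-gap directory.
import torch

for path in [
    "pretrained_checkpoint/pytorch_model.bin",
    "logdir/bart_run_1/bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1/"
    "model_checkpoint-00041000",
]:
    ckpt = torch.load(path, map_location="cpu")
    print(path, "->", type(ckpt).__name__, "with", len(ckpt), "top-level entries")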

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

You then get the inference results and evaluation results in the following paths: ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval.
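
A small helper for peeking at the scores; it assumes the .eval file is JSON, as in the upstream rat-sql evaluation code, and falls back to dumping whatever structure is actually there:

# Print the aggregate evaluation scores if the file has the expected layout.
import json

with open("ie_dirs/bart_run_1_true_1-step41000.eval") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("top-level keys:", list(results.keys()))
    print(json.dumps(results.get("total_scores", results), indent=2)[:500])
else:
    print(json.dumps(results, indent=2)[:500])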

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].