
awslabs / gap-text2sql

License: Apache-2.0
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to gap-text2sql

Clue
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+2821.69%)
Mutual labels:  nlu, pretrained-models, language-model
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (-21.69%)
Mutual labels:  text-generation, language-model
gpt-j-api
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend
Stars: ✭ 248 (+198.8%)
Mutual labels:  text-generation, language-model
Kogpt2 Finetuning
🔥 Korean GPT-2, KoGPT2 fine-tuning, trained on Korean lyrics data 🔥
Stars: ✭ 124 (+49.4%)
Mutual labels:  text-generation, language-model
Electra
Pre-trained Chinese ELECTRA model, based on adversarial learning
Stars: ✭ 132 (+59.04%)
Mutual labels:  pretrained-models, language-model
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+2277.11%)
Mutual labels:  pretrained-models, language-model
Gpt2 Ml
GPT2 for Multiple Languages, including pretrained models; multilingual GPT-2 support with a 1.5B-parameter Chinese pretrained model
Stars: ✭ 1,066 (+1184.34%)
Mutual labels:  text-generation, pretrained-models
PCPM
Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice.
Stars: ✭ 21 (-74.7%)
Mutual labels:  pretrained-models, language-model
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+144.58%)
Mutual labels:  text-generation, language-model
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+6867.47%)
Mutual labels:  nlu, pretrained-models
Delta
DELTA is a deep learning based natural language and speech processing platform.
Stars: ✭ 1,479 (+1681.93%)
Mutual labels:  nlu, text-generation
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+67059.04%)
Mutual labels:  pretrained-models, language-model
Azureml Bert
End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
Stars: ✭ 342 (+312.05%)
Mutual labels:  pretrained-models, language-model
Optimus
Optimus: the first large-scale pre-trained VAE language model
Stars: ✭ 180 (+116.87%)
Mutual labels:  pretrained-models, language-model
open clip
An open source implementation of CLIP.
Stars: ✭ 1,534 (+1748.19%)
Mutual labels:  pretrained-models, language-model
Gpt2 French
GPT-2 French demo | Démo française de GPT-2
Stars: ✭ 47 (-43.37%)
Mutual labels:  text-generation, language-model
sede
Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data
Stars: ✭ 83 (+0%)
Mutual labels:  semantic-parsing, text2sql
Crslab
CRSLab is an open-source toolkit for building Conversational Recommender System (CRS).
Stars: ✭ 183 (+120.48%)
Mutual labels:  text-generation, pretrained-models
Awesome Pretrained Chinese Nlp Models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models
Stars: ✭ 195 (+134.94%)
Mutual labels:  nlu, pretrained-models
r2sql
🌶️ R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)
Stars: ✭ 60 (-27.71%)
Mutual labels:  semantic-parsing, text2sql

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Code and model from our AAAI 2021 paper.

Updates

[2021/02/05] Added support for running the model on your own databases and queries. Check out the notebook.

Abstract

Recently, there has been significant interest in learning contextual representations for various NLP tasks by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as the Masked Language Model (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
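
To make the idea concrete, here is a rough Python sketch (using the Hugging Face BART tokenizer) of what a joint utterance-schema input looks like: the question and a flattened schema are serialized into a single sequence for a BART-style encoder. The separators and schema ordering below are illustrative assumptions, not the exact format used by this repository's preprocessing.

# Illustrative sketch only: serialize an utterance together with its table
# schema into one sequence for a BART-style encoder. The real GAP
# preprocessing may use different separators and schema ordering.
from transformers import BartTokenizer

utterance = "How many singers do we have?"
schema = {"singer": ["singer_id", "name", "country", "age"]}

# Flatten the schema: each table name followed by its column names.
schema_text = " ".join(
    f"{table} {' '.join(columns)}" for table, columns in schema.items()
)
serialized = f"{utterance} </s> {schema_text}"

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
encoded = tokenizer(serialized, return_tensors="pt")
print(encoded["input_ids"].shape)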

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
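
An optional sanity check, assuming the commands above completed, confirms that the pinned PyTorch build is importable and the NLTK data is in place:

# Environment sanity check: verifies the PyTorch install and the NLTK
# downloads from the setup commands above.
import torch
import nltk

print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
nltk.data.find("corpora/stopwords")    # raises LookupError if the download failed
nltk.data.find("tokenizers/punkt")
print("nltk data OK")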

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider
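
Optionally, a quick look at the unpacked data (run from the rat-sql-gap directory) confirms the download; each entry in train_spider.json carries a database id, a natural-language question, and its gold SQL query.

# Peek at one Spider training example to confirm the unzip worked.
import json

with open("spider/train_spider.json") as f:
    examples = json.load(f)

print(len(examples), "training examples")
ex = examples[0]
print(ex["db_id"], "|", ex["question"], "->", ex["query"])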

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database
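
A small optional check, run from the same directory, that everything the preprocessing step expects is now in place:

# Verify the data/spider-bart layout created by the commands above.
import os

expected = ["tables.json", "train_spider.json", "train_others.json", "dev.json", "database"]
for name in expected:
    path = os.path.join("data/spider-bart", name)
    print(path, "OK" if os.path.exists(path) else "MISSING")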

Download the Stanford CoreNLP library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford CoreNLP server

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
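
To confirm the server came up before preprocessing, a minimal probe against port 8999 is enough; the annotator list here is chosen only for this check:

# Probe the CoreNLP server started above and print the tokens it returns.
import requests

props = '{"annotators": "tokenize,ssplit", "outputFormat": "json"}'
resp = requests.post(
    "http://localhost:8999/",
    params={"properties": props},
    data="Show all singers from France".encode("utf-8"),
)
resp.raise_for_status()
print([tok["word"] for tok in resp.json()["sentences"][0]["tokens"]])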

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, if you don't have awscli, you can download the gap-finetuned-checkpoint and pretrained-checkpoint directly:

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
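
A quick integrity check, assuming both downloads finished, is to load each file with torch.load; nothing about the internal key layout of the checkpoints is assumed here:

# Confirm both checkpoint files deserialize; run from the rat-sql-gap directory.
import torch

for path in [
    "pretrained_checkpoint/pytorch_model.bin",
    "logdir/bart_run_1/bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1/"
    "model_checkpoint-00041000",
]:
    ckpt = torch.load(path, map_location="cpu")
    print(path, "->", type(ckpt).__name__, "with", len(ckpt), "top-level entries")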

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

You then get the inference results and evaluation results in the following paths: ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval.
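
A small helper for peeking at the scores; it assumes the .eval file is JSON, as in the upstream rat-sql evaluation code, and falls back to dumping whatever structure is actually there:

# Print the aggregate evaluation scores if the file has the expected layout.
import json

with open("ie_dirs/bart_run_1_true_1-step41000.eval") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("top-level keys:", list(results.keys()))
    print(json.dumps(results.get("total_scores", results), indent=2)[:500])
else:
    print(json.dumps(results, indent=2)[:500])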

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].