All Projects → hirupert → sede

hirupert / sede

Licence: Apache-2.0 license
Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to sede

gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (+0%)
Mutual labels:  semantic-parsing, text2sql
r2sql
🌶️ R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)
Stars: ✭ 60 (-27.71%)
Mutual labels:  semantic-parsing, text2sql
bilibili-smallvideo
🕷️用于爬取B站前top100的小视频
Stars: ✭ 133 (+60.24%)
Mutual labels:  spider
tuchong Spider
⭐ 图虫网爬虫
Stars: ✭ 16 (-80.72%)
Mutual labels:  spider
grapy
Grapy, a fast high-level web crawling framework for Python 3.3 or later base on asyncio.
Stars: ✭ 18 (-78.31%)
Mutual labels:  spider
GitHub-Trending-Crawler
Crawling GitHub Trending Pages every day
Stars: ✭ 55 (-33.73%)
Mutual labels:  spider
ZSpider
基于Electron爬虫程序
Stars: ✭ 37 (-55.42%)
Mutual labels:  spider
t5-japanese
Codes to pre-train Japanese T5 models
Stars: ✭ 39 (-53.01%)
Mutual labels:  t5
dcard-spider
A spider on Dcard. Strong and speedy.
Stars: ✭ 91 (+9.64%)
Mutual labels:  spider
weixin article spiders
A spiders' program for weixin which made by Express & cheerio
Stars: ✭ 33 (-60.24%)
Mutual labels:  spider
php-crawler
🕷️ A simple crawler (spider) writen in php just for fun, with zero dependencies
Stars: ✭ 39 (-53.01%)
Mutual labels:  spider
crawler-chrome-extensions
爬虫工程师常用的 Chrome 插件 | Chrome extensions used by crawler developer
Stars: ✭ 53 (-36.14%)
Mutual labels:  spider
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (-27.71%)
Mutual labels:  spider
article-spider
文章采集工具 Article collection tool
Stars: ✭ 130 (+56.63%)
Mutual labels:  spider
scrapy helper
Dynamic configurable crawl (动态可配置化爬虫)
Stars: ✭ 84 (+1.2%)
Mutual labels:  spider
fastT5
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.
Stars: ✭ 421 (+407.23%)
Mutual labels:  t5
TaobaoSpider
This taobao spider has been archived
Stars: ✭ 28 (-66.27%)
Mutual labels:  spider
spider-mzitu
妹子图
Stars: ✭ 13 (-84.34%)
Mutual labels:  spider
gospider
⚡ Light weight Golang spider framework | 轻量的 Golang 爬虫框架
Stars: ✭ 183 (+120.48%)
Mutual labels:  spider
gathertool
gathertool是golang脚本化开发库,目的是提高对应场景程序开发的效率;轻量级爬虫库,接口测试&压力测试库,DB操作库等。
Stars: ✭ 36 (-56.63%)
Mutual labels:  spider

SEDE

sede ci

SEDE (Stack Exchange Data Explorer) is new dataset for Text-to-SQL tasks with more than 12,000 SQL queries and their natural language description. It's based on a real usage of users from the Stack Exchange Data Explorer platform, which brings complexities and challenges never seen before in any other semantic parsing dataset like including complex nesting, dates manipulation, numeric and text manipulation, parameters, and most importantly: under-specification and hidden-assumptions.

Paper (NLP4Prog workshop at ACL2021): Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data.


sede sql

Setup Instructions

Create a new Python 3.7 virtual environment:

python3.7 -m venv .venv

Activate the virtual environment:

source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Add the project directory to python PATH:

export PYTHONPATH=/your/projects-directories/sede:$PYTHONPATH

One can run all commands by just running make command, or running them step by step by the following commands:

Run pylint:

make lint

Run black:

make black_check

Run tests (required JSQL running for this - please see "Running JSQLParser" chapter):

make unit_test

Add the virtual environment to Jupyter Notebook:

python3.7 -m ipykernel install --user --name=.venv

Now you can enter into Jupyter with the command jupyter notebook and when creating a new notebook you will need to choose the .venv environment.

Folders Navigation

  • src - source code
  • configs - contains configuration files for running experiments
  • data/sede - train/val/test sets of SEDE. Note - files with the _original suffix are the ones that we kept original as coming from SEDE without our fixes. See our paper for more details.
  • notebooks - some helper Jupyter notebooks.
  • stackexchange_schema - holds file that respresents the SEDE schema.

Running JSQLParser

Clone JSQLParser-as-a-Service project: git clone https://github.com/hirupert/jsqlparser-as-a-service.git

Enter the folder with cd jsqlparser-as-a-service

Build the JSQLParser-as-a-Service image using the following command: docker build -t jsqlparser-as-a-service .

Running the image inside a docker container in port 8079: docker run -d -p 8079:8079 jsqlparser-as-a-service

Test that the docker is running by running the following command:

curl --location --request POST 'http://localhost:8079/sqltojson' \
--header 'Content-Type: application/json' \
--data-raw '{
    "sql":"select salary from employees where salary < (select max(salary) from employees)"
}'

Training T5 model

Training SEDE:

python main_allennlp.py train configs/t5_text2sql_sede.jsonnet -s experiments/name_of_experiment --include-package src

Training Spider:

In order to run our model + Partial Components Match F1 metric on Spider dataset, one must download Spider dataset from here: https://yale-lily.github.io/spider and save it under data/spider folder inside the root project directory. After that, one can run the following command in order to train our model on Spider dataset:

python main_allennlp.py train configs/t5_text2sql_spider.jsonnet -s experiments/name_of_experiment --include-package src

Evaluation (SEDE)

Run evaluation on SEDE validation set with:

python main_allennlp.py evaluate experiments/name_of_experiment data/sede/val.jsonl --output-file experiments/name_of_experiment/val_predictions.sql --cuda-device 0 --batch-size 10 --include-package src

Run evaluation on SEDE test set with:

python main_allennlp.py evaluate experiments/name_of_experiment data/sede/test.jsonl --output-file experiments/name_of_experiment/test_predictions.sql --cuda-device 0 --batch-size 10 --include-package src

Note - In order to evaluate a trained model on Spider, one needs to replace the experiment name and the data path to: data/spider/dev.json.

Inference (SEDE)

Predict SQL queries on SEDE validation set with:

python main_allennlp.py predict experiments/name_of_experiment data/sede/val.jsonl --output-file experiments/name_of_experiment/val_predictions.sql --use-dataset-reader --predictor seq2seq2 --cuda-device 0 --batch-size 10 --include-package src

Predict SQL queries on SEDE test set with:

python main_allennlp.py predict experiments/name_of_experiment data/sede/test.jsonl --output-file experiments/name_of_experiment/val_predictions.sql --use-dataset-reader --predictor seq2seq2 --cuda-device 0 --batch-size 10 --include-package src

Note - In order to run inference with a trained model on Spider (validation set), one needs to replace the experiment name and the data path to: data/spider/dev.json.

Acknowledgements

We thank Kevin Montrose and the rest of the Stack Exchange team for providing the raw query log.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].