
rkadlec / Ubuntu Ranking Dataset Creator

Licence: apache-2.0
A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.

Projects that are alternatives to or similar to Ubuntu Ranking Dataset Creator

Python Deepdive
Python Deep Dive Course - Accompanying Materials
Stars: ✭ 590 (-3.12%)
Mutual labels:  jupyter-notebook
Neuraltalk2
Efficient Image Captioning code in Torch, runs on GPU
Stars: ✭ 5,263 (+764.2%)
Mutual labels:  jupyter-notebook
Tutorial
Stars: ✭ 602 (-1.15%)
Mutual labels:  jupyter-notebook
Yolov3 Complete Pruning
Provides multiple pruned versions of YOLOv3 and YOLOv3-Tiny to suit different needs.
Stars: ✭ 596 (-2.13%)
Mutual labels:  jupyter-notebook
Time Series Classification And Clustering
Time series classification and clustering code written in Python.
Stars: ✭ 599 (-1.64%)
Mutual labels:  jupyter-notebook
Machine learning tutorials
Code, exercises and tutorials of my personal blog ! 📝
Stars: ✭ 601 (-1.31%)
Mutual labels:  jupyter-notebook
Telemanom
A framework for using LSTMs to detect anomalies in multivariate time series data. Includes spacecraft anomaly data and experiments from the Mars Science Laboratory and SMAP missions.
Stars: ✭ 589 (-3.28%)
Mutual labels:  jupyter-notebook
Stock Analysis Engine
Backtest 1000s of minute-by-minute trading algorithms for training AI with automated pricing data from: IEX, Tradier and FinViz. Datasets and trading performance automatically published to S3 for building AI training datasets for teaching DNNs how to trade. Runs on Kubernetes and docker-compose. >150 million trading history rows generated from +5000 algorithms. Heads up: Yahoo's Finance API was disabled on 2019-01-03 https://developer.yahoo.com/yql/
Stars: ✭ 605 (-0.66%)
Mutual labels:  jupyter-notebook
Courses
fast.ai Courses
Stars: ✭ 5,253 (+762.56%)
Mutual labels:  jupyter-notebook
Cs231n spring 2017 assignment
My implementations of cs231n 2017
Stars: ✭ 603 (-0.99%)
Mutual labels:  jupyter-notebook
Basic model scratch
Implementation of some classic Machine Learning model from scratch and benchmarking against popular ML library
Stars: ✭ 595 (-2.3%)
Mutual labels:  jupyter-notebook
Deeplearning Nlp
Introduction to Deep Learning for Natural Language Processing
Stars: ✭ 598 (-1.81%)
Mutual labels:  jupyter-notebook
Sqlitebiter
A CLI tool to convert CSV / Excel / HTML / JSON / Jupyter Notebook / LDJSON / LTSV / Markdown / SQLite / SSV / TSV / Google-Sheets to a SQLite database file.
Stars: ✭ 601 (-1.31%)
Mutual labels:  jupyter-notebook
Single Cell Tutorial
Single cell current best practices tutorial case study for the paper:Luecken and Theis, "Current best practices in single-cell RNA-seq analysis: a tutorial"
Stars: ✭ 594 (-2.46%)
Mutual labels:  jupyter-notebook
Challenges
PyBites Code Challenges
Stars: ✭ 604 (-0.82%)
Mutual labels:  jupyter-notebook
Kobert
Korean BERT pre-trained cased (KoBERT)
Stars: ✭ 591 (-2.96%)
Mutual labels:  jupyter-notebook
Deep learning cookbook
Deep Learning Cookbook
Stars: ✭ 601 (-1.31%)
Mutual labels:  jupyter-notebook
Info8010 Deep Learning
Lectures for INFO8010 - Deep Learning, ULiège
Stars: ✭ 608 (-0.16%)
Mutual labels:  jupyter-notebook
K Nearest Neighbors With Dynamic Time Warping
Python implementation of KNN and DTW classification algorithm
Stars: ✭ 604 (-0.82%)
Mutual labels:  jupyter-notebook
Fastai dev
fast.ai early development experiments
Stars: ✭ 604 (-0.82%)
Mutual labels:  jupyter-notebook

README -- Ubuntu Dialogue Corpus v2.0

We describe the files for generating the Ubuntu Dialogue Corpus, and the dataset itself.

UPDATES FROM UBUNTU CORPUS v1.0:

Version 2.0 contains several updates and bug fixes. The changes are significant enough that results on the two datasets are not equivalent and should not be compared. However, models that do well on the first dataset should transfer to the second (perhaps after a new hyperparameter search).

  • Separated the train/validation/test sets by time. The training set goes from the beginning (2004) to about April 27, 2012, the validation set goes from April 27 to August 7, 2012, and the test set goes from August 7 to December 1, 2012. This more closely mimics a real-life setting, where you train a model on past data to predict future data.
  • Changed the sampling procedure for the context length in the validation and test sets, from an inverse distribution to a uniform distribution (between 2 and the max context size). This increases the average context length, which we consider desirable since we would like to model long-term dependencies.
  • Changed the tokenization and entity replacement procedure. After complaints that the v1 scheme was too aggressive, we decided to remove both steps; it is up to each person using the dataset to come up with their own tokenization/entity replacement scheme. We plan to use the tokenization internally.
  • Added differentiation between the end of an utterance (__eou__) and the end of a turn (__eot__). In the original dataset, we concatenated all consecutive utterances by the same user into one utterance and put __EOS__ at the end. Here, we also denote where the original utterances end (with __eou__); see the sketch after this list. The terminology is now also consistent between the training and test sets (instead of mixing __EOS__ and </s>).
  • Fixed a bug that caused the distribution of false responses in the test and validation sets to differ from that of the true responses. In particular, the false responses were shorter on average (in number of words) than the true responses, which some models could have exploited.
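
To make the new delimiters concrete, here is a minimal Python sketch (not part of the release; the sample string is invented for illustration) that splits a v2 context into turns and utterances:

    # Minimal sketch: split a v2 context string into turns (__eot__) and
    # utterances (__eou__). The sample text below is invented.
    context = "hi __eou__ anyone around? __eou__ __eot__ yes, what do you need? __eou__ __eot__"

    turns = [t.strip() for t in context.split("__eot__") if t.strip()]
    for i, turn in enumerate(turns):
        utterances = [u.strip() for u in turn.split("__eou__") if u.strip()]
        print("turn %d: %s" % (i, utterances))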

UBUNTU CORPUS GENERATION FILES:

generate.sh:

DESCRIPTION:

Script that calls create_ubuntu_dataset.py. This is the script you should run to download and generate the dataset. Any parameters passed to this script are forwarded to create_ubuntu_dataset.py. Example usage: ./generate.sh -t -s -l.

create_ubuntu_dataset.py:

DESCRIPTION:

Script that generates the train, valid, and test datasets from Ubuntu Corpus one-on-one dialogs. It downloads the dialogs from the internet and then randomly samples positive and negative examples for each dataset. Copyright IBM 2015.

ARGUMENTS:

  • --data_root: directory where the one-on-one dialogs will be downloaded and extracted; the data is downloaded from cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz (default = '.')
  • --seed: seed for random number generator (default = 1234)
  • -o, --output: output file for writing to csv (default = None)
  • -t, --tokenize: tokenize the output (nltk.word_tokenize)
  • -s, --stem: stem the output (nltk.stem.SnowballStemmer) - applied only when -t flag is present
  • -l, --lemmatize: lemmatize the output (nltk.stem.WordNetLemmatizer) - applied only when -t flag is present

Note: if both -s and -l are present, the stemmer is applied before the lemmatizer.
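
For illustration, a minimal sketch of the preprocessing pipeline these flags imply, using the NLTK classes named above (the sample sentence is invented; the relevant NLTK data packages, e.g. punkt and wordnet, must be installed):

    # Minimal sketch of the -t / -s / -l pipeline with NLTK, in the
    # documented order: tokenize first, then stem, then lemmatize.
    import nltk
    from nltk.stem import SnowballStemmer, WordNetLemmatizer

    stemmer = SnowballStemmer("english")
    lemmatizer = WordNetLemmatizer()

    tokens = nltk.word_tokenize("I was installing the drivers yesterday")  # -t
    tokens = [stemmer.stem(t) for t in tokens]                             # -s
    tokens = [lemmatizer.lemmatize(t) for t in tokens]                     # -l
    print(tokens)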

Subparsers:

train: train set generator

  • -p: positive example probability, i.e. the ratio of positive examples to total examples in the training set (default = 0.5)
  • -e, --examples: number of training examples to generate. Note that slightly fewer examples than requested will be generated, because a post-processing step filters some out (default = 1000000)

valid: validation set generator

  • -n: number of distractor examples for each context (default = 9)

test: test set generator

  • -n: number of distractor examples for each context (default = 9)
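
Putting the pieces together, a hypothetical direct invocation of the generator (assuming standard argparse ordering, with the top-level flags before the subcommand) might look like:

    python create_ubuntu_dataset.py --data_root . --seed 1234 -o train.csv -t train -p 0.5 -e 1000000

In practice you would normally run ./generate.sh instead, which forwards its arguments to this script.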

meta folder: trainfiles.csv, valfiles.csv, testfiles.csv:

DESCRIPTION:

Maps the original dialogue files to the training, validation, and test sets.

UBUNTU CORPUS FILES (after generating):

train.csv:

Contains the training set. It is separated into 3 columns: the context of the conversation, the candidate response or 'utterance', and a flag or 'label' (0 or 1) denoting whether the response is a 'true response' to the context (flag = 1) or a response randomly drawn from elsewhere in the dataset (flag = 0). This triples format is described in the paper. When generated with the default settings, train.csv is 463 MB, with 1,000,000 lines (i.e. examples, corresponding to 449,071 dialogues) and a vocabulary size of 1,344,621. Note that, to generate the full dataset, you should use the --examples argument of create_ubuntu_dataset.py.
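
For orientation, a minimal pandas sketch for loading these triples (the column names are assumptions made for this example; inspect your generated file, which may or may not include a header row):

    # Minimal sketch: load the (context, utterance, label) triples with pandas.
    # Column names here are assumptions, not guaranteed by the generator.
    import pandas as pd

    train = pd.read_csv("train.csv", header=None,
                        names=["context", "utterance", "label"])
    positives = train[train["label"] == 1]
    print(len(train), "examples,", len(positives), "positive")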

valid.csv:

Contains the validation set. Each row represents one question and is separated into 11 columns: the context, the true response or 'ground truth utterance', and 9 false responses or 'distractors' that were randomly sampled from elsewhere in the dataset. Your model gets a question correct if it selects the ground truth utterance from amongst the 10 possible responses (see the recall@k sketch below). When generated with the default settings, valid.csv is 27 MB, with 19,561 lines and a vocabulary size of 115,688.

test.csv:

Contains the test set. Formatted in the same way as the validation set. When generated with the default settings, test.csv is 27 MB, with 18,921 lines and a vocabulary size of 115,623.
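
The baselines below are scored with 1-in-2 and 1-in-10 recall@k on this format: the model ranks the ground truth utterance against the sampled distractors, and a question counts as correct if the ground truth lands in the top k. A minimal, model-agnostic sketch of the metric (score_fn is a hypothetical scoring function):

    # Minimal sketch of recall@k. `rows` is a list of tuples
    # (context, ground_truth, distractor_1, ..., distractor_n), and
    # score_fn(context, response) is a hypothetical model scoring function.
    def recall_at_k(rows, score_fn, k):
        hits = 0
        for row in rows:
            context, candidates = row[0], row[1:]  # candidates[0] is the ground truth
            scores = [score_fn(context, c) for c in candidates]
            ranked = sorted(range(len(candidates)), key=lambda i: -scores[i])
            if 0 in ranked[:k]:                    # ground truth within top k?
                hits += 1
        return hits / float(len(rows))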

BASELINE RESULTS

Dual Encoder LSTM model:

1 in 2:
	recall@1: 0.868730970907
1 in 10:
	recall@1: 0.552213717862
	recall@2: 0.72099120433
	recall@5: 0.924285351827

Dual Encoder RNN model:

1 in 2:
	recall@1: 0.776539210705
1 in 10:
	recall@1: 0.379139142954
	recall@2: 0.560689786585
	recall@5: 0.836350355691

TF-IDF model:

1 in 2:
	recall@1: 0.749260042283
1 in 10:
	recall@1: 0.48810782241
	recall@2: 0.587315010571
	recall@5: 0.763054968288
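
For reference, the TF-IDF baseline scores each candidate response by the similarity of its TF-IDF vector to the context's. A minimal scikit-learn sketch of that idea (illustrative only, not the exact implementation behind the numbers above):

    # Minimal TF-IDF baseline sketch: rank candidates by cosine similarity
    # between the context's TF-IDF vector and each candidate's.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def tfidf_scores(context, candidates, vectorizer):
        vecs = vectorizer.transform([context] + candidates)
        return cosine_similarity(vecs[0], vecs[1:])[0]  # one score per candidate

    # vectorizer = TfidfVectorizer().fit(training_texts)  # fit on the training corpus first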

HYPERPARAMETERS USED

Code for the model can be found here (might not be up to date with the new dataset): https://github.com/npow/ubottu

Dual Encoder LSTM model:

act_penalty=500
batch_size=256
conv_attn=False 
corr_penalty=0.0
emb_penalty=0.001
fine_tune_M=True
fine_tune_W=False
forget_gate_bias=2.0
hidden_size=200
is_bidirectional=False
lr=0.001
lr_decay=0.95
max_seqlen=160
n_epochs=100
n_recurrent_layers=1
optimizer='adam'
penalize_activations=False
penalize_emb_drift=False
penalize_emb_norm=False
pv_ndims=100
seed=42
shuffle_batch=False
sort_by_len=False
sqr_norm_lim=1
use_pv=False
xcov_penalty=0.0
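
For context on these settings: in the dual encoder of Lowe et al., the context and the response are each encoded into vectors c and r, and the pair is scored as sigmoid(c^T M r), so fine_tune_M above controls whether the bilinear matrix M is learned. A minimal numpy sketch of the scoring step (the encodings are random placeholders, and any bias term is omitted):

    # Minimal numpy sketch of the dual encoder scoring step: sigmoid(c^T M r).
    # c and r stand in for the final encoder states of context and response.
    import numpy as np

    hidden_size = 200                     # matches hidden_size above
    rng = np.random.default_rng(42)
    c = rng.standard_normal(hidden_size)  # context encoding (placeholder)
    r = rng.standard_normal(hidden_size)  # response encoding (placeholder)
    M = rng.standard_normal((hidden_size, hidden_size))  # toggled by fine_tune_M

    score = 1.0 / (1.0 + np.exp(-(c @ M @ r)))  # probability that r answers c
    print(score)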

Dual Encoder RNN model:

act_penalty=500
batch_size=512
conv_attn=False
corr_penalty=0.0
emb_penalty=0.001
fine_tune_M=False
fine_tune_W=False
forget_gate_bias=2.0
hidden_size=100
is_bidirectional=False
lr=0.0001
lr_decay=0.95
max_seqlen=160
n_epochs=100
n_recurrent_layers=1
optimizer='adam'
penalize_activations=False
penalize_emb_drift=False
penalize_emb_norm=False
pv_ndims=100
seed=42
shuffle_batch=False
sort_by_len=False
sqr_norm_lim=1
use_pv=False
xcov_penalty=0.0