
RicherMans / AudioCaption

License: MIT
Dataset and baseline for the first AudioCaption task


Projects that are alternatives to, or similar to, AudioCaption

Mead Baseline
Deep-Learning Model Exploration and Development for NLP
Stars: ✭ 238 (+260.61%)
Mutual labels:  baseline
postgres-baseline
DevSec PostgreSQL Baseline - InSpec Profile
Stars: ✭ 47 (-28.79%)
Mutual labels:  baseline
Speaker-Anti-Spoofing-Classifiers
Baselines and Classifiers for speaker anti-spoofing detection
Stars: ✭ 15 (-77.27%)
Mutual labels:  baseline
mujoco-benchmark
Provides a full reinforcement learning benchmark on MuJoCo environments, including DDPG, SAC, TD3, PG, A2C, and PPO
Stars: ✭ 101 (+53.03%)
Mutual labels:  baseline
baset
Testing tool for baseline strategy
Stars: ✭ 26 (-60.61%)
Mutual labels:  baseline
DCASE2019 task4
Baseline of dcase 2019 task 4
Stars: ✭ 55 (-16.67%)
Mutual labels:  baseline
Recheck Web
recheck for web apps – change comparison tool with local Golden Masters, Git-like ignore syntax and "Unbreakable Selenium" tests.
Stars: ✭ 224 (+239.39%)
Mutual labels:  baseline
efivalidate
Firmware validation of Apple T1 and prior Macs
Stars: ✭ 17 (-74.24%)
Mutual labels:  baseline
naacl2018-fever
Fact Extraction and VERification baseline published in NAACL2018
Stars: ✭ 109 (+65.15%)
Mutual labels:  baseline
Kevinpro-NLP-demo
All the NLP you need here. Personal implementations of some fun NLP demos, currently including PyTorch implementations of 13 NLP applications
Stars: ✭ 117 (+77.27%)
Mutual labels:  baseline
redmine evm
Earned Value Management plugin for Redmine
Stars: ✭ 33 (-50%)
Mutual labels:  baseline
revc
The fastest and safest EVC encoder and decoder
Stars: ✭ 75 (+13.64%)
Mutual labels:  baseline
l2rpn-baselines
L2RPN Baselines, a repository to host baselines for L2RPN competitions.
Stars: ✭ 57 (-13.64%)
Mutual labels:  baseline
GLUE-bert4keras
GLUE benchmark code based on bert4keras
Stars: ✭ 59 (-10.61%)
Mutual labels:  baseline
cis-dil-benchmark
CIS Distribution Independent Linux Benchmark - InSpec Profile
Stars: ✭ 120 (+81.82%)
Mutual labels:  baseline
Person reid baseline pytorch
PyTorch ReID: a tiny, friendly, strong PyTorch implementation of an object re-identification baseline. Tutorial 👉https://github.com/layumi/Person_reID_baseline_pytorch/tree/master/tutorial
Stars: ✭ 2,963 (+4389.39%)
Mutual labels:  baseline
freqtrade-gym
A customized gym environment for developing and comparing reinforcement learning algorithms in crypto trading.
Stars: ✭ 192 (+190.91%)
Mutual labels:  baseline
baseline
New method for creating leading on the web
Stars: ✭ 31 (-53.03%)
Mutual labels:  baseline
vcc20 baseline cyclevae
Voice Conversion Challenge 2020 CycleVAE baseline system
Stars: ✭ 123 (+86.36%)
Mutual labels:  baseline
text-classification-baseline
Pipeline for fast building text classification TF-IDF + LogReg baselines.
Stars: ✭ 55 (-16.67%)
Mutual labels:  baseline

AudioCaption : Listen and Tell

This repository provides source code for several audio captioning models, as well as several datasets.

First, check out this repository:

git clone --recurse-submodules https://www.github.com/Richermans/AudioCaption

Dataset

The two datasets, hospital and car, can be downloaded via Zenodo: DOI.

Related Papers

Here are papers related to this repository:

If you'd like to use the AudioCaption dataset, please cite:

@inproceedings{Wu2019,
  author    = {Mengyue Wu and
               Heinrich Dinkel and
               Kai Yu},
  title     = {Audio Caption: Listen and Tell},
  booktitle = {{IEEE} International Conference on Acoustics, Speech and Signal Processing,
               {ICASSP} 2019, Brighton, United Kingdom, May 12-17, 2019},
  pages     = {830--834},
  publisher = {{IEEE}},
  year      = {2019},
  url       = {https://doi.org/10.1109/ICASSP.2019.8682377},
  doi       = {10.1109/ICASSP.2019.8682377},
  timestamp = {Wed, 16 Oct 2019 14:14:52 +0200},
}

Baseline

In order to successfully run the baseline, the following packages and frameworks are required:

  1. Kaldi (mostly for data processing)
  2. A number of Python3 packages (most notably torch; see requirements.txt)

Prerequisite Installation

The code is written exclusively in Python3. To install all required packages, use the included requirements.txt: pip install -r requirements.txt does the job.

Kaldi

For this code, only the feature pipeline of Kaldi is utilized, so only the feature-related binaries need to be built:

git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream
cd kaldi && git pull
cd tools; make
cd ../src; ./configure; make -j4 featbin

Lastly, create a new environment variable for the kaldi_io.py script to function properly. Either export the variable KALDI_ROOT in your current session or add it to ~/.bashrc or ~/.profile.

export KALDI_ROOT=/PATH/TO/YOUR/KALDI

(Optional) NLP Tokenizer

This repository already provides the tokenized dataset in JSON format. However, if one wishes to tokenize differently (e.g., with some custom NLP tokenizer), we also provide a simple script to install and run the Stanford NLP tokenizer.

This dataset is labelled in Chinese. Chinese differs from most Indo-European languages in several respects, including its script. In particular, Chinese does not use an explicit word separator, as English does with a blank space; rather, it depends on the reader to split a sentence into semantically sound tokens.

However, the Stanford CoreNLP software supports tokenization of Chinese. The script scripts/prepare_dataserver.sh downloads all the necessary plugins for the CoreNLP tool to enable tokenization. Note that utils/build_vocab.py needs this server running in the background in order to work.

To download and run the CoreNLP tokenization server, simply execute:

bash scripts/prepare_dataserver.sh

This requires Java to be installed on your machine. It is recommended to run this script in the background.
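
Once the server is running, Chinese text can be tokenized over its HTTP interface. The following is a minimal sketch using the requests package; the port 9000 is the CoreNLP default and may differ from the one configured by scripts/prepare_dataserver.sh, and the example sentence is purely illustrative.

# Minimal sketch: tokenize Chinese text via a running CoreNLP server.
# Assumes the server listens on localhost:9000 (the CoreNLP default).
import json
import requests

def tokenize(text, url="http://localhost:9000"):
    properties = {"annotators": "tokenize,ssplit", "outputFormat": "json"}
    response = requests.post(url,
                             params={"properties": json.dumps(properties)},
                             data=text.encode("utf-8"))
    response.raise_for_status()
    return [token["word"]
            for sentence in response.json()["sentences"]
            for token in sentence["tokens"]]

print(tokenize("医院里有很多人在说话"))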

(Optional) BERT Pretrained Embeddings

In the accompanying paper, BERT embeddings are used to provide sequence-level supervision. The scripts in utils/bert need bert-as-service running in the background.

To use bert-as-service, you need to first install it. It is recommended that you create a new environment with TensorFlow 1.x to run the BERT server, since bert-as-service is incompatible with TensorFlow 2.x.

After successfully installing bert-as-service, download and run the BERT server:

bash scripts/prepare_bert_server.sh <path-to-server> <num-workers> zh

By default, a server based on the BERT base Chinese model runs in the background. You can switch to other models by changing the corresponding model name and path in scripts/prepare_bert_server.sh.

To extract sentence embeddings, execute utils/bert/create_sent_embedding.py; running it shows the usage.
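
For reference, querying a running bert-as-service server from Python looks roughly like the sketch below. BertClient is part of the bert-serving-client package; the default ports and the example sentences are assumptions.

# Minimal sketch: fetch sentence embeddings from a running bert-as-service
# server (requires the bert-serving-client package).
from bert_serving.client import BertClient

bc = BertClient()  # connects to localhost:5555/5556 by default
embeddings = bc.encode(["医院里很安静", "有人在说话"])  # illustrative sentences
print(embeddings.shape)  # (2, 768) for the BERT base Chinese model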

Extract Features

The Kaldi scp format requires one tab- or space-separated line per utterance with the information: FEATURENAME WAVEPATH
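
A single line might look as follows (the utterance name and path are purely illustrative):

hospital_0001 /path/to/data/hospital/wav/0001.wav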

For example, to extract features from the hospital data, assume the raw audio is placed in DATA_DIR (data/hospital/wav here) and the features are stored in FEATURE_DIR (data/hospital here). The annotation files (zh_dev.json and zh_eval.json) are also placed in FEATURE_DIR.

DATA_DIR=`pwd`/data/hospital/wav
FEATURE_DIR=`pwd`/data/hospital
PREFIX=hospital
find $DATA_DIR -type f | awk -F[./] '{print "'$PREFIX'""_"$(NF-1),$0}' > $FEATURE_DIR/wav.scp

  • Filterbank:
compute-fbank-feats --config=config/kaldi/fbank.conf scp:$FEATURE_DIR/wav.scp ark:$FEATURE_DIR/fbank.ark
python utils/copyark2hdf5.py $FEATURE_DIR/fbank.ark $FEATURE_DIR/fbank.hdf5
rm $FEATURE_DIR/fbank.ark

  • Logmelspectrogram:
python utils/extract_feat.py $FEATURE_DIR/wav.scp $FEATURE_DIR/logmel.hdf5 $FEATURE_DIR/logmel.scp mfcc -win_length 1764 -hop_length 882
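
After extraction, the resulting HDF5 file can be sanity-checked from Python. This is a minimal sketch; it assumes one HDF5 dataset per utterance, keyed by the FEATURENAME from wav.scp, so adjust the file name and keys if the actual layout differs.

# Minimal sanity check on the extracted features (assumed layout: one HDF5
# dataset per utterance, keyed by the FEATURENAME from wav.scp).
import h5py

with h5py.File("data/hospital/logmel.hdf5", "r") as store:
    for name in list(store)[:3]:
        print(name, store[name].shape)  # (num_frames, feature_dim)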

The Kaldi scp file can be further split into a development scp and an evaluation scp:

python utils/split_scp.py $FEATURE_DIR/fbank.scp $FEATURE_DIR/zh_eval.json
python utils/split_scp.py $FEATURE_DIR/logmel.scp $FEATURE_DIR/zh_eval.json

Dump vocabulary

The vocabulary should be prepared and dumped to a file for later use. You can run the following command on the tokenized hospital dataset:

python utils/build_vocab.py "['data/hospital/zh_dev.json', 'data/hospital/zh_eval.json']" data/hospital/vocab_zh.pth --pretokenized True

A vocabulary file data/hospital/vocab_zh.pth will be generated.
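
To verify the dump, the file can be loaded back with torch. A minimal sketch, assuming the repository root is on PYTHONPATH (so the pickled vocabulary class can be resolved) and that the vocabulary object implements __len__:

# Minimal sketch: inspect the dumped vocabulary file.
import torch

vocab = torch.load("data/hospital/vocab_zh.pth")
print(type(vocab).__name__)
print(len(vocab))  # vocabulary size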

Training Configurator

Training configuration is done in config/*.yaml. Here one can adjust hyperparameters, e.g., the number of hidden layers or the embedding size. You can also write your own models in models/*.py and adjust the config to use that model (e.g., encoder: MYMODEL).

Note: the runners/*.py scripts use exclusively parameters with the same names as their .yaml counterparts. All of them can be switched and changed on the fly by passing --ARG VALUE; e.g., if one wishes to switch the caption file to the English captions, pass --caption_file data/hospital/en_dev.json.
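
The override mechanism can be pictured as a plain merge of YAML defaults with command line keyword arguments. The helper below is an illustrative sketch of that idea, not the repository's actual parsing code:

# Illustrative sketch of the YAML-plus-CLI-override pattern described above.
import yaml

def parse_config(config_file, **overrides):
    with open(config_file) as f:
        config = yaml.safe_load(f)
    config.update(overrides)  # command line values win over YAML defaults
    return config

# e.g., passing --caption_file data/hospital/en_dev.json corresponds to:
config = parse_config("config/xe.yaml",
                      caption_file="data/hospital/en_dev.json")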

Training models

In order to train a model (for example using standard cross entropy loss), simply run:

python runners/run.py train config/xe.yaml

This will store the training logs and model checkpoints in OUTPUTPATH/MODEL/TIMESTAMP.

Predicting and Evaluating

Prediction and evaluation are done by running predict_evaluate (e.g., using the log-Mel spectrogram features):

export experiment_path=experiments/***
python runners/run.py predict_evaluate $experiment_path $FEATURE_DIR/logmel.hdf5 $FEATURE_DIR/logmel_eval.scp $FEATURE_DIR/zh_eval.json

Standard captioning metrics (BLEU@1-4, ROUGE-L, CIDEr, METEOR and SPICE) are included; METEOR and SPICE can only be used on English datasets.

If you just want to run inference, do not provide a caption reference file:

python runners/run.py predict_evaluate $experiment_path $FEATURE_DIR/logmel.hdf5 $FEATURE_DIR/logmel_eval.scp