The RadioTalk dataset of talk radio transcripts

RadioTalk

This repository contains supplementary information for the paper "RadioTalk: a large-scale corpus of talk radio transcripts", forthcoming at Interspeech 2019.

Data location and access

The corpus as documented in the paper is available in the Amazon AWS S3 bucket radio-talk at s3://radio-talk/v1.0/ (Browse on AWS S3 console)

The entire corpus is available as one file of about 9.3 GB at s3://radio-talk/v1.0/radiotalk.json.gz, and there's also a version with one file per month under s3://radio-talk/v1.0/monthly/. Pre-trained word embeddings are also available. Any future versions will be released under other vX.Y prefixes for suitable values of X and Y.
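As a minimal sketch of how the S3 locations above map to downloadable URLs, the helper below converts an `s3://` URI into the standard virtual-hosted-style HTTPS form. The function name `s3_uri_to_https` is hypothetical, and whether the bucket allows anonymous HTTPS reads is an assumption that this README does not confirm.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri):
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL.

    Assumption: the bucket is readable without credentials. If it is not,
    an authenticated S3 client is needed instead.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket name, path is the object key with a leading slash
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


CORPUS_URI = "s3://radio-talk/v1.0/radiotalk.json.gz"
print(s3_uri_to_https(CORPUS_URI))
```

The resulting URL can be passed to any HTTP client. Given the file size of about 9.3 GB, a streaming download is advisable.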

Data description

The RadioTalk corpus is in JSONL format, with one JSON document per line. Each line represents one "snippet" of audio, which may contain multiple sentences, and is a dictionary object with the following keys:

  • content: The transcribed speech from the snippet.
  • callsign: The call letters of the station the snippet aired on.
  • city: The city the station is based in, as in FCC filings.
  • state: The state the station is based in, as in FCC filings.
  • show_name: The name of the show containing this snippet.
  • signature: The first 8 characters of the hexadecimal MD5 digest of the content field, computed after lowercasing and removing English stopwords (specifically the NLTK stopword list). It is intended to help with deduplication.
  • studio_or_telephone: A flag for whether the underlying audio came from a telephone or studio audio equipment. (The most useful feature in distinguishing these is the narrow frequency range of telephone audio.)
  • guessed_gender: The imputed speaker gender.
  • segment_start_time: The Unix timestamp of the beginning of the underlying audio.
  • segment_end_time: The Unix timestamp of the end of the underlying audio.
  • speaker_id: A diarization ID for the person speaking in the audio snippet.
  • audio_chunk_id: An ID for the audio chunk this snippet came from (each chunk may be split into multiple snippets).
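The signature field can be used for deduplication roughly as sketched below. The `deduplicate` helper only relies on the documented signature key. The `signature` function is an illustrative guess at the recipe: the exact tokenization is not documented here, `STOPWORDS` is a tiny hypothetical stand-in for the NLTK list, and the digest is truncated to 8 hexadecimal characters to match the length of the example record's signature. Do not expect it to reproduce the values stored in the corpus.

```python
import hashlib

# Hypothetical stand-in for the NLTK English stopword list
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "for", "this"})


def signature(content, stopwords=STOPWORDS, length=8):
    """Hash the lowercased, stopword-free text and keep a short hex prefix.

    Assumption: whitespace tokenization and single-space rejoining; the
    corpus authors' exact preprocessing may differ.
    """
    tokens = [t for t in content.lower().split() if t not in stopwords]
    digest = hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()
    return digest[:length]


def deduplicate(snippets):
    """Yield only the first snippet seen for each signature value."""
    seen = set()
    for snippet in snippets:
        if snippet["signature"] not in seen:
            seen.add(snippet["signature"])
            yield snippet
```

Because the signature ignores case and stopwords, two snippets that differ only in those respects collapse to the same key, which suits content such as syndicated shows airing on several stations.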

An example snippet from the corpus (originally on one line but pretty-printed here for readability):

{
    "content": "This would be used for housing programs and you talked a little bit about how the attorney",
    "callsign": "KABC",
    "city": "Los Angeles",
    "state": "CA",
    "show_name": "The Drive Home With Jillian Barberie & John Phillips",
    "signature": "afd7d2ee",
    "studio_or_telephone": "T",
    "guessed_gender": "F",
    "segment_start_time": 1540945402.6,
    "segment_end_time": 1540945408.6,
    "speaker_id": "S0",
    "audio_chunk_id": "2018-10-31/KABC/00_20_28/16"
}
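A minimal sketch of reading the corpus, assuming a local gzip-compressed JSONL file with the keys listed above. The helper names `iter_snippets` and `describe` are hypothetical. Streaming line by line avoids loading the whole file into memory.

```python
import gzip
import json
from datetime import datetime, timezone


def iter_snippets(path):
    """Yield one snippet dictionary per non-empty line of a .json.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def describe(snippet):
    """Return the UTC start time and the duration in seconds of a snippet."""
    start = datetime.fromtimestamp(snippet["segment_start_time"], tz=timezone.utc)
    duration = snippet["segment_end_time"] - snippet["segment_start_time"]
    return start, duration
```

For instance, `describe` applied to the example record above gives a start time on 2018-10-31 UTC and a duration of about six seconds.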

Pre-trained word embeddings

A word embedding model trained on the RadioTalk data, in the format produced by gensim, is also available in the bucket at s3://radio-talk/v1.0/word2vec/. The embeddings are 300-dimensional and were trained with the skip-gram with negative sampling variant of Word2Vec (see Mikolov et al., 2013). See also our evaluation of these embeddings on some standard analogy and similarity tasks.

Word embedding details

Besides doing the usual preprocessing -- conversion to lowercase, removing punctuation, etc. -- we also concatenated common phrases into single tokens, with words separated by underscores, before training the embeddings. (Specifically, the list of phrases to combine included the titles of English Wikipedia articles, a list of phrases detected from the corpus, and the names of certain political figures.) Counting these combined collocations as single terms, the model vocabulary contains 53,968 terms.
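The preprocessing described above might look roughly like the following sketch. It is not the authors' pipeline: the `PHRASES` set is a small hypothetical example, the punctuation rule is an assumption, and the greedy longest-match merge is one plausible way to join phrases with underscores.

```python
import re

# Hypothetical phrase list; the real one drew on Wikipedia titles,
# phrases detected from the corpus, and names of political figures.
PHRASES = {("new", "york"), ("new", "york", "city"), ("white", "house")}


def tokenize(text):
    """Lowercase and replace punctuation (except apostrophes) with spaces."""
    return re.sub(r"[^a-z0-9'\s]", " ", text.lower()).split()


def merge_phrases(tokens, phrases=PHRASES):
    """Join known phrases into single underscore-separated tokens.

    At each position the longest matching phrase wins.
    """
    max_len = max((len(p) for p in phrases), default=1)
    merged, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token
            merged.append(tokens[i])
            i += 1
    return merged
```

Preferring the longest match means a phrase such as "new york city" becomes one token and is not split into "new_york" followed by "city".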

For reproducibility, the gensim model object was initialized with the following non-default parameters:

  • size = 300
  • sg = 1
  • hs = 0
  • negative = 10
  • window = 8
  • min_count = 25
  • workers = 4
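As a sketch, the parameters above can be passed to gensim as keyword arguments. One caveat from gensim itself: in gensim 4.0 and later the `size` argument is named `vector_size`, so the helper below renames it when needed. The names `PAPER_PARAMS`, `word2vec_kwargs`, and `train` are hypothetical, and the training call assumes `sentences` is an iterable of token lists.

```python
# Non-default parameters as listed above (gensim 3.x argument names)
PAPER_PARAMS = {
    "size": 300,
    "sg": 1,
    "hs": 0,
    "negative": 10,
    "window": 8,
    "min_count": 25,
    "workers": 4,
}


def word2vec_kwargs(gensim_major_version):
    """Return the parameters with names suited to the installed gensim."""
    kwargs = dict(PAPER_PARAMS)
    if gensim_major_version >= 4:
        # gensim 4.0 renamed `size` to `vector_size`
        kwargs["vector_size"] = kwargs.pop("size")
    return kwargs


def train(sentences):
    """Train a skip-gram model; gensim is imported here to keep it optional."""
    import gensim
    from gensim.models import Word2Vec

    major = int(gensim.__version__.split(".")[0])
    return Word2Vec(sentences, **word2vec_kwargs(major))
```

With `workers` greater than 1, training is not bit-for-bit deterministic, so a retrained model may differ slightly from the published one.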

Kaldi model

As discussed in the paper, to transcribe radio speech we started with the JHU ASpIRE speech-to-text model and replaced its language model with one trained on the transcripts of various radio programs. Our final Kaldi model files (which can be used as drop-in replacements for the outputs of the recipe linked above) can be downloaded from s3://radio-talk/v1.0/models/radiotalk_kaldi_model_20191106.tgz. Note that the archive is 1.9 GB compressed and 4.8 GB uncompressed.

Initial station sample

The initial set of 50 radio stations for ingestion was chosen from the universe of all 1,912 talk radio stations as follows. First, we excluded certain stations from consideration:

  • stations without an online stream of their broadcasts,
  • stations in Alaska or Hawaii, and
  • the recently licensed category of "low-power FM stations".

Next, we took a random sample of 50 stations from the remaining 1,842, stratifying by four variables:

  • Radio band (AM or FM),
  • Four-way Census region (Midwest, Northeast, South, West) based on the containing state,
  • Whether there were at least 10 stations listed in that station's city, as a proxy for population density, and
  • Whether the station was in a battleground state for the 2016 presidential election. Battleground states for our purposes were NV, AZ, CO, IA, WI, MI, OH, PA, VA, NC, FL, NH.

This sample was intended to be nationally representative and to permit weighting summary estimates back to the population of radio stations along these four variables. Note that some (8 as of June 2019) of the selected stations have either ceased airing a talk format or no longer offer an online stream of their broadcasts, and are thus not included in the later parts of the corpus.
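The stratified sampling and weighting described above can be sketched as follows. This illustrates the general technique and not the authors' exact procedure: the allocation rule (proportional, with largest remainders) is an assumption, as are the field names `band`, `region`, `city`, and `state`, and the choice to count stations per city within the eligible population.

```python
import random
from collections import Counter, defaultdict

BATTLEGROUND = {"NV", "AZ", "CO", "IA", "WI", "MI", "OH", "PA", "VA", "NC", "FL", "NH"}


def stratum(station, city_counts):
    """Build the four-variable stratum key for one station."""
    return (
        station["band"],
        station["region"],
        city_counts[(station["city"], station["state"])] >= 10,
        station["state"] in BATTLEGROUND,
    )


def stratified_sample(stations, n, seed=0):
    """Draw n stations with proportional allocation across strata.

    Each sampled station gets a weight equal to the stratum population
    size divided by the stratum sample size.
    """
    city_counts = Counter((s["city"], s["state"]) for s in stations)
    strata = defaultdict(list)
    for s in stations:
        strata[stratum(s, city_counts)].append(s)

    # Proportional allocation, rounding down, then hand out the
    # remaining slots to the strata with the largest remainders.
    quotas = {k: n * len(v) / len(stations) for k, v in strata.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    rng = random.Random(seed)
    sample = []
    for k, members in strata.items():
        for s in rng.sample(members, alloc[k]):
            sample.append({**s, "weight": len(members) / alloc[k]})
    return sample
```

A weighted summary estimate then multiplies each station's statistic by its weight. Small strata can receive no sampled stations under this rule, which is one reason a real design may merge or oversample sparse cells.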

The list of these initial stations is included in the file talk_radio_sample.csv.

Demo

This interface lets you listen to a sample of radio clips restricted to the topic and U.S. state of your choosing: https://radio.cortico.ai/
