
kevinschaich / billboard

License: MIT License
🎤 Lyrics/associated NLP data for Billboard's Top 100, 1950-2015.

Programming Languages

javascript
python
CSS
HTML

Projects that are alternatives of or similar to billboard

Stocksight
Stock market analyzer and predictor using Elasticsearch, Twitter, News headlines and Python natural language processing and sentiment analysis
Stars: ✭ 1,037 (+1856.6%)
Mutual labels:  sentiment-analysis, sentiment, nltk
Sentiment-analysis-amazon-Products-Reviews
Sentiment analysis of Amazon product reviews using NLP with NLTK
Stars: ✭ 37 (-30.19%)
Mutual labels:  sentiment-analysis, nltk, sentiment-classification
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+2035.85%)
Mutual labels:  sentiment-analysis, sentiment, nltk
Senti4SD
An emotion-polarity classifier specifically trained on developers' communication channels
Stars: ✭ 41 (-22.64%)
Mutual labels:  sentiment-analysis, sentiment, sentiment-classification
wink-sentiment
Accurate and fast sentiment scoring of phrases with #hashtags, emoticons :) & emojis 🎉
Stars: ✭ 51 (-3.77%)
Mutual labels:  sentiment-analysis, sentiment, sentiment-classification
brand-sentiment-analysis
Scripts utilizing Heartex platform to build brand sentiment analysis from the news
Stars: ✭ 21 (-60.38%)
Mutual labels:  sentiment-analysis, sentiment, sentiment-classification
d3.geometer
[NOT MAINTAINED] A D3js library for drawing polytopes, angles, coordinates, geometries and more.
Stars: ✭ 18 (-66.04%)
Mutual labels:  d3, d3js
k8s-graph
Visualize your Kubernetes (k8s) cluster
Stars: ✭ 23 (-56.6%)
Mutual labels:  d3, d3js
reddit-opinion-mining
Sentiment analysis and opinion mining of Reddit data.
Stars: ✭ 15 (-71.7%)
Mutual labels:  sentiment-analysis, nltk
SentimentAnalysis
Sentiment Analysis: Deep Bi-LSTM+attention model
Stars: ✭ 32 (-39.62%)
Mutual labels:  sentiment-analysis, sentiment-classification
Dataset-Sentimen-Analisis-Bahasa-Indonesia
This repository is a collection of Indonesian-language sentiment analysis datasets. If you use the datasets in this repository for research, please cite the journal article associated with each dataset. The available datasets have been used in several studies whose results have been published…
Stars: ✭ 38 (-28.3%)
Mutual labels:  sentiment-analysis, sentiment-classification
real-time-data-viz-d3-crossfilter-websocket-tutorial
Tutorial on real-time data visualization. Python websocket server & d3.js + crossfilter.js frontend
Stars: ✭ 32 (-39.62%)
Mutual labels:  d3, d3js
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render the heatmap directly in the browser, the heatmap-rendering step is moved to offline computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap, and leafletjs loads an OpenStreetMap layer plus the heatmap layer for good interactivity. With the current Spark-based rendering, perhaps because Spark is not well suited to this kind of computation or because the algorithm is not well designed, the parallel computation is slower than single-machine computation. The Apache Spark heatmap-rendering and computation code is here: https://github.com/yuanzhaokang/ParallelizeHeatmap.git.
Stars: ✭ 13 (-75.47%)
Mutual labels:  d3, d3js
sentistrength id
Sentiment Strength Detection in Bahasa Indonesia
Stars: ✭ 32 (-39.62%)
Mutual labels:  sentiment-analysis, sentiment-classification
Simple-charts
Simple responsive charts
Stars: ✭ 15 (-71.7%)
Mutual labels:  d3, d3js
d3-gridding
Grids for rapid D3 chart mockups
Stars: ✭ 100 (+88.68%)
Mutual labels:  d3, d3js
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-54.72%)
Mutual labels:  sentiment-analysis, sentiment-classification
NTUA-slp-nlp
💻Speech and Natural Language Processing (SLP & NLP) Lab Assignments for ECE NTUA
Stars: ✭ 19 (-64.15%)
Mutual labels:  sentiment-analysis, sentiment-classification
d3-cv.js
Render your CV with some d3 goodies.
Stars: ✭ 12 (-77.36%)
Mutual labels:  d3, d3js
align covid
Coronavirus time series aligned by number of cases, not date.
Stars: ✭ 22 (-58.49%)
Mutual labels:  d3, d3js

Introduction

Our project researched and visualized how lyrics and associated data of popular songs have evolved since 1950. We grabbed the top 100 songs on Billboard for each year, and used natural language processing to analyze a variety of metrics. Users can interactively choose a year/genre range they are interested in to get a closer look at subtleties.

A joint project by Juhee Lee, Yinan Wen, and Kevin Schaich.

Data

Crawling/Analysis

Billboard Top 100 Songs

Our only initial dataset comes from Billboard's Top 100. We grabbed a CSV file of the top 100 songs for each year from 1950–2015 from Reddit's r/datasets.

Lyrics

Using Wikia Lyrics, as well as its Python counterpart on GitHub (a Heroku-hosted API), we queried each song's title/artist combination and downloaded the song's full lyrics.

Please note that this data is not intended to be a complete set: we ran into some problems along the way with slight inconsistencies between our dataset's naming schemes and the API's request structure. Lyrics for many older songs from the 1950s–60s are also less readily available online; even so, we have roughly 80–90% coverage of the songs on Billboard's list in our year range.
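For illustration, a lyric-fetching step like this can look roughly like the sketch below; the endpoint URL and response fields are hypothetical placeholders, not the actual API used.

import requests

# Hypothetical endpoint -- placeholder for the Heroku-hosted lyrics API.
LYRICS_API = "https://example-lyrics-api.herokuapp.com/lyrics"

def fetch_lyrics(artist, title):
    """Query the lyrics API for a single title/artist combination."""
    resp = requests.get(LYRICS_API, params={"artist": artist, "song": title}, timeout=10)
    if resp.status_code != 200:
        # Naming mismatches between the Billboard CSV and the API leave gaps in coverage.
        return None
    return resp.json().get("lyrics")

print(fetch_lyrics("The Beatles", "Hey Jude"))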

Genres/Tags

Using the MusicBrainz API, as well as its Python interface musicbrainzngs, we scraped each song artist's associated genre tags. These tags are numerous and extensive, so we condensed them into 15 'aggregate genres', chosen by their overall occurrence rate across all our songs, to summarize the data and keep our visualization clean. A minified sample of these aggregates can be found below:

aggregate_genres = [
    {"rock": ["pop rock", "jazz-rock", "heartland rock", ...]},
    {"alternative/indie": [...]},
    {"electronic/dance": [...]},
    {"soul": [...]},
    {"classical/soundtrack": [...]},
    {"pop": [...]},
    {"hip-hop/rnb": [...]},
    {"disco": [...]},
    {"swing": [...]},
    {"folk": [...]},
    {"country": [...]},
    {"jazz": [...]},
    {"religious": [...]},
    {"blues": [...]},
    {"reggae": [...]},
]
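As a concrete sketch of how these aggregates can be applied (assuming the musicbrainzngs package; the inverted lookup table below is illustrative rather than the exact mapping used in the pipeline):

import musicbrainzngs

musicbrainzngs.set_useragent("billboard-lyrics", "0.1", "https://github.com/kevinschaich/billboard")

# Invert the aggregate_genres list above into a tag -> aggregate-genre lookup.
tag_to_aggregate = {
    tag: aggregate
    for mapping in aggregate_genres
    for aggregate, tags in mapping.items()
    for tag in tags
}

def aggregate_genres_for(artist_name):
    """Fetch an artist's MusicBrainz tags and map them onto the 15 aggregate genres."""
    found = musicbrainzngs.search_artists(artist=artist_name, limit=1)["artist-list"]
    if not found:
        return []
    artist = musicbrainzngs.get_artist_by_id(found[0]["id"], includes=["tags"])["artist"]
    tags = [t["name"] for t in artist.get("tag-list", [])]
    return sorted({tag_to_aggregate[t] for t in tags if t in tag_to_aggregate})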

Sentiment Analysis

Using the Natural Language Toolkit (NLTK) for Python, we applied the VADER model for parsimonious rule-based sentiment analysis to each song's lyrics. Each song was run through the sentiment analyzer, which outputs an object describing its sentiment:

"sentiment": {
    "neg": [float],             # Negativity assoc. w/ lyrics. (between 0-1 inclusive, 1 being 100% negative).
    "neu": [float],             # Neutrality assoc. w/ lyrics. (between 0-1 inclusive, 1 being 100% neutral).
    "pos": [float],             # Positivity assoc. w/ lyrics. (between 0-1 inclusive, 1 being 100% positive).
    "compound": [float]
}

The pos, neg, and neu fields are the three most interesting values. Each represents the proportion of the lyrics associated with a positive, negative, or neutral connotation and sentiment, respectively. Using the positive and negative values, we can tell whether a song's lyrics lean toward "happy" or "sad" in demeanor.
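For reference, a minimal example of producing these scores with NLTK's VADER analyzer (requires a one-time download of the vader_lexicon resource):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
lyrics = "I'm so happy, oh what a beautiful day"  # stand-in for a song's full lyric text
scores = analyzer.polarity_scores(lyrics)         # dict with keys "neg", "neu", "pos", "compound"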

Readability Metrics

Using the textstat package for Python, we calculated a number of aggregate readability metrics associated with each song's lyrics:

"num_words": [int],             # Number of words in lyrics.
"num_lines": [int],             # Number of lines in lyrics.
"num_syllables": [int],         # Number of syllables in lyrics.
"difficult_words": [int],       # Number of words not on the Dale–Chall "easy" word list.
"fog_index": [float],           # Gunning-Fog readability index.
"flesch_index": [float],        # Flesch reading ease score.
"f_k_grade": [float],           # Flesch–Kincaid grade level of lyrics.

While the top few are self-explanatory, the Gunning-Fog index and Flesch–Kincaid grade level are the most powerful. Both metrics combine linguistic features such as average sentence length, word length, and complexity/number of syllables to estimate the readability of a text. They allow us to graph trends over time for specific genres, e.g. you would need to be in 2nd grade to understand the average pop song from 1972.
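A sketch of how these fields can be computed with textstat (assuming the module-level functions of recent textstat releases; the line count is derived directly from the lyric text):

import textstat

def readability_metrics(lyrics):
    """Compute readability fields for one song from its raw lyric text."""
    return {
        "num_words": textstat.lexicon_count(lyrics),
        "num_lines": len([line for line in lyrics.splitlines() if line.strip()]),
        "num_syllables": textstat.syllable_count(lyrics),
        "difficult_words": textstat.difficult_words(lyrics),
        "fog_index": textstat.gunning_fog(lyrics),
        "flesch_index": textstat.flesch_reading_ease(lyrics),
        "f_k_grade": textstat.flesch_kincaid_grade(lyrics),
    }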

Repetition

For each song, we count the number of duplicate lines that appear in the lyrics. This serves as a rough measure of repetition in the song's content: the more duplicate lines in the lyrics, the more repetitive the song.
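One straightforward way to compute such a count (whether duplicates are tallied per extra occurrence or per distinct repeated line is an implementation detail; this sketch counts extra occurrences):

from collections import Counter

def count_duplicate_lines(lyrics):
    """Rough repetition measure: how many lines are repeats of an earlier line."""
    lines = [line.strip().lower() for line in lyrics.splitlines() if line.strip()]
    counts = Counter(lines)
    return sum(n - 1 for n in counts.values() if n > 1)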

Data Summary

Our aggregate data JSON file includes all of the following metrics:

{
    "title": [string],              # Title of the song.       
    "artist": [string],             # Artist of the song.
    "year": [int],                  # Release year of the song.
    "pos": [int],                   # Position of Billboard's Top 100 for year [year].
    "lyrics": [string],             # Lyrics of the song.
    "tags": [string array],         # Genre tags associated with artist of the song.
    "sentiment": {
        "neg": [float],             # Negativity assoc. w/ lyrics. (between 0-1 inclusive, 1 being 100% negative).
        "neu": [float],             # Neutrality assoc. w/ lyrics. (between 0-1 inclusive, 1 being 100% neutral).
        "pos": [float],             # Positivity assoc. w/ lyrics. (between 0-1 inclusive, 1 being 100% positive).
        "compound": [float]
    },
    "f_k_grade": [float],           # Flesch–Kincaid grade level of lyrics.
    "flesch_index": [float],        # Flesch reading ease score.
    "fog_index": [float],           # Gunning-Fog readability index.
    "difficult_words": [int],       # Number of words not on the Dale–Chall "easy" word list.
    "num_syllables": [int],         # Number of syllables in lyrics.
    "num_words": [int],             # Number of words in lyrics.
    "num_lines": [int],             # Number of lines in lyrics.
    "num_dupes": [int]              # Number of duplicate (repetitive) lines in lyrics.
}

Data Aggregation/Filtering

Another Python script aggregates our data by year for easy real-time filtering in JavaScript. Our output JSON file reuses the song format above within the following structure:

[
    {
        "year": 1950,
        "songs": [
            {song object 1},
            {song object 2},
            ...
        ]
    },
    {
        "year": 1951,
        "songs": [
            {song object 1},
            {song object 2},
            ...
        ]
    },
    ...
]

This structure is cleaned and minimized, and the original lyrics are removed, to keep our file size under 2 MB for nearly 5,000 songs. Using underscore.js, we can apply functional programming in JavaScript to filter and sort through our data very quickly: the year-oriented JSON structure above lets us filter through all of those songs in real time on the user side, in fractions of a second.
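On the Python side, the year-aggregation step can look roughly like the sketch below (input/output filenames are illustrative, not the repository's actual paths):

import json
from itertools import groupby

with open("songs.json") as f:           # illustrative path to the per-song data described above
    songs = json.load(f)

by_year = []
for year, group in groupby(sorted(songs, key=lambda s: s["year"]), key=lambda s: s["year"]):
    # Drop the raw lyrics from each song object to keep the output small.
    cleaned = [{k: v for k, v in song.items() if k != "lyrics"} for song in group]
    by_year.append({"year": year, "songs": cleaned})

with open("years.json", "w") as f:      # minimized output consumed by the JavaScript front end
    json.dump(by_year, f, separators=(",", ":"))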

License

MIT © Kevin Schaich
