Musical Analysis of Rush Part 0: Download the Data

July 15, 2018 analysis, tutorials

Introduction

Rush is one of the most prolific progressive rock bands in history. They started out as a hard rock band in the vein of Led Zeppelin, then moved into progressive rock, then added synthesizers in the 80s, and returned to hard rock in the 90s and 00s.

This makes them a great candidate for a musical analysis! Besides their complex musical development and versatility, other reasons to analyze Rush in particular are:

They are among my favorite bands and were my first-ever concert back in 2007, so I have domain expertise.
All of their studio albums are on Spotify. This means there are no gaps. Do you hear me, Robert Fripp?
With the exception of their debut and two tracks on Fly By Night, they have one lyricist.
Unlike some other candidates (cough, Miles Davis, cough) there are no albums by other artists that we need to consider, and there is no dispute about which albums to include. Rush’s 19 studio albums are canonically and unambiguously their 19 studio albums according to every fan on the planet.
There is not much data cleanup. No deduping is necessary, and the text needs only a bare minimum of cleaning.

To Do:

In this RMarkdown document, we do the following:

Download Rush’s 19 canonical studio albums using the wonderful spotifyr package developed by Charlie Thompson
Do some basic cleanup
Export the resulting files to be analyzed

Data Prep

Spotify Developer Setup

Before you do anything, sign up to be a Spotify developer. You will be assigned a client ID and a client secret you need to query the Spotify API. Store those credentials in your .Renviron file as follows:

SPOTIFY_CLIENT_ID="{your_client_id}"
SPOTIFY_CLIENT_SECRET="{your_client_secret}"
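After restarting R, it can be worth confirming that the session actually sees those two variables. The check below is only a sketch: missing_spotify_credentials is a hypothetical helper name, not part of spotifyr, and it only tests that the variables are non-empty, not that Spotify accepts them.

```r
## Hypothetical helper (not part of spotifyr): report which of the two
## environment variables are unset or empty in the current R session.
missing_spotify_credentials <- function() {
    vars <- c("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET")
    values <- Sys.getenv(vars, unset = "")
    vars[values == ""]
}

missing_vars <- missing_spotify_credentials()
if (length(missing_vars) > 0) {
    message("Not set: ", paste(missing_vars, collapse = ", "))
} else {
    message("Both Spotify credentials are visible to R.")
}
```

If anything is reported as not set, the usual culprit is that .Renviron was edited without restarting the R session afterwards.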

Additionally, make sure you install the development version of spotifyr via the following command:

devtools::install_github('charlie86/spotifyr')

The version on CRAN is out of date and lacks many functions we will need.

Download and Cleanup

First we download the relevant data using spotifyr::get_discography(). It pulls all audio features for the artist’s discography from the Spotify API, along with lyrics from the Genius API using the geniusR package. If you’re wondering what audio features are, I will explain more in the first analysis I perform.

Because spotifyr::get_discography() takes a while to run, I wrapped it in a function that reads the cached data file if one exists, and otherwise queries Spotify and caches the result. As this is a one-off prep function that I have no intention of using anywhere else, I have made no effort to generalize it or prevent it from being used outside of its intended purpose.

Other packages I am using here:

readr - A package that reads and writes files better than base R’s equivalents. If it runs into parsing issues, it tells you which lines had the issues. Most importantly, it does not treat strings as factors!
dplyr - For tidy data manipulation. If you’ve never used dplyr before, what are you waiting for?!
lubridate - Functions for datetime manipulation
here - For working directory detection. A much better alternative to getwd.

library(readr)
library(dplyr)
library(spotifyr)  ## Development version
library(lubridate)
library(here)

source(here("lib", "vars.R"))  ## Contains paths we need for exporting

rush_studio_album_names <- c("rush", "fly by night",
"caress of steel", "2112", "a farewell to kings", "hemispheres",
"permanent waves", "moving pictures", "signals",
"grace under pressure", "power windows", "hold your fire", "presto",
"roll the bones", "counterparts", "test for echo", "vapor trails",
"snakes & arrows", "clockwork angels")

get_rush_dat <- function(input_file = ORIGINAL_DAT) {
    ## Look for the cached file because this takes a really long time to query
    
    if (file.exists(input_file)) {
        cat("Reading in file...\n")
        dat <- readRDS(input_file)
    } else {
        cat("Getting album data from Spotify...\n")
        dat <- get_discography("Rush") %>%
            ungroup()
        
        ## Cache the result so the next run skips the slow query
        saveRDS(dat, input_file)
    }

    return(dat)
}

original_rush_dat <- get_rush_dat()

## Reading in file...
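For readers who want the same read-or-compute pattern elsewhere, here is a minimal generic sketch in base R. The names cached_rds and slow_query are hypothetical and exist only for this illustration; slow_query stands in for an expensive call such as the Spotify query, and the cache lives in a temporary file rather than the project paths used above.

```r
## Hypothetical generic helper: read `path` if it exists, otherwise
## run `compute()`, save its result to `path`, and return it.
cached_rds <- function(path, compute) {
    if (file.exists(path)) {
        return(readRDS(path))
    }
    result <- compute()
    saveRDS(result, path)
    result
}

## Stand-in for a slow API call; counts how often it really runs.
calls <- 0
slow_query <- function() {
    calls <<- calls + 1
    data.frame(x = 1:3)
}

cache_path <- tempfile(fileext = ".rds")
first  <- cached_rds(cache_path, slow_query)  ## runs slow_query()
second <- cached_rds(cache_path, slow_query)  ## reads the cached file
```

Passing the expensive step in as a function means it is only evaluated when the cache file is missing.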

original_rush_dat %>%
  glimpse

## Observations: 638
## Variables: 33
## $ artist_name            <chr> "Rush", "Rush", "Rush", "Rush", "Rush", "…
## $ artist_uri             <chr> "2Hkut4rAAyrQxRdof7FVJq", "2Hkut4rAAyrQxR…
## $ album_uri              <chr> "3U6vR85uJOAT08DLnJhZhH", "3U6vR85uJOAT08…
## $ album_name             <chr> "2112 - 40 Years Closer: A Q&A With Alex …
## $ album_img              <chr> "https://i.scdn.co/image/d633919ce5ff5a9e…
## $ album_type             <chr> "album", "album", "album", "album", "albu…
## $ is_collaboration       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ album_release_date     <chr> "2016-12-16", "2016-12-16", "2016-12-16",…
## $ album_release_year     <date> 2016-12-16, 2016-12-16, 2016-12-16, 2016…
## $ album_popularity       <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
## $ track_name             <chr> "Terry Brown Intro - Commentary", "2112: …
## $ track_uri              <chr> "23DkB3Eb9vRo2PrfbdkUJR", "6rIc5dkUTYTtgg…
## $ track_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ disc_number            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ danceability           <dbl> 0.888, 0.640, 0.356, 0.718, 0.239, 0.748,…
## $ energy                 <dbl> 0.404, 0.333, 0.736, 0.441, 0.753, 0.324,…
## $ key                    <chr> "F", "A", "D", "D", "D", "B", "A", "C", "…
## $ loudness               <dbl> -13.999, -16.458, -9.257, -16.990, -9.258…
## $ mode                   <chr> "major", "major", "major", "major", "majo…
## $ speechiness            <dbl> 0.7940, 0.9580, 0.1080, 0.9540, 0.0886, 0…
## $ acousticness           <dbl> 8.04e-01, 7.35e-01, 8.13e-02, 7.18e-01, 3…
## $ instrumentalness       <dbl> 0.00e+00, 0.00e+00, 1.14e-03, 0.00e+00, 9…
## $ liveness               <dbl> 0.2370, 0.4230, 0.3130, 0.3860, 0.3760, 0…
## $ valence                <dbl> 0.8290, 0.5410, 0.2010, 0.5990, 0.5550, 0…
## $ tempo                  <dbl> 107.479, 87.368, 131.490, 122.509, 200.38…
## $ duration_ms            <dbl> 16480, 600547, 1237773, 95360, 215080, 57…
## $ time_signature         <dbl> 1, 4, 4, 3, 4, 3, 4, 4, 4, 4, 4, 4, 4, 5,…
## $ key_mode               <chr> "F major", "A major", "D major", "D major…
## $ track_popularity       <int> 0, 6, 11, 5, 12, 5, 10, 5, 8, 5, 8, 5, 9,…
## $ track_preview_url      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ track_open_spotify_url <chr> "https://open.spotify.com/track/23DkB3Eb9…
## $ track_n                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ lyrics                 <list> [NULL, NULL, NULL, NULL, NULL, NULL, NUL…

Phew! It worked. There’s even a lyrics column that contains the lyrics to every song in Rush’s discography. I have no idea why it’s NULL here and populated below.

Now that we have the data read in, let’s do some cleanup. I have gone back and changed this file a few times to reflect new cleaning that needed to be done. I will not be explaining these decisions in this document. Here’s what needs to be done:

Throw out irrelevant columns. This includes “liveness” (we’re only looking at studio albums), URLs we don’t need, and some columns where every value is the same.
Remove bonus tracks (which are often live)
The title track of 2112 has a bunch of subtitles, which makes plotting the name very difficult. I’ll remove those.
“Remastered” makes track names too long, so remove it.
Convert character string dates to Date objects
Move estimated release dates and actual release dates to the same column

rush_dat <- original_rush_dat %>% 
    filter(tolower(album_name) %in% rush_studio_album_names) %>% 
    filter(!grepl("- Live", track_name)) ## Remove bonus tracks

## Album and track info
rush_albums <- rush_dat %>%
    select(artist_name, artist_uri, starts_with("album"), track_name, track_uri, track_n, danceability:key_mode, lyrics) %>%
    select(-album_img, -album_type, -liveness) %>% ## remove redundant columns
    mutate(album_release_date = if_else(!is.na(ymd(album_release_date)), ymd(album_release_date), ymd(album_release_year))) %>% 
    select(-album_release_year) %>% 
    mutate(track_name = gsub(" - Remastered", "", track_name),
           track_name = if_else(grepl("^2112", track_name), "2112", track_name))

## Warning: 34 failed to parse.

## Warning: 34 failed to parse.
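The two warnings most likely appear because ymd(album_release_date) is evaluated twice inside if_else, and some release dates are not full year-month-day strings. That is my reading rather than something verified against the raw data. The fallback and the track-name cleanup can be tried in isolation on toy values; everything below is made up for illustration and uses base R instead of lubridate and dplyr.

```r
## Toy values only; these are not rows from the real data.
release_date   <- c("2012-06-08", "1976", "1981-02-12")
estimated_date <- as.Date(c("2012-06-08", "1976-01-01", "1981-02-12"))

## Strings that are not full year-month-day dates become NA here.
parsed_date <- as.Date(release_date, format = "%Y-%m-%d")

## Fall back to the estimated date wherever parsing failed.
failed <- is.na(parsed_date)
parsed_date[failed] <- estimated_date[failed]

## Track-name cleanup: drop the remaster suffix, collapse 2112 subtitles.
track_name <- c("Toy Song - Remastered", "2112: Toy Subtitle", "Another Toy Song")
track_name <- gsub(" - Remastered", "", track_name, fixed = TRUE)
track_name[grepl("^2112", track_name)] <- "2112"
```

Parsing once and filling in the failures afterwards would also emit the warning only once.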

rush_albums %>%
  glimpse

## Observations: 164
## Variables: 23
## $ artist_name        <chr> "Rush", "Rush", "Rush", "Rush", "Rush", "Rush…
## $ artist_uri         <chr> "2Hkut4rAAyrQxRdof7FVJq", "2Hkut4rAAyrQxRdof7…
## $ album_uri          <chr> "744i0LypfMwHHrKhzsqAx0", "744i0LypfMwHHrKhzs…
## $ album_name         <chr> "Clockwork Angels", "Clockwork Angels", "Cloc…
## $ album_release_date <date> 2012-06-08, 2012-06-08, 2012-06-08, 2012-06-…
## $ album_popularity   <int> 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 4…
## $ track_name         <chr> "Caravan", "BU2B", "Clockwork Angels", "The A…
## $ track_uri          <chr> "43l8BalXmo4y50runkgJEh", "6CiPGcWJ3YykntxFja…
## $ track_n            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, …
## $ danceability       <dbl> 0.510, 0.336, 0.422, 0.455, 0.417, 0.427, 0.5…
## $ energy             <dbl> 0.917, 0.944, 0.952, 0.905, 0.920, 0.776, 0.9…
## $ key                <chr> "A", "A", "G", "G", "E", "G", "E", "A", "E", …
## $ loudness           <dbl> -5.469, -5.414, -6.232, -6.831, -7.049, -8.03…
## $ mode               <chr> "minor", "minor", "major", "major", "minor", …
## $ speechiness        <dbl> 0.0439, 0.0877, 0.1150, 0.0531, 0.1160, 0.045…
## $ acousticness       <dbl> 5.82e-04, 5.92e-04, 6.12e-05, 2.62e-05, 3.52e…
## $ instrumentalness   <dbl> 3.80e-03, 3.14e-02, 7.33e-03, 1.54e-01, 4.24e…
## $ valence            <dbl> 0.522, 0.384, 0.150, 0.632, 0.262, 0.379, 0.5…
## $ tempo              <dbl> 126.789, 151.564, 119.919, 139.944, 140.029, …
## $ duration_ms        <dbl> 339800, 310387, 451440, 411533, 291693, 19366…
## $ time_signature     <dbl> 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ key_mode           <chr> "A minor", "A minor", "G major", "G major", "…
## $ lyrics             <list> [<tbl_df[32 x 2]>, <tbl_df[53 x 2]>, <tbl_df…

Looks like everything worked well. album_release_date is now a proper Date column and we don’t see any redundant columns. Let’s save it and move on to grabbing audio analysis info. We save as an RDS because even though feather is made for temporary caching, it can’t handle list columns.

rush_albums %>%
    saveRDS(OUTPUT_FEATURES)
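As a small illustration of why RDS is convenient here, the sketch below builds a toy data frame with a list column (standing in for lyrics) and checks that it survives a save-and-read round trip. The data and the temporary file path are made up for the example.

```r
## Toy data frame with a list column, loosely mimicking `lyrics`.
toy <- data.frame(track_name = c("Song A", "Song B"),
                  stringsAsFactors = FALSE)
toy$lyrics <- list(
    data.frame(line = 1:2, lyric = c("first line", "second line"),
               stringsAsFactors = FALSE),
    data.frame(line = 1L, lyric = "only line",
               stringsAsFactors = FALSE)
)

## Round trip through an RDS file in a temporary location.
toy_path <- tempfile(fileext = ".rds")
saveRDS(toy, toy_path)
restored <- readRDS(toy_path)

## The nested data frames come back intact.
round_trip_ok <- isTRUE(all.equal(toy, restored))
```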

Audio Analysis download

Spotify also has an Audio Analysis API, which is different from its Audio Features API. The Audio Analysis API contains in-depth features for individual segments of tracks. Once we start digging into the audio analysis features, I’ll explain more about what’s contained in them.

Like get_rush_dat, I’m wrapping the code that gets this data in a function that reads the cached file if it exists, and otherwise generates and caches it, because this code takes an inordinately long time to run. Here’s what the function does when you don’t have the audio analysis file cached:

For each track, do the following via purrr::map_df:

Get the audio analysis from the Spotify API.
Convert it to a tibble with each part of the audio analysis as a list column. Since there are seven parts that are always returned in a particular order, I just assign them labels in said order. This gives us a tidy tibble where every row corresponds to an audio analysis feature of a given track.
Add the audio analysis info back to the original data frame, and throw out any information that isn’t track and album metadata. This is to conserve space, since if we want to combine audio analysis with track-level audio features, we can easily just join to the original data using the *_uri columns.
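The shape these steps produce can be sketched offline with a stand-in for the API call. In the example below, fake_audio_analysis is a hypothetical function that returns a named list of seven parts; it is not the spotifyr function, and its contents are placeholders. The sketch also labels the parts by their names rather than by position, which is an alternative to the fixed ordering described above and assumes the returned list is named.

```r
## Hypothetical stand-in for the API call: seven named parts per track.
fake_audio_analysis <- function(track_uri) {
    part_names <- c("meta", "track", "bars", "beats",
                    "tatums", "sections", "segments")
    parts <- lapply(part_names, function(p) list(source = track_uri, part = p))
    names(parts) <- part_names
    parts
}

track_uris <- c("uri_a", "uri_b")  ## made-up identifiers
parts_by_track <- lapply(track_uris, fake_audio_analysis)

## One row per (track, part); the part itself goes in a list column.
analysis_df <- data.frame(
    track_uri    = rep(track_uris, lengths(parts_by_track)),
    content_type = unlist(lapply(parts_by_track, names)),
    stringsAsFactors = FALSE
)
analysis_df$audio_analysis <- unlist(parts_by_track,
                                     recursive = FALSE, use.names = FALSE)
```

Each track contributes seven rows, which matches the 1,148 rows (164 tracks times 7 parts) in the real output.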

If it’s your first time running this function, it will probably take at least ten minutes. To help the process along (and because it’s just really fun), I run beepr::beep once the data is done being prepared. It’s a wonderful function that can play any of a variety of sound effects when it’s finished. I personally have it set to the Super Mario Bros. End of Level sound.

library(jsonlite)
library(purrr)

## 
## Attaching package: 'purrr'

## The following object is masked from 'package:jsonlite':
## 
##     flatten

library(beepr)

get_audio_analysis_df <- function(dat, output_file = OUTPUT_AUDIO_ANALYSIS) {
  
    if (file.exists(output_file)) {
      cat("Audio analysis file exists.  Reading...\n")
      audio_analysis_df <- readRDS(output_file)
    } else {
      cat("Audio analysis file does not exist.  Generating now\n")
      audio_analysis_df <- map_df(dat$track_uri, function(x) {
          audio_analysis <- get_track_audio_analysis(x)    
          return(tibble(track_uri = x,
                        audio_analysis = audio_analysis,
                        content_type = c("meta", "track", "bars", "beats",
                                         "tatums", "sections", "segments")))
      })
  
      album_track_info <- dat %>%
          select(album_uri, album_name, track_name, track_uri, track_n)
  
      audio_analysis_col <- audio_analysis_df %>% select(audio_analysis)
  
      audio_analysis_df <- audio_analysis_df %>%
          select(-audio_analysis) %>%  ## to avoid issues with distinct
          inner_join(album_track_info, by = "track_uri") %>%
          distinct() %>% 
          bind_cols(audio_analysis_col)
      
      audio_analysis_df %>%
        saveRDS(output_file)
      
      beep(8)
    
    }
  
  return(audio_analysis_df)
    
}

rush_audio_analysis <- get_audio_analysis_df(rush_albums)

## Audio analysis file exists.  Reading...

rush_audio_analysis %>%
  glimpse()

## Observations: 1,148
## Variables: 7
## $ track_uri      <chr> "43l8BalXmo4y50runkgJEh", "43l8BalXmo4y50runkgJEh…
## $ content_type   <chr> "meta", "track", "bars", "beats", "tatums", "sect…
## $ album_uri      <chr> "744i0LypfMwHHrKhzsqAx0", "744i0LypfMwHHrKhzsqAx0…
## $ album_name     <chr> "Clockwork Angels", "Clockwork Angels", "Clockwor…
## $ track_name     <chr> "Caravan", "Caravan", "Caravan", "Caravan", "Cara…
## $ track_n        <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3…
## $ audio_analysis <list> [["4.0.0", "Linux", "OK", 0, 1444635788, 15.1851…
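Since the exported audio analysis keeps only track and album metadata, recombining it with track-level audio features is a join on track_uri. A toy version in base R looks like this; the two data frames and their values are made up, and with the real files the same idea applies to rush_audio_analysis and rush_albums.

```r
## Made-up track-level features, one row per track.
toy_features <- data.frame(track_uri = c("uri_a", "uri_b"),
                           energy = c(0.9, 0.8),
                           stringsAsFactors = FALSE)

## Made-up audio analysis index, several rows per track.
toy_analysis <- data.frame(track_uri = c("uri_a", "uri_a", "uri_b"),
                           content_type = c("bars", "beats", "bars"),
                           stringsAsFactors = FALSE)

## Keep one content type, then attach the track-level features.
toy_bars <- toy_analysis[toy_analysis$content_type == "bars", ]
toy_joined <- merge(toy_bars, toy_features, by = "track_uri")
```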

And there we have it! Everything has been cleaned up and exported successfully.

Next Steps

In the next notebook, we will make some plots of musical features. We’ll show Rush’s musical development over time using a variety of useful ggplot2 extensions.

Question

I know many of you are questioning why I called this Part 0. I did this for two reasons:

I originally wrote this as a script, and I didn’t want to renumber the other files I had already written with number prefixes.
Since the file starting with 1 is where the analysis begins, I started this file with 0 because this notebook involves no analysis whatsoever.