Feature Selection with Boruta Part 2: Model Building

Continuing from Part 1, I build random forest models on the processed aptamer data with the goal of minimizing the out-of-bag error rate while also reducing the number of features from over a thousand to a more manageable number. Boruta will allow us to do both for 83% of the molecules in the data.

Feature Selection with Boruta Part 1: Background and Data Prep

In 2016, I gave a talk on the Boruta algorithm for feature selection. Unlike most feature selection procedures, Boruta aims to find all relevant features in a given dataset, meaning all features that provide some level of information. Boruta is particularly useful for the problem of aptamer selection in bioinformatics, which is quite difficult because of the highly unusual structure of the data, and because the processed data has more columns than rows.

Musical Analysis of Rush Part 1: Musical Development Over Time

We will plot audio features obtained via the Spotify API to visually trace Rush's musical development over time, and identify ways in which individual Rush albums differ from the others. For example, I will show that Moving Pictures is not only the most instrumental Rush album, but also marks a peak after which Rush's music got less instrumental. As they added more synthesizers to fit in with their 80s contemporaries, Rush also added more vocals to the mix. However, Rush has never been in danger of becoming an instrumental band, as we see no "Instrumentalness" values above 0.2. For reference, the highest possible "Instrumentalness" value is 1.

Musical Analysis of Rush Part 0: Download the Data

Rush is one of the most prolific progressive rock bands in history. They started out as a hard rock band in the vein of Led Zeppelin, then moved into progressive rock, then added synthesizers in the 80s, and returned to hard rock in the 90s and 00s. This makes them a great candidate for a musical analysis.

Oversampling (Or why there’s no Democratic conspiracy in the polls)

One common complaint I’ve heard throughout the internet, mostly among people who think Clinton “rigged the election”, is that the polls were wrong due to oversampling. To people who do not know statistics, “oversampling” sounds like a conspiracy. It seems to imply that the Clinton campaign intentionally sampled too many black people and Hispanics in order to make it look like she had a greater chance of winning. However, this is a fundamental misunderstanding of what oversampling is. Oversampling isn’t a way for pollsters to blind themselves about demographics and how they vote. In fact, it’s precisely the opposite.
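
To make the idea concrete, here is a minimal sketch of how an oversampled group can be weighted back to its population share. All numbers, group names, and function names below are hypothetical and chosen only for illustration; they are not taken from any real poll, and real pollsters use more elaborate weighting schemes.

```python
# Hypothetical example: group B is 10% of the population but 50% of the sample.
# Oversampling gives a more precise read on group B; weights then restore
# each group to its true population share before computing the topline.

def unweighted_estimate(sample):
    """Naive share of supporters, ignoring how the sample was drawn."""
    total_n = sum(n for n, _ in sample.values())
    total_yes = sum(yes for _, yes in sample.values())
    return total_yes / total_n


def weighted_estimate(sample, population_share):
    """Share of supporters after weighting each group to its population share.

    sample: dict mapping group -> (respondents, supporters)
    population_share: dict mapping group -> fraction of the population
    """
    total_n = sum(n for n, _ in sample.values())
    numerator = 0.0
    denominator = 0.0
    for group, (n, yes) in sample.items():
        sample_share = n / total_n
        # Weight = population share / sample share (below 1 for oversampled groups).
        weight = population_share[group] / sample_share
        numerator += weight * yes
        denominator += weight * n
    return numerator / denominator


# Made-up population shares and sample counts (assumptions, not real data).
population_share = {"A": 0.9, "B": 0.1}
sample = {"A": (500, 200), "B": (500, 400)}  # (respondents, supporters)

naive = unweighted_estimate(sample)
adjusted = weighted_estimate(sample, population_share)
print(f"unweighted: {naive:.2f}, weighted: {adjusted:.2f}")
```

In this made-up case the unweighted figure overstates support because group B is overrepresented, while the weighted figure matches what the assumed population shares imply. The oversampled group ends up counting for less per respondent, not more.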