Feature Selection with Boruta Part 2: Model Building

Continuing from Part 1, I build random forest models on the processed aptamer data with the goal of minimizing the out-of-bag error rate, while also reducing the number of features from over a thousand to a more manageable number. Boruta lets us do both for 83% of the molecules in the data.
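As a rough illustration of that workflow, here is a minimal sketch in Python (the original analysis may well use R). The synthetic data, the number of informative columns, and the `selected` mask standing in for Boruta's confirmed features are hypothetical placeholders, not the aptamer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: over a thousand columns, only a handful of them informative.
X, y = make_classification(n_samples=300, n_features=1200, n_informative=25,
                           shuffle=False, random_state=1)
# Pretend Boruta confirmed the first 25 columns (hypothetical mask).
selected = np.arange(X.shape[1]) < 25

# Random forest on all features; oob_score=True records out-of-bag accuracy.
rf_full = RandomForestClassifier(n_estimators=500, oob_score=True,
                                 random_state=1).fit(X, y)

# Random forest on the Boruta-reduced feature set.
rf_reduced = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    random_state=1).fit(X[:, selected], y)

# Out-of-bag error rate = 1 - out-of-bag accuracy.
print("OOB error, all features:    ", round(1 - rf_full.oob_score_, 3))
print("OOB error, Boruta-selected: ", round(1 - rf_reduced.oob_score_, 3))
```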

Feature Selection with Boruta Part 1: Background and Data Prep

In 2016, I gave a talk on the Boruta algorithm for feature selection. Unlike most feature selection procedures, Boruta aims to find all relevant features in a given dataset, meaning all features that provide some level of information. Boruta is particularly useful for the problem of aptamer selection in bioinformatics, which is quite difficult because of the highly unusual structure of the data, and because the processed data has more columns than rows.
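For readers who want to see what "all relevant" means in practice, here is a tiny sketch using the Python boruta package (BorutaPy); the post itself quite possibly uses the R Boruta package instead. The synthetic data stands in for the aptamer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Synthetic stand-in: 50 features, 10 of which carry information.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=1)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=1)
boruta = BorutaPy(rf, n_estimators='auto', random_state=1)
boruta.fit(X, y)

# Boruta labels every feature as confirmed, tentative, or rejected;
# "all relevant" selection keeps everything that beats its shadow copies.
print("confirmed:", np.where(boruta.support_)[0])
print("tentative:", np.where(boruta.support_weak_)[0])
```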

Musical Analysis of Rush Part 1: Musical Development Over Time

We will plot audio features obtained via the Spotify API to visually trace Rush's musical development over time, and to identify ways in which individual Rush albums differ from the others. For example, I will show that Moving Pictures is not only the most instrumental Rush album, but also marks a peak after which Rush's music became less instrumental. As they added more synthesizers to fit in with their 80s contemporaries, Rush also added more vocals to the mix. However, Rush was never in danger of becoming an instrumental band, as we see no "Instrumentalness" values above 0.2. For reference, the highest possible "Instrumentalness" value is 1.
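The plotting step can be sketched briefly. This assumes Part 0 saved something like rush_audio_features.csv with one row per track and columns such as album and instrumentalness; the file name and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

tracks = pd.read_csv("rush_audio_features.csv")

# Mean instrumentalness per album, keeping the order in which albums
# appear in the file (assumed to be release order).
by_album = tracks.groupby("album", sort=False)["instrumentalness"].mean()

by_album.plot(kind="bar")
plt.ylabel("Mean Instrumentalness (Spotify audio feature, 0 to 1)")
plt.title("Rush: instrumentalness by album")
plt.tight_layout()
plt.show()
```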

Musical Analysis of Rush Part 0: Download the Data

Rush is one of the most prolific progressive rock bands in history. They started out as a hard rock band in the vein of Led Zeppelin, then moved into progressive rock, then added synthesizers in the 80s, and returned to hard rock in the 90s and 00s. This makes them a great candidate for a musical analysis.

Trusting the Black Box

The most important obstacle to wide adoption of self-driving cars is trust. Like nuclear energy, self-driving cars will be forever stigmatized if a single gruesome, highly publicized incident occurs. People feel a sense of control when they drive. They feel like they can always react to what other drivers are doing. Self-driving cars will necessarily give up this form of control, leading to some accidents that would have been preventable if a person were driving. It is not surprising that 56% of people polled in a 2017 Pew survey said they would not want to ride in a self-driving car, and that 72% of those cited safety concerns and a lack of trust as the reason. The technology in self-driving cars requires hundreds of machine learning, computer vision, and robotics experts to develop. However, trust, not technology, will be the primary factor in whether they become the crown jewel of an automated Second Industrial Revolution, or whether they go the way of Google Glass.

Why Every Data Scientist Should Know Command Line Tools

The UNIX command line is great for basic data processing tasks because its tools operate on a file as a stream. If you have a file with millions of rows, performing basic operations in a higher-level language typically requires reading the entire file into memory, which can take an unacceptably long time. With the command line, you can work on an entire file without worrying about your task taking hours, because it is never necessary to read the entire file into memory.
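The post is about shell tools, but the same streaming idea can be sketched in Python for comparison: iterate over the file one line at a time, the way grep or awk would, rather than loading it all. The file name and filter string below are hypothetical.

```python
def count_matching_rows(path, needle):
    """Count lines containing `needle` without reading the whole file into memory."""
    count = 0
    with open(path) as f:
        for line in f:          # the file is streamed one line at a time
            if needle in line:
                count += 1
    return count

print(count_matching_rows("big_file.csv", "2017-01"))
```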

Oversampling (Or why there’s no Democratic conspiracy in the polls)

One common complaint I’ve heard around the internet, mostly among people who think Clinton “rigged the election”, is that the polls were wrong due to oversampling. To people who do not know statistics, “oversampling” sounds like a conspiracy. It seems to imply that the Clinton campaign intentionally sampled too many black and Hispanic voters in order to make it look like she had a greater chance of winning. However, this is a fundamental misunderstanding of what oversampling is. Pollsters oversample a small subgroup to get a more precise estimate for that group, then weight it back down to its true share of the population. Oversampling isn’t a way for pollsters to blind themselves to demographics and how they vote. In fact, it’s precisely the opposite.
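A toy example makes the point: oversample a small subgroup so its estimate is precise, then weight it back down to its true population share and the overall estimate comes out right. The group shares and support rates below are made up for illustration, not real polling data.

```python
import numpy as np

rng = np.random.default_rng(0)

pop_share = {"group_a": 0.88, "group_b": 0.12}   # true population shares
support   = {"group_a": 0.45, "group_b": 0.80}   # true support within each group

# Oversample the small group: half of the 1,000 respondents instead of 12%.
n = {"group_a": 500, "group_b": 500}
samples = {g: rng.random(n[g]) < support[g] for g in n}

# The raw, unweighted estimate is pulled toward the oversampled group...
unweighted = np.mean(np.concatenate(list(samples.values())))

# ...but weighting each group by its population share corrects it.
weighted = sum(pop_share[g] * samples[g].mean() for g in samples)

print(f"unweighted: {unweighted:.3f}   weighted: {weighted:.3f}")
# weighted should land near 0.88 * 0.45 + 0.12 * 0.80 ≈ 0.49
```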

The Question Concerning Technology (in the Statistics classroom)

In the past thirty years, cheap, ubiquitous computing power has allowed the field of statistics to address a wide variety of questions that previously would have been impractical. Any quantity without a closed-form expression would have been virtually impossible to calculate by hand, and problems involving large numbers of coefficients would have been unthinkable to solve. But computing has also made it easier than ever for anyone with little statistical understanding to use a statistical package, treat a procedure like a black box, and obtain a p-value without understanding the assumptions inherent in that procedure. How should the field of statistics teach technology in a way that addresses both the increasing importance of computers and the dangers inherent in using them blindly?