Feature Selection with Boruta Part 2: Model Building

Introduction

In Part 1, we prepared the data. Now how well does Boruta work?

As a reminder, the goal is to detect gene sequences that potentially identify whether a molecule is an aptamer, which is a class of short DNA or RNA sequences with clinical and research applications. This is a difficult problem because each feature is a three, four, or five-letter genetic sequence with no obvious prior indication as to which is important. It is also a small-\(n\) large-\(p\) problem because the data sets we analyze have many more features than observations.

We will see how well Boruta works in aptamer detection for 23 different molecules. Because this involves the same work on every molecule’s data, we will liberally use functions from the purrr package to make our lives easier.

Prep

First, some basic prep. Because this is part of an RStudio Project, the here package allows us to easily identify the project directory.

library(Boruta)
## Loading required package: ranger
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
## 
##     importance
library(here)
## here() starts at /Users/brandonsherman/shermstats/content/post/boruta
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine()  masks randomForest::combine()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ ggplot2::margin() masks randomForest::margin()
library(patchwork)

SEED <- 314

input_file <- here::here("data", "atp.rds")

kmer_df <- readRDS(input_file)
kmer_df$is_aptamer <- as.factor(kmer_df$is_aptamer)

head(kmer_df)
## # A tibble: 6 x 1,347
##   molecule is_aptamer sequence   AAA   AAC   AAG   AAT   ACA   ACC   ACG
##   <chr>    <fct>      <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATP      1          TCTGTTC…     0     0     1     1     1     0     0
## 2 ATP      1          GCTGAGG…     0     0     1     1     0     1     1
## 3 ATP      1          TCTGTTG…     1     0     1     1     0     0     0
## 4 ATP      1          GGACTTC…     1     1     1     0     1     1     0
## 5 ATP      1          TCTGCGC…     0     0     1     1     1     1     0
## 6 ATP      1          GCTGAGG…     0     0     1     1     0     1     1
## # … with 1,337 more variables: ACT <dbl>, AGA <dbl>, AGC <dbl>, AGG <dbl>,
## #   AGT <dbl>, ATA <dbl>, ATC <dbl>, ATG <dbl>, ATT <dbl>, CAA <dbl>,
## #   CAC <dbl>, CAG <dbl>, CAT <dbl>, CCA <dbl>, CCC <dbl>, CCG <dbl>,
## #   CCT <dbl>, CGA <dbl>, CGC <dbl>, CGG <dbl>, CGT <dbl>, CTA <dbl>,
## #   CTC <dbl>, CTG <dbl>, CTT <dbl>, GAA <dbl>, GAC <dbl>, GAG <dbl>,
## #   GAT <dbl>, GCA <dbl>, GCC <dbl>, GCG <dbl>, GCT <dbl>, GGA <dbl>,
## #   GGC <dbl>, GGG <dbl>, GGT <dbl>, GTA <dbl>, GTC <dbl>, GTG <dbl>,
## #   GTT <dbl>, TAA <dbl>, TAC <dbl>, TAG <dbl>, TAT <dbl>, TCA <dbl>,
## #   TCC <dbl>, TCG <dbl>, TCT <dbl>, TGA <dbl>, TGC <dbl>, TGG <dbl>,
## #   TGT <dbl>, TTA <dbl>, TTC <dbl>, TTG <dbl>, TTT <dbl>, AAAA <dbl>,
## #   AAAC <dbl>, AAAG <dbl>, AAAT <dbl>, AACA <dbl>, AACC <dbl>,
## #   AACG <dbl>, AACT <dbl>, AAGA <dbl>, AAGC <dbl>, AAGG <dbl>,
## #   AAGT <dbl>, AATA <dbl>, AATC <dbl>, AATG <dbl>, AATT <dbl>,
## #   ACAA <dbl>, ACAC <dbl>, ACAG <dbl>, ACAT <dbl>, ACCA <dbl>,
## #   ACCC <dbl>, ACCG <dbl>, ACCT <dbl>, ACGA <dbl>, ACGC <dbl>,
## #   ACGG <dbl>, ACGT <dbl>, ACTA <dbl>, ACTC <dbl>, ACTG <dbl>,
## #   ACTT <dbl>, AGAA <dbl>, AGAC <dbl>, AGAG <dbl>, AGAT <dbl>,
## #   AGCA <dbl>, AGCC <dbl>, AGCG <dbl>, AGCT <dbl>, AGGA <dbl>,
## #   AGGC <dbl>, AGGG <dbl>, …

Modeling

We will follow the Boruta authors’ guide and do the following modeling for each molecule:

  1. Fit a random forest on the full set of 1344 predictors.
  2. Run Boruta.
  3. Fit a new random forest on the reduced set of predictors identified by Boruta.
  4. Compare out-of-bag error across the two models.

Build Models

Initial Random Forest and Boruta

First we define our model functions to be applied to every dataset. aptamer_random_forest will build a random forest on a particular aptamer dataset, where the response is whether the sequence is an aptamer, and the predictors are the presence and absence of a given k-mer. aptamer_boruta is similar, but it runs Boruta instead.

aptamer_random_forest <- function(dat) {
  ## Explicitly specify X and y to avoid scoping 
  ## issues with formulas
  
  X <- dat %>% select(-is_aptamer)
  y <- dat$is_aptamer
  
  randomForest(X, y)
}

aptamer_boruta <- function(dat) {
  
  X <- dat %>% select(-is_aptamer)
  y <- dat$is_aptamer
  
  Boruta(X, y, doTrace = 1)
}

Now let’s build the models. Every output we wish to keep here will be appended as a column via purrr::map. map takes in data and a function, and returns an output for every observation. Its siblings, like map_int and map_dbl, do the same thing but return an int or a double, respectively (and if they can’t, return an error).

The below block of code nests each molecule’s data into a list column, runs a separate random forest model (with default parameters) on each nested data.frame, and then runs a Boruta model on the same data. I have intentionally included the Boruta output because it provides a play-by-play of what the algorithm is doing. But because there’s a lot of it, I suggest looking over some of it and clicking this link to be taken to the next section.

## message=FALSE to avoid excessive output when compiled, but it will 
## show up if you run the code in this chunk individually.
set.seed(SEED)
model_df <- kmer_df %>%
  select(-sequence) %>%
  nest(-molecule) %>%
  mutate(rf_orig_fit = map(data, aptamer_random_forest))

set.seed(SEED)
model_df <- model_df %>%  ## Separate in case randomization is different for the Boruta fit
              mutate(boruta_fit = map(data, aptamer_boruta))
## After 18 iterations, +23 secs:
##  confirmed 6 attributes: AATT, AATTG, ATTGC, CTGAG, GAAGA and 1 more;
##  rejected 1233 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1228 more;
##  still have 105 attributes left.
## After 22 iterations, +24 secs:
##  confirmed 4 attributes: GAGAT, GGAAG, GTAAT, TAAT;
##  rejected 17 attributes: AGACA, AGACC, AGTG, ATT, CAGGA and 12 more;
##  still have 84 attributes left.
## After 26 iterations, +24 secs:
##  confirmed 3 attributes: ATTGA, CTGC, GCTGA;
##  rejected 5 attributes: CCGTC, CGTA, GTCCG, TAT, TTGA;
##  still have 76 attributes left.
## After 30 iterations, +25 secs:
##  rejected 1 attribute: GTGCG;
##  still have 75 attributes left.
## After 33 iterations, +25 secs:
##  confirmed 4 attributes: AAGA, ATTG, CCGT, TTGAG;
##  rejected 6 attributes: AAT, AGATT, AGGCT, CTCTT, GCTCT and 1 more;
##  still have 65 attributes left.
## After 36 iterations, +25 secs:
##  confirmed 3 attributes: AAGAC, AGGGC, TGAGA;
##  rejected 2 attributes: CCG, TTGC;
##  still have 60 attributes left.
## After 39 iterations, +25 secs:
##  confirmed 1 attribute: TGAGT;
##  still have 59 attributes left.
## After 43 iterations, +26 secs:
##  rejected 2 attributes: CGTCC, GATT;
##  still have 57 attributes left.
## After 49 iterations, +26 secs:
##  confirmed 1 attribute: AGA;
##  rejected 5 attributes: AGATC, GAGTG, GCGG, TAA, TGGG;
##  still have 51 attributes left.
## After 51 iterations, +26 secs:
##  confirmed 2 attributes: CAT, CGTAA;
##  still have 49 attributes left.
## After 54 iterations, +27 secs:
##  rejected 1 attribute: TTGCC;
##  still have 48 attributes left.
## After 63 iterations, +27 secs:
##  confirmed 1 attribute: CAACT;
##  still have 47 attributes left.
## After 66 iterations, +28 secs:
##  rejected 2 attributes: CCGGG, CCGTA;
##  still have 45 attributes left.
## After 68 iterations, +28 secs:
##  confirmed 1 attribute: GATTG;
##  still have 44 attributes left.
## After 74 iterations, +28 secs:
##  confirmed 1 attribute: GCGGC;
##  still have 43 attributes left.
## After 76 iterations, +28 secs:
##  confirmed 1 attribute: CTGCT;
##  still have 42 attributes left.
## After 82 iterations, +29 secs:
##  confirmed 1 attribute: GCCGT;
##  still have 41 attributes left.
## After 92 iterations, +30 secs:
##  confirmed 3 attributes: ATC, ATG, GGAA;
##  still have 38 attributes left.
## After 95 iterations, +30 secs:
##  confirmed 1 attribute: GAGGG;
##  still have 37 attributes left.
## After 97 iterations, +30 secs:
##  confirmed 2 attributes: GAAG, TGAG;
##  still have 35 attributes left.
## After 18 iterations, +18 secs:
##  confirmed 5 attributes: GCGG, GCGGT, GGTTG, GTTGA, TTGA;
##  rejected 1307 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1302 more;
##  still have 32 attributes left.
## After 22 iterations, +18 secs:
##  confirmed 1 attribute: GTTG;
##  rejected 2 attributes: TAA, TACT;
##  still have 29 attributes left.
## After 26 iterations, +19 secs:
##  rejected 4 attributes: CGG, GCTA, GTT, TCGAA;
##  still have 25 attributes left.
## After 33 iterations, +19 secs:
##  confirmed 1 attribute: TGA;
##  still have 24 attributes left.
## After 36 iterations, +19 secs:
##  confirmed 2 attributes: CAAT, CGGT;
##  rejected 1 attribute: CGCGG;
##  still have 21 attributes left.
## After 46 iterations, +19 secs:
##  confirmed 1 attribute: CGGTT;
##  still have 20 attributes left.
## After 51 iterations, +19 secs:
##  confirmed 1 attribute: GTAGA;
##  still have 19 attributes left.
## After 63 iterations, +20 secs:
##  confirmed 1 attribute: TCGA;
##  still have 18 attributes left.
## After 66 iterations, +20 secs:
##  confirmed 2 attributes: GTCGA, TGAC;
##  still have 16 attributes left.
## After 68 iterations, +20 secs:
##  confirmed 1 attribute: TTG;
##  still have 15 attributes left.
## After 71 iterations, +20 secs:
##  confirmed 1 attribute: TCGG;
##  still have 14 attributes left.
## After 82 iterations, +20 secs:
##  confirmed 2 attributes: TAGA, TTGAC;
##  still have 12 attributes left.
## After 95 iterations, +21 secs:
##  rejected 1 attribute: GGA;
##  still have 11 attributes left.
## After 18 iterations, +20 secs:
##  confirmed 3 attributes: AAGGG, AGGG, TGTT;
##  rejected 1300 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1295 more;
##  still have 41 attributes left.
## After 22 iterations, +21 secs:
##  rejected 9 attributes: AAAGG, ACA, ACAA, AGGTG, ATGAG and 4 more;
##  still have 32 attributes left.
## After 26 iterations, +21 secs:
##  rejected 1 attribute: AGG;
##  still have 31 attributes left.
## After 39 iterations, +22 secs:
##  confirmed 2 attributes: AGGGT, AGGT;
##  still have 29 attributes left.
## After 60 iterations, +22 secs:
##  confirmed 1 attribute: AGAT;
##  rejected 2 attributes: GAGGA, GTGCT;
##  still have 26 attributes left.
## After 63 iterations, +23 secs:
##  rejected 1 attribute: TGGAA;
##  still have 25 attributes left.
## After 68 iterations, +23 secs:
##  confirmed 1 attribute: ATTAG;
##  still have 24 attributes left.
## After 74 iterations, +23 secs:
##  confirmed 1 attribute: GTC;
##  still have 23 attributes left.
## After 90 iterations, +24 secs:
##  confirmed 1 attribute: GGGT;
##  still have 22 attributes left.
## After 18 iterations, +22 secs:
##  confirmed 1 attribute: CAGC;
##  rejected 1314 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1309 more;
##  still have 29 attributes left.
## After 22 iterations, +22 secs:
##  rejected 3 attributes: CGT, GTGTG, TTTCG;
##  still have 26 attributes left.
## After 30 iterations, +22 secs:
##  confirmed 1 attribute: CAGCA;
##  still have 25 attributes left.
## After 39 iterations, +23 secs:
##  confirmed 1 attribute: GGGTA;
##  still have 24 attributes left.
## After 43 iterations, +23 secs:
##  confirmed 2 attributes: AGAGT, GGTC;
##  still have 22 attributes left.
## After 49 iterations, +23 secs:
##  confirmed 1 attribute: AGCAC;
##  rejected 1 attribute: TGTGC;
##  still have 20 attributes left.
## After 51 iterations, +23 secs:
##  confirmed 1 attribute: CGTGT;
##  still have 19 attributes left.
## After 60 iterations, +24 secs:
##  confirmed 1 attribute: TAGTG;
##  still have 18 attributes left.
## After 68 iterations, +24 secs:
##  confirmed 2 attributes: AGGGT, GAGA;
##  still have 16 attributes left.
## After 74 iterations, +24 secs:
##  confirmed 1 attribute: TCAA;
##  still have 15 attributes left.
## After 76 iterations, +24 secs:
##  confirmed 1 attribute: ACATA;
##  still have 14 attributes left.
## After 79 iterations, +25 secs:
##  confirmed 1 attribute: CCGCG;
##  still have 13 attributes left.
## After 82 iterations, +25 secs:
##  confirmed 1 attribute: AGCTA;
##  still have 12 attributes left.
## After 84 iterations, +25 secs:
##  confirmed 1 attribute: GAGAC;
##  still have 11 attributes left.
## After 90 iterations, +25 secs:
##  confirmed 1 attribute: GAGGC;
##  still have 10 attributes left.
## After 95 iterations, +25 secs:
##  confirmed 1 attribute: GGGTC;
##  still have 9 attributes left.
## After 18 iterations, +23 secs:
##  rejected 1309 attributes: AAA, AAAAA, AAAAC, AAAAG, AAAAT and 1304 more;
##  still have 35 attributes left.
## After 22 iterations, +23 secs:
##  confirmed 1 attribute: GTGTG;
##  rejected 6 attributes: ACGTT, AGAC, ATGC, CTCG, GGTGT and 1 more;
##  still have 28 attributes left.
## After 26 iterations, +24 secs:
##  confirmed 1 attribute: ATGCT;
##  rejected 1 attribute: CATC;
##  still have 26 attributes left.
## After 30 iterations, +24 secs:
##  rejected 1 attribute: AGC;
##  still have 25 attributes left.
## After 39 iterations, +24 secs:
##  rejected 1 attribute: ATCTC;
##  still have 24 attributes left.
## After 43 iterations, +24 secs:
##  confirmed 1 attribute: CATGC;
##  still have 23 attributes left.
## After 57 iterations, +25 secs:
##  confirmed 1 attribute: GTAAG;
##  still have 22 attributes left.
## After 60 iterations, +25 secs:
##  confirmed 1 attribute: TATC;
##  still have 21 attributes left.
## After 63 iterations, +25 secs:
##  confirmed 1 attribute: ATAC;
##  still have 20 attributes left.
## After 66 iterations, +25 secs:
##  rejected 1 attribute: CAGG;
##  still have 19 attributes left.
## After 68 iterations, +25 secs:
##  confirmed 2 attributes: AAAA, CATA;
##  still have 17 attributes left.
## After 76 iterations, +26 secs:
##  confirmed 1 attribute: TCGA;
##  still have 16 attributes left.
## After 82 iterations, +26 secs:
##  confirmed 1 attribute: CAC;
##  still have 15 attributes left.
## After 84 iterations, +26 secs:
##  confirmed 1 attribute: GCC;
##  still have 14 attributes left.
## After 87 iterations, +26 secs:
##  confirmed 1 attribute: CATG;
##  still have 13 attributes left.
## After 18 iterations, +19 secs:
##  confirmed 5 attributes: AAGA, AAGAA, AGAA, CAAGA, TGCT;
##  rejected 1316 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1311 more;
##  still have 23 attributes left.
## After 22 iterations, +19 secs:
##  confirmed 2 attributes: AGAAA, CAAG;
##  rejected 4 attributes: CGTTG, GAAA, GAACT, GCT;
##  still have 17 attributes left.
## After 43 iterations, +20 secs:
##  confirmed 1 attribute: GATGC;
##  still have 16 attributes left.
## After 46 iterations, +20 secs:
##  confirmed 1 attribute: TGCTA;
##  still have 15 attributes left.
## After 49 iterations, +20 secs:
##  confirmed 1 attribute: ATGC;
##  still have 14 attributes left.
## After 57 iterations, +20 secs:
##  confirmed 3 attributes: CAAGG, TACAA, TCCAA;
##  still have 11 attributes left.
## After 60 iterations, +20 secs:
##  rejected 1 attribute: TGCTC;
##  still have 10 attributes left.
## After 68 iterations, +20 secs:
##  confirmed 1 attribute: CTG;
##  still have 9 attributes left.
## After 71 iterations, +21 secs:
##  confirmed 1 attribute: TTACA;
##  still have 8 attributes left.
## After 74 iterations, +21 secs:
##  confirmed 1 attribute: CCAAG;
##  still have 7 attributes left.
## After 84 iterations, +21 secs:
##  confirmed 1 attribute: GTGA;
##  still have 6 attributes left.
## After 18 iterations, +25 secs:
##  confirmed 1 attribute: ACTC;
##  rejected 1302 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1297 more;
##  still have 41 attributes left.
## After 22 iterations, +25 secs:
##  confirmed 1 attribute: TTCTC;
##  rejected 5 attributes: ACT, CACTG, CCTTC, GCC, TTTTT;
##  still have 35 attributes left.
## After 30 iterations, +25 secs:
##  confirmed 2 attributes: ACTCC, CTTA;
##  rejected 1 attribute: TAA;
##  still have 32 attributes left.
## After 36 iterations, +26 secs:
##  confirmed 2 attributes: CTCCC, TCCCA;
##  still have 30 attributes left.
## After 43 iterations, +26 secs:
##  rejected 1 attribute: TCACC;
##  still have 29 attributes left.
## After 46 iterations, +26 secs:
##  confirmed 1 attribute: GCAG;
##  still have 28 attributes left.
## After 54 iterations, +27 secs:
##  confirmed 2 attributes: TCTC, TCTCC;
##  still have 26 attributes left.
## After 57 iterations, +27 secs:
##  confirmed 2 attributes: CTCC, TCCA;
##  still have 24 attributes left.
## After 60 iterations, +27 secs:
##  rejected 1 attribute: TCCT;
##  still have 23 attributes left.
## After 71 iterations, +27 secs:
##  rejected 2 attributes: CAG, GATC;
##  still have 21 attributes left.
## After 74 iterations, +27 secs:
##  confirmed 1 attribute: CTCTC;
##  still have 20 attributes left.
## After 76 iterations, +28 secs:
##  confirmed 1 attribute: TCTCT;
##  still have 19 attributes left.
## After 84 iterations, +28 secs:
##  confirmed 1 attribute: GGATA;
##  still have 18 attributes left.
## After 95 iterations, +28 secs:
##  rejected 1 attribute: ACCC;
##  still have 17 attributes left.
## After 18 iterations, +14 secs:
##  confirmed 1 attribute: GGGA;
##  rejected 1322 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1317 more;
##  still have 21 attributes left.
## After 22 iterations, +14 secs:
##  rejected 2 attributes: CGTAG, GGGAT;
##  still have 19 attributes left.
## After 60 iterations, +15 secs:
##  rejected 1 attribute: AGGG;
##  still have 18 attributes left.
## After 76 iterations, +16 secs:
##  confirmed 1 attribute: GCGT;
##  still have 17 attributes left.
## After 92 iterations, +16 secs:
##  confirmed 1 attribute: AAGA;
##  still have 16 attributes left.
## After 95 iterations, +17 secs:
##  confirmed 1 attribute: GGCGT;
##  still have 15 attributes left.
## After 18 iterations, +23 secs:
##  confirmed 5 attributes: AATCA, CAT, TCA, TCAC, TCAT;
##  rejected 1294 attributes: AAAA, AAAAA, AAAAC, AAAAG, AAAAT and 1289 more;
##  still have 45 attributes left.
## After 22 iterations, +23 secs:
##  confirmed 5 attributes: ACTCA, ATCA, ATCAC, GATTC, GTCAT;
##  rejected 6 attributes: AAC, ACTAT, TATG, TGAGT, TTT and 1 more;
##  still have 34 attributes left.
## After 26 iterations, +23 secs:
##  confirmed 1 attribute: TGATT;
##  rejected 2 attributes: CAC, TCATC;
##  still have 31 attributes left.
## After 30 iterations, +24 secs:
##  confirmed 2 attributes: ATGA, TCATA;
##  rejected 4 attributes: AAA, CACT, TCGA, TTCAT;
##  still have 25 attributes left.
## After 33 iterations, +24 secs:
##  confirmed 1 attribute: TCACT;
##  still have 24 attributes left.
## After 36 iterations, +24 secs:
##  rejected 1 attribute: AGTA;
##  still have 23 attributes left.
## After 39 iterations, +24 secs:
##  confirmed 1 attribute: TACTC;
##  still have 22 attributes left.
## After 43 iterations, +24 secs:
##  confirmed 1 attribute: AGTCA;
##  still have 21 attributes left.
## After 51 iterations, +24 secs:
##  confirmed 1 attribute: TGAT;
##  still have 20 attributes left.
## After 54 iterations, +25 secs:
##  confirmed 2 attributes: ATTC, CTCA;
##  still have 18 attributes left.
## After 57 iterations, +25 secs:
##  confirmed 1 attribute: AATC;
##  still have 17 attributes left.
## After 63 iterations, +25 secs:
##  rejected 1 attribute: ACAT;
##  still have 16 attributes left.
## After 76 iterations, +25 secs:
##  confirmed 1 attribute: TTACT;
##  still have 15 attributes left.
## After 97 iterations, +26 secs:
##  rejected 1 attribute: ATCAT;
##  still have 14 attributes left.
## After 18 iterations, +1.2 mins:
##  confirmed 34 attributes: ACG, ACGC, ACGT, ACGTA, ATT and 29 more;
##  rejected 1173 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1168 more;
##  still have 137 attributes left.
## After 22 iterations, +1.2 mins:
##  confirmed 4 attributes: CCTGG, CGTCG, GCACG, TGG;
##  rejected 9 attributes: AATGC, ATCAC, CTC, CTGTT, TACGG and 4 more;
##  still have 124 attributes left.
## After 26 iterations, +1.2 mins:
##  confirmed 7 attributes: AAGTT, AATG, ACGCA, AGTTG, GCATG and 2 more;
##  rejected 5 attributes: CCGG, GAGG, GGGA, TAG, TGTA;
##  still have 112 attributes left.
## After 30 iterations, +1.3 mins:
##  confirmed 2 attributes: CATGC, TAAT;
##  rejected 3 attributes: CAGCA, GGGAT, GTTC;
##  still have 107 attributes left.
## After 33 iterations, +1.3 mins:
##  confirmed 2 attributes: GGGCC, GTGGG;
##  rejected 7 attributes: AAGC, AAGT, ACGA, ACGAG, CTG and 2 more;
##  still have 98 attributes left.
## After 36 iterations, +1.3 mins:
##  confirmed 3 attributes: GAAA, GAAAG, TCCGG;
##  still have 95 attributes left.
## After 39 iterations, +1.3 mins:
##  rejected 1 attribute: GTTGG;
##  still have 94 attributes left.
## After 43 iterations, +1.3 mins:
##  confirmed 7 attributes: AGTGG, CACGC, CATCT, CCGTA, GCCTG and 2 more;
##  still have 87 attributes left.
## After 46 iterations, +1.4 mins:
##  confirmed 4 attributes: AGGAG, CGTAT, TCG, TGCCT;
##  still have 83 attributes left.
## After 49 iterations, +1.4 mins:
##  confirmed 1 attribute: AGCGT;
##  rejected 4 attributes: AGGT, ATGA, CCT, CGAC;
##  still have 78 attributes left.
## After 51 iterations, +1.4 mins:
##  confirmed 2 attributes: GTAA, TACGC;
##  rejected 2 attributes: AAC, TCT;
##  still have 74 attributes left.
## After 54 iterations, +1.4 mins:
##  confirmed 1 attribute: ACCGT;
##  still have 73 attributes left.
## After 57 iterations, +1.4 mins:
##  confirmed 3 attributes: CGGT, TTC, TTCG;
##  still have 70 attributes left.
## After 60 iterations, +1.5 mins:
##  confirmed 1 attribute: AAAGT;
##  rejected 1 attribute: TCGG;
##  still have 68 attributes left.
## After 63 iterations, +1.5 mins:
##  confirmed 1 attribute: GGAGC;
##  rejected 1 attribute: GCTAT;
##  still have 66 attributes left.
## After 66 iterations, +1.5 mins:
##  confirmed 2 attributes: AATGA, CGACC;
##  still have 64 attributes left.
## After 74 iterations, +1.6 mins:
##  confirmed 2 attributes: AGT, TAGT;
##  still have 62 attributes left.
## After 76 iterations, +1.6 mins:
##  confirmed 6 attributes: AGTT, CACAG, CGCAT, GCAT, GCCT and 1 more;
##  rejected 1 attribute: TGGGG;
##  still have 55 attributes left.
## After 82 iterations, +1.6 mins:
##  rejected 1 attribute: CAGG;
##  still have 54 attributes left.
## After 84 iterations, +1.6 mins:
##  rejected 1 attribute: TTCT;
##  still have 53 attributes left.
## After 87 iterations, +1.7 mins:
##  confirmed 2 attributes: AAG, GGTA;
##  still have 51 attributes left.
## After 90 iterations, +1.7 mins:
##  confirmed 2 attributes: ATGCC, GGT;
##  still have 49 attributes left.
## After 92 iterations, +1.7 mins:
##  confirmed 1 attribute: ATG;
##  still have 48 attributes left.
## After 95 iterations, +1.7 mins:
##  rejected 1 attribute: CACTG;
##  still have 47 attributes left.
## After 97 iterations, +1.7 mins:
##  confirmed 1 attribute: GTACG;
##  still have 46 attributes left.
## After 18 iterations, +21 secs:
##  rejected 1328 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1323 more;
##  still have 16 attributes left.
## After 22 iterations, +21 secs:
##  confirmed 1 attribute: CTCTC;
##  still have 15 attributes left.
## After 26 iterations, +21 secs:
##  rejected 1 attribute: TATC;
##  still have 14 attributes left.
## After 30 iterations, +21 secs:
##  rejected 1 attribute: GTG;
##  still have 13 attributes left.
## After 43 iterations, +22 secs:
##  confirmed 1 attribute: CCGT;
##  still have 12 attributes left.
## After 51 iterations, +22 secs:
##  confirmed 1 attribute: ATCC;
##  still have 11 attributes left.
## After 60 iterations, +22 secs:
##  rejected 1 attribute: AGCT;
##  still have 10 attributes left.
## After 63 iterations, +22 secs:
##  rejected 1 attribute: CCCG;
##  still have 9 attributes left.
## After 71 iterations, +22 secs:
##  rejected 1 attribute: CCTCT;
##  still have 8 attributes left.
## After 79 iterations, +23 secs:
##  confirmed 1 attribute: TGTG;
##  still have 7 attributes left.
## After 84 iterations, +23 secs:
##  confirmed 1 attribute: GGG;
##  still have 6 attributes left.
## After 18 iterations, +15 secs:
##  confirmed 11 attributes: AAACT, AACT, AACTC, AAGA, AGAA and 6 more;
##  rejected 1259 attributes: AAAA, AAAAA, AAAAC, AAAAT, AAACA and 1254 more;
##  still have 74 attributes left.
## After 22 iterations, +16 secs:
##  confirmed 1 attribute: ACTC;
##  rejected 15 attributes: ACTGT, AGA, ATA, CACT, CCCAG and 10 more;
##  still have 58 attributes left.
## After 26 iterations, +16 secs:
##  confirmed 2 attributes: ACTTC, GCAC;
##  rejected 2 attributes: ACT, CAAAA;
##  still have 54 attributes left.
## After 30 iterations, +16 secs:
##  confirmed 1 attribute: TGGC;
##  rejected 3 attributes: CAT, GCACT, GTTG;
##  still have 50 attributes left.
## After 33 iterations, +16 secs:
##  confirmed 1 attribute: TCGG;
##  rejected 5 attributes: CAGG, CGA, CTGTG, GCCA, TGTG;
##  still have 44 attributes left.
## After 36 iterations, +17 secs:
##  confirmed 1 attribute: GGAA;
##  still have 43 attributes left.
## After 39 iterations, +17 secs:
##  confirmed 1 attribute: AAGAC;
##  rejected 1 attribute: CTGT;
##  still have 41 attributes left.
## After 43 iterations, +17 secs:
##  confirmed 2 attributes: CGGTG, GGAC;
##  rejected 2 attributes: AGCAA, CGT;
##  still have 37 attributes left.
## After 49 iterations, +17 secs:
##  rejected 1 attribute: AAAAG;
##  still have 36 attributes left.
## After 54 iterations, +17 secs:
##  confirmed 1 attribute: TTC;
##  rejected 1 attribute: GAAA;
##  still have 34 attributes left.
## After 60 iterations, +18 secs:
##  confirmed 3 attributes: GGCAC, TCGGT, TGTGG;
##  still have 31 attributes left.
## After 74 iterations, +18 secs:
##  confirmed 1 attribute: AAAC;
##  still have 30 attributes left.
## After 76 iterations, +18 secs:
##  confirmed 2 attributes: TTCG, TTCGG;
##  rejected 1 attribute: GGT;
##  still have 27 attributes left.
## After 84 iterations, +19 secs:
##  rejected 1 attribute: ACTCG;
##  still have 26 attributes left.
## After 92 iterations, +19 secs:
##  confirmed 1 attribute: GGGA;
##  still have 25 attributes left.
## After 18 iterations, +45 secs:
##  confirmed 4 attributes: ATCAC, GGAT, GGC, GTCTG;
##  rejected 1256 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1251 more;
##  still have 84 attributes left.
## After 22 iterations, +45 secs:
##  confirmed 1 attribute: ATCA;
##  rejected 19 attributes: AAAT, AAG, AGGC, ATCG, ATCT and 14 more;
##  still have 64 attributes left.
## After 26 iterations, +46 secs:
##  rejected 3 attributes: ATCAG, CTGCA, TGCAG;
##  still have 61 attributes left.
## After 30 iterations, +46 secs:
##  confirmed 2 attributes: GCAG, TGCA;
##  rejected 3 attributes: AGTC, CAGGA, TCAAG;
##  still have 56 attributes left.
## After 33 iterations, +47 secs:
##  confirmed 1 attribute: TTTC;
##  rejected 2 attributes: CAAGT, GGAG;
##  still have 53 attributes left.
## After 36 iterations, +47 secs:
##  confirmed 2 attributes: GCTC, GGTCT;
##  rejected 3 attributes: ACAA, CACAG, CTGT;
##  still have 48 attributes left.
## After 39 iterations, +47 secs:
##  rejected 1 attribute: CCTCT;
##  still have 47 attributes left.
## After 49 iterations, +48 secs:
##  confirmed 2 attributes: GGTC, TACA;
##  still have 45 attributes left.
## After 51 iterations, +49 secs:
##  confirmed 1 attribute: GATCA;
##  still have 44 attributes left.
## After 54 iterations, +49 secs:
##  confirmed 1 attribute: AGT;
##  still have 43 attributes left.
## After 57 iterations, +49 secs:
##  confirmed 1 attribute: TCCTC;
##  still have 42 attributes left.
## After 60 iterations, +49 secs:
##  confirmed 2 attributes: GCA, GGATT;
##  still have 40 attributes left.
## After 63 iterations, +50 secs:
##  confirmed 2 attributes: AACG, TCAC;
##  still have 38 attributes left.
## After 66 iterations, +50 secs:
##  confirmed 1 attribute: ACGGG;
##  rejected 1 attribute: CTC;
##  still have 36 attributes left.
## After 68 iterations, +50 secs:
##  rejected 1 attribute: CGGC;
##  still have 35 attributes left.
## After 74 iterations, +51 secs:
##  confirmed 1 attribute: TTTTC;
##  still have 34 attributes left.
## After 84 iterations, +52 secs:
##  rejected 1 attribute: TTGGA;
##  still have 33 attributes left.
## After 87 iterations, +52 secs:
##  confirmed 1 attribute: CATCA;
##  still have 32 attributes left.
## After 92 iterations, +53 secs:
##  rejected 1 attribute: AAT;
##  still have 31 attributes left.
## After 18 iterations, +14 secs:
##  rejected 1338 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1333 more;
##  still have 6 attributes left.
## After 71 iterations, +14 secs:
##  confirmed 1 attribute: AGGG;
##  still have 5 attributes left.
## After 95 iterations, +15 secs:
##  confirmed 1 attribute: TTCTC;
##  still have 4 attributes left.
## After 18 iterations, +13 secs:
##  confirmed 4 attributes: ACCCG, CCCG, CCT, CGT;
##  rejected 1250 attributes: AAAA, AAAAA, AAAAC, AAAAT, AAAC and 1245 more;
##  still have 90 attributes left.
## After 22 iterations, +13 secs:
##  confirmed 1 attribute: ACG;
##  rejected 31 attributes: AAAAG, AAAG, AAAGA, ACAGA, ACTA and 26 more;
##  still have 58 attributes left.
## After 26 iterations, +13 secs:
##  rejected 2 attributes: CCCT, TGGT;
##  still have 56 attributes left.
## After 33 iterations, +13 secs:
##  rejected 4 attributes: AACAT, CTAC, TATGG, TTC;
##  still have 52 attributes left.
## After 36 iterations, +14 secs:
##  confirmed 3 attributes: CCCGG, GCA, GCTC;
##  rejected 6 attributes: AAGA, CAGTT, CTGT, GCGCC, GGCGC and 1 more;
##  still have 43 attributes left.
## After 39 iterations, +14 secs:
##  rejected 3 attributes: CGGCG, CTTAC, TCCG;
##  still have 40 attributes left.
## After 43 iterations, +14 secs:
##  confirmed 1 attribute: CTAT;
##  rejected 1 attribute: ATC;
##  still have 38 attributes left.
## After 46 iterations, +14 secs:
##  confirmed 2 attributes: CCGC, TACTT;
##  rejected 1 attribute: CTGA;
##  still have 35 attributes left.
## After 49 iterations, +14 secs:
##  rejected 3 attributes: ATG, CCCTT, TACT;
##  still have 32 attributes left.
## After 51 iterations, +14 secs:
##  confirmed 2 attributes: AGTTC, GCCC;
##  still have 30 attributes left.
## After 54 iterations, +14 secs:
##  confirmed 1 attribute: GGAAA;
##  still have 29 attributes left.
## After 57 iterations, +15 secs:
##  confirmed 1 attribute: CGCG;
##  rejected 1 attribute: TGGGC;
##  still have 27 attributes left.
## After 60 iterations, +15 secs:
##  confirmed 1 attribute: GCCG;
##  still have 26 attributes left.
## After 66 iterations, +15 secs:
##  confirmed 1 attribute: ACTCC;
##  still have 25 attributes left.
## After 82 iterations, +16 secs:
##  confirmed 1 attribute: ACTGG;
##  still have 24 attributes left.
## After 87 iterations, +16 secs:
##  confirmed 1 attribute: CAG;
##  still have 23 attributes left.
## After 97 iterations, +16 secs:
##  rejected 1 attribute: ATGG;
##  still have 22 attributes left.
## After 18 iterations, +1.1 mins:
##  confirmed 29 attributes: AACCC, AATC, AATCC, ATAA, ATTC and 24 more;
##  rejected 1255 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1250 more;
##  still have 60 attributes left.
## After 22 iterations, +1.1 mins:
##  confirmed 2 attributes: AGC, ATA;
##  rejected 9 attributes: AAG, ACT, AGTTG, CCCGC, CGGGG and 4 more;
##  still have 49 attributes left.
## After 26 iterations, +1.1 mins:
##  confirmed 3 attributes: AACC, GCGAT, TAA;
##  rejected 7 attributes: ATTGC, CCCCC, CGAT, GAACC, TAAG and 2 more;
##  still have 39 attributes left.
## After 30 iterations, +1.1 mins:
##  confirmed 1 attribute: GAA;
##  rejected 2 attributes: CGGA, GCG;
##  still have 36 attributes left.
## After 33 iterations, +1.2 mins:
##  rejected 1 attribute: AAT;
##  still have 35 attributes left.
## After 39 iterations, +1.2 mins:
##  confirmed 1 attribute: GAAT;
##  still have 34 attributes left.
## After 43 iterations, +1.2 mins:
##  confirmed 1 attribute: TAAA;
##  still have 33 attributes left.
## After 46 iterations, +1.2 mins:
##  confirmed 1 attribute: TAGC;
##  rejected 1 attribute: TAAGG;
##  still have 31 attributes left.
## After 49 iterations, +1.2 mins:
##  rejected 2 attributes: AAC, GAC;
##  still have 29 attributes left.
## After 51 iterations, +1.2 mins:
##  confirmed 1 attribute: ACCCC;
##  rejected 1 attribute: TTC;
##  still have 27 attributes left.
## After 57 iterations, +1.2 mins:
##  rejected 1 attribute: AATAC;
##  still have 26 attributes left.
## After 60 iterations, +1.2 mins:
##  rejected 1 attribute: GGC;
##  still have 25 attributes left.
## After 71 iterations, +1.3 mins:
##  rejected 1 attribute: TCCCC;
##  still have 24 attributes left.
## After 74 iterations, +1.3 mins:
##  confirmed 1 attribute: GGGGG;
##  rejected 1 attribute: ATT;
##  still have 22 attributes left.
## After 76 iterations, +1.3 mins:
##  rejected 1 attribute: TGAT;
##  still have 21 attributes left.
## After 18 iterations, +18 secs:
##  confirmed 17 attributes: AAACT, AACTG, AAGAA, ACT, ACTG and 12 more;
##  rejected 1275 attributes: AAAA, AAAAA, AAAAC, AAAAG, AAAAT and 1270 more;
##  still have 52 attributes left.
## After 22 iterations, +18 secs:
##  rejected 7 attributes: AAA, AAGG, CACTG, GAAA, GGAT and 2 more;
##  still have 45 attributes left.
## After 26 iterations, +18 secs:
##  confirmed 3 attributes: AGA, AGAAA, GAAG;
##  rejected 3 attributes: ATA, ATG, GCAT;
##  still have 39 attributes left.
## After 30 iterations, +18 secs:
##  confirmed 2 attributes: GGAA, GTGGA;
##  rejected 2 attributes: AGTC, CCA;
##  still have 35 attributes left.
## After 33 iterations, +18 secs:
##  rejected 4 attributes: AGAAC, ATGG, GCCA, TGAT;
##  still have 31 attributes left.
## After 36 iterations, +19 secs:
##  confirmed 2 attributes: GTGA, TGA;
##  rejected 1 attribute: ATCA;
##  still have 28 attributes left.
## After 39 iterations, +19 secs:
##  confirmed 1 attribute: AAGA;
##  rejected 1 attribute: AGGG;
##  still have 26 attributes left.
## After 49 iterations, +19 secs:
##  confirmed 2 attributes: GTAGA, TGCA;
##  still have 24 attributes left.
## After 51 iterations, +19 secs:
##  rejected 1 attribute: CCTGC;
##  still have 23 attributes left.
## After 57 iterations, +19 secs:
##  rejected 2 attributes: CGGA, GAG;
##  still have 21 attributes left.
## After 68 iterations, +20 secs:
##  confirmed 1 attribute: AACT;
##  still have 20 attributes left.
## After 71 iterations, +20 secs:
##  confirmed 1 attribute: TGG;
##  rejected 1 attribute: TAAG;
##  still have 18 attributes left.
## After 76 iterations, +20 secs:
##  confirmed 2 attributes: GAA, GACTG;
##  still have 16 attributes left.
## After 82 iterations, +21 secs:
##  confirmed 1 attribute: TGCGA;
##  still have 15 attributes left.
## After 87 iterations, +21 secs:
##  confirmed 1 attribute: TAGA;
##  still have 14 attributes left.
## After 18 iterations, +19 secs:
##  confirmed 17 attributes: AAGAA, AAT, AATT, AATTG, AGAAT and 12 more;
##  rejected 1267 attributes: AAAA, AAAAA, AAAAC, AAAAG, AAAAT and 1262 more;
##  still have 60 attributes left.
## After 22 iterations, +20 secs:
##  confirmed 2 attributes: AATGC, TTATA;
##  rejected 10 attributes: AGA, AGTG, ATGG, CGC, CGT and 5 more;
##  still have 48 attributes left.
## After 26 iterations, +20 secs:
##  confirmed 2 attributes: AAGA, AGAA;
##  rejected 1 attribute: TGGAT;
##  still have 45 attributes left.
## After 30 iterations, +20 secs:
##  confirmed 3 attributes: CAC, CTTAT, TGAA;
##  rejected 1 attribute: GCGA;
##  still have 41 attributes left.
## After 33 iterations, +20 secs:
##  rejected 3 attributes: GTC, TCCT, TGCT;
##  still have 38 attributes left.
## After 36 iterations, +20 secs:
##  confirmed 1 attribute: CCT;
##  rejected 2 attributes: AGTGA, GCTT;
##  still have 35 attributes left.
## After 39 iterations, +21 secs:
##  confirmed 2 attributes: AATG, ACGGA;
##  still have 33 attributes left.
## After 46 iterations, +21 secs:
##  confirmed 1 attribute: CGGAA;
##  still have 32 attributes left.
## After 49 iterations, +21 secs:
##  confirmed 1 attribute: CAGTG;
##  still have 31 attributes left.
## After 54 iterations, +21 secs:
##  confirmed 1 attribute: CCGAT;
##  still have 30 attributes left.
## After 60 iterations, +22 secs:
##  confirmed 1 attribute: GAAG;
##  still have 29 attributes left.
## After 66 iterations, +22 secs:
##  confirmed 1 attribute: AAA;
##  still have 28 attributes left.
## After 68 iterations, +22 secs:
##  confirmed 1 attribute: ATGC;
##  still have 27 attributes left.
## After 76 iterations, +23 secs:
##  confirmed 1 attribute: ATT;
##  still have 26 attributes left.
## After 82 iterations, +23 secs:
##  confirmed 1 attribute: TTG;
##  still have 25 attributes left.
## After 84 iterations, +23 secs:
##  confirmed 1 attribute: AACCG;
##  still have 24 attributes left.
## After 87 iterations, +23 secs:
##  confirmed 1 attribute: CACT;
##  rejected 1 attribute: GTCT;
##  still have 22 attributes left.
## After 90 iterations, +23 secs:
##  rejected 1 attribute: ACGG;
##  still have 21 attributes left.
## After 97 iterations, +24 secs:
##  confirmed 1 attribute: GTGA;
##  still have 20 attributes left.
## After 18 iterations, +13 secs:
##  confirmed 2 attributes: ATG, TATG;
##  rejected 1308 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1303 more;
##  still have 34 attributes left.
## After 22 iterations, +13 secs:
##  confirmed 4 attributes: ATGG, ATGGT, TATGG, TGG;
##  rejected 6 attributes: AAGG, ATGC, CGT, GCA, GCTAT and 1 more;
##  still have 24 attributes left.
## After 26 iterations, +13 secs:
##  rejected 1 attribute: CAG;
##  still have 23 attributes left.
## After 30 iterations, +13 secs:
##  confirmed 1 attribute: AATG;
##  rejected 2 attributes: AGGC, GCCCC;
##  still have 20 attributes left.
## After 36 iterations, +14 secs:
##  confirmed 1 attribute: ACCC;
##  rejected 1 attribute: TGGTG;
##  still have 18 attributes left.
## After 46 iterations, +14 secs:
##  rejected 1 attribute: CAGT;
##  still have 17 attributes left.
## After 49 iterations, +14 secs:
##  rejected 1 attribute: GATA;
##  still have 16 attributes left.
## After 54 iterations, +14 secs:
##  confirmed 1 attribute: CCC;
##  still have 15 attributes left.
## After 66 iterations, +14 secs:
##  confirmed 1 attribute: TGCT;
##  rejected 1 attribute: CCCC;
##  still have 13 attributes left.
## After 18 iterations, +14 secs:
##  confirmed 5 attributes: ATCC, ATCCC, GATC, GATCC, TCCC;
##  rejected 1314 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1309 more;
##  still have 25 attributes left.
## After 22 iterations, +14 secs:
##  rejected 4 attributes: CCAA, CCAC, CGCGC, TCCG;
##  still have 21 attributes left.
## After 33 iterations, +15 secs:
##  rejected 1 attribute: TCCCT;
##  still have 20 attributes left.
## After 36 iterations, +15 secs:
##  rejected 1 attribute: GCA;
##  still have 19 attributes left.
## After 49 iterations, +15 secs:
##  confirmed 1 attribute: CAC;
##  still have 18 attributes left.
## After 54 iterations, +15 secs:
##  confirmed 1 attribute: GGAT;
##  still have 17 attributes left.
## After 63 iterations, +15 secs:
##  confirmed 2 attributes: AGGGG, GGATC;
##  still have 15 attributes left.
## After 71 iterations, +16 secs:
##  confirmed 1 attribute: GGCA;
##  still have 14 attributes left.
## After 74 iterations, +16 secs:
##  confirmed 1 attribute: GGGCT;
##  still have 13 attributes left.
## After 76 iterations, +16 secs:
##  confirmed 1 attribute: CACT;
##  still have 12 attributes left.
## After 87 iterations, +16 secs:
##  confirmed 1 attribute: GGCTC;
##  still have 11 attributes left.
## After 95 iterations, +16 secs:
##  confirmed 1 attribute: TCCCG;
##  still have 10 attributes left.
## After 97 iterations, +16 secs:
##  confirmed 1 attribute: AGGG;
##  still have 9 attributes left.
## After 18 iterations, +1 mins:
##  confirmed 4 attributes: ACA, ACAC, CAC, TACA;
##  rejected 1268 attributes: AAAAA, AAAAG, AAAAT, AAAC, AAACA and 1263 more;
##  still have 72 attributes left.
## After 22 iterations, +1 mins:
##  confirmed 5 attributes: ACAT, CACG, CCC, TACAC, TAG;
##  rejected 16 attributes: AAAT, AAATG, AATC, ACAAG, ACACT and 11 more;
##  still have 51 attributes left.
## After 26 iterations, +1 mins:
##  confirmed 1 attribute: GAA;
##  rejected 5 attributes: AAAAC, ATC, CATTG, CCCAA, GCACA;
##  still have 45 attributes left.
## After 30 iterations, +1.1 mins:
##  confirmed 1 attribute: TGCA;
##  rejected 1 attribute: GAAC;
##  still have 43 attributes left.
## After 33 iterations, +1.1 mins:
##  confirmed 1 attribute: AAG;
##  still have 42 attributes left.
## After 36 iterations, +1.1 mins:
##  confirmed 3 attributes: CACAC, CACGC, TACAT;
##  rejected 3 attributes: AAT, TAACC, TCCC;
##  still have 36 attributes left.
## After 39 iterations, +1.1 mins:
##  confirmed 1 attribute: AAAA;
##  still have 35 attributes left.
## After 46 iterations, +1.1 mins:
##  rejected 1 attribute: CACAT;
##  still have 34 attributes left.
## After 49 iterations, +1.1 mins:
##  confirmed 1 attribute: CAAAA;
##  still have 33 attributes left.
## After 54 iterations, +1.1 mins:
##  confirmed 2 attributes: AACC, AGAA;
##  rejected 1 attribute: GAAA;
##  still have 30 attributes left.
## After 60 iterations, +1.1 mins:
##  confirmed 2 attributes: TAGC, TCC;
##  still have 28 attributes left.
## After 66 iterations, +1.1 mins:
##  confirmed 1 attribute: CCCA;
##  rejected 1 attribute: AAA;
##  still have 26 attributes left.
## After 71 iterations, +1.1 mins:
##  confirmed 2 attributes: ATACA, CCAA;
##  still have 24 attributes left.
## After 74 iterations, +1.1 mins:
##  confirmed 1 attribute: CGACC;
##  still have 23 attributes left.
## After 76 iterations, +1.2 mins:
##  confirmed 2 attributes: ACAAC, TAC;
##  still have 21 attributes left.
## After 82 iterations, +1.2 mins:
##  rejected 1 attribute: ACATG;
##  still have 20 attributes left.
## After 18 iterations, +23 secs:
##  confirmed 3 attributes: ACCG, GTACC, TACC;
##  rejected 1323 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1318 more;
##  still have 18 attributes left.
## After 22 iterations, +24 secs:
##  rejected 2 attributes: CCGC, GTAC;
##  still have 16 attributes left.
## After 36 iterations, +24 secs:
##  confirmed 1 attribute: TACCG;
##  still have 15 attributes left.
## After 51 iterations, +25 secs:
##  rejected 1 attribute: TAC;
##  still have 14 attributes left.
## After 54 iterations, +25 secs:
##  confirmed 2 attributes: ACCGT, AGTAC;
##  still have 12 attributes left.
## After 68 iterations, +25 secs:
##  confirmed 2 attributes: GTGTG, TGGTG;
##  still have 10 attributes left.
## After 71 iterations, +25 secs:
##  rejected 1 attribute: TTC;
##  still have 9 attributes left.
## After 79 iterations, +26 secs:
##  confirmed 1 attribute: GCTTA;
##  rejected 1 attribute: CCGT;
##  still have 7 attributes left.
## After 82 iterations, +26 secs:
##  confirmed 1 attribute: GGCTT;
##  still have 6 attributes left.
## After 18 iterations, +19 secs:
##  confirmed 28 attributes: AGTT, AGTTG, ATAGA, CGATG, CTGT and 23 more;
##  rejected 1235 attributes: AAA, AAAA, AAAAA, AAAAC, AAAAG and 1230 more;
##  still have 81 attributes left.
## After 22 iterations, +19 secs:
##  confirmed 3 attributes: ACGTG, CCGAT, CGA;
##  rejected 20 attributes: AAG, AATGG, ACGTT, ACTAT, AGC and 15 more;
##  still have 58 attributes left.
## After 26 iterations, +19 secs:
##  confirmed 3 attributes: AATGA, GCT, TGC;
##  rejected 5 attributes: ACGC, AGT, TCA, TCTAC, TGAAC;
##  still have 50 attributes left.
## After 30 iterations, +19 secs:
##  rejected 2 attributes: CCG, GACGT;
##  still have 48 attributes left.
## After 33 iterations, +20 secs:
##  rejected 3 attributes: AATG, GAT, GGTT;
##  still have 45 attributes left.
## After 36 iterations, +20 secs:
##  rejected 3 attributes: TCGC, TGAA, TGAT;
##  still have 42 attributes left.
## After 39 iterations, +20 secs:
##  confirmed 2 attributes: ATGAT, CGTG;
##  rejected 3 attributes: ACGT, GGT, GTCGA;
##  still have 37 attributes left.
## After 43 iterations, +20 secs:
##  rejected 2 attributes: AACGT, TTT;
##  still have 35 attributes left.
## After 49 iterations, +21 secs:
##  rejected 1 attribute: CAGG;
##  still have 34 attributes left.
## After 51 iterations, +21 secs:
##  rejected 1 attribute: CAG;
##  still have 33 attributes left.
## After 54 iterations, +21 secs:
##  rejected 1 attribute: TAGA;
##  still have 32 attributes left.
## After 60 iterations, +21 secs:
##  confirmed 1 attribute: GAATG;
##  rejected 1 attribute: GGAG;
##  still have 30 attributes left.
## After 63 iterations, +21 secs:
##  confirmed 1 attribute: CGTGA;
##  rejected 1 attribute: GCGC;
##  still have 28 attributes left.
## After 66 iterations, +22 secs:
##  rejected 1 attribute: GATGA;
##  still have 27 attributes left.
## After 68 iterations, +22 secs:
##  rejected 1 attribute: ATAG;
##  still have 26 attributes left.
## After 71 iterations, +22 secs:
##  rejected 1 attribute: ACG;
##  still have 25 attributes left.
## After 74 iterations, +22 secs:
##  rejected 2 attributes: ATC, GAACG;
##  still have 23 attributes left.
## After 76 iterations, +22 secs:
##  rejected 1 attribute: GAGT;
##  still have 22 attributes left.
## After 82 iterations, +23 secs:
##  rejected 1 attribute: TTGG;
##  still have 21 attributes left.
## After 97 iterations, +23 secs:
##  confirmed 1 attribute: CTG;
##  still have 20 attributes left.

Checking the model outputs

That took a while! Let’s double check the data:

model_df %>%
  head()
## # A tibble: 6 x 4
##   molecule data                   rf_orig_fit        boruta_fit  
##   <chr>    <list>                 <list>             <list>      
## 1 ATP      <tibble [132 × 1,345]> <S3: randomForest> <S3: Boruta>
## 2 Arginine <tibble [82 × 1,345]>  <S3: randomForest> <S3: Boruta>
## 3 Codeine  <tibble [94 × 1,345]>  <S3: randomForest> <S3: Boruta>
## 4 Dopamine <tibble [108 × 1,345]> <S3: randomForest> <S3: Boruta>
## 5 Elastase <tibble [104 × 1,345]> <S3: randomForest> <S3: Boruta>
## 6 FAD      <tibble [92 × 1,345]>  <S3: randomForest> <S3: Boruta>
model_df %>%
  select(rf_orig_fit) %>%
  slice(1) %>%
  pull()
## [[1]]
## 
## Call:
##  randomForest(x = X, y = y) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 36
## 
##         OOB estimate of  error rate: 8.33%
## Confusion matrix:
##    0  1 class.error
## 0 65  1  0.01515152
## 1 10 56  0.15151515
model_df %>%
  select(boruta_fit) %>%
  slice(1) %>%
  pull()
## [[1]]
## Boruta performed 99 iterations in 37.49718 secs.
##  35 attributes confirmed important: AAGA, AAGAC, AATT, AATTG, AGA
## and 30 more;
##  1274 attributes confirmed unimportant: AAA, AAAA, AAAAA, AAAAC,
## AAAAG and 1269 more;
##  35 tentative attributes left: AACAC, AACTG, ACACA, AGAC, AGAT and
## 30 more;

The output is in the format we expect. The random forest and Boruta models are stored in list columns, and the output looks like that of a random forest and a Boruta model.

The original random forest for ATP does quite well. It has 8.33% out-of-bag error, but misclassifies 10 sequences that contain aptamers are not containing aptamers, which is an issue. Even both Boruta we still might miss some important k-mers.

But Boruta picked out 35 features as important, and left 35 tentative. Reducing over a thousand features to 35 is quite an accomplishment.

Extract features selected by Boruta

Now that we have the initial models, let’s extract the features that Boruta selected and store those in another list column. Then we’ll append a reduced data.frame with only the features selected by Boruta (and is_aptamer because we can’t get rid of the response variable!).

Note that map2 is like map, but it takes in two arguments instead of one.

limit_to_selected_attributes <- function(dat, selected_attributes) {
  dat_reduced <- dat[, which(names(dat) %in% c("is_aptamer", selected_attributes))]
  return(dat_reduced)  ## Without the assignment and separate return I get an error with map
}

model_df <- model_df %>%
  mutate(selected_attributes = map(boruta_fit, ~getSelectedAttributes(.x, withTentative = FALSE)),
         data_reduced = map2(data, selected_attributes, limit_to_selected_attributes))

head(model_df)
## # A tibble: 6 x 6
##   molecule data     rf_orig_fit   boruta_fit selected_attrib… data_reduced 
##   <chr>    <list>   <list>        <list>     <list>           <list>       
## 1 ATP      <tibble… <S3: randomF… <S3: Boru… <chr [35]>       <tibble [132…
## 2 Arginine <tibble… <S3: randomF… <S3: Boru… <chr [18]>       <tibble [82 …
## 3 Codeine  <tibble… <S3: randomF… <S3: Boru… <chr [9]>        <tibble [94 …
## 4 Dopamine <tibble… <S3: randomF… <S3: Boru… <chr [17]>       <tibble [108…
## 5 Elastase <tibble… <S3: randomF… <S3: Boru… <chr [12]>       <tibble [104…
## 6 FAD      <tibble… <S3: randomF… <S3: Boru… <chr [17]>       <tibble [92 …

We see that ATP has 35 selected features, Arginine has 18, Coedine has 9, and the others have similar numbers of selected features. As mentioned above, 35 is still fairly large, but it’s much smaller than the 1344 we started with.

Reduced predictor set model fit

The next step is to build a random forest on each reduced predictor set. Then we’ll compare the out-of-bag error between the original predictor set and the reduced set.

set.seed(SEED)
model_df <- model_df %>%
  mutate(rf_reduced_fit = map(data_reduced, aptamer_random_forest))

Out-of-bag error comparison

Now we compare out-of-bag error for the full predictor set model and reduced predictor set model for each molecule. We’ll use map_dbl to generate the out-of-bag errors because an out-of-bag error is of type double. We’ll also calculate the number of features selected by each model for general interest.

Finally, we’ll make a smaller data.frame that only contains the information we need to evaluate the models. This is mostly for readability purposes.

extract_oob_err_rate <- function(rf_model) {
  ntrees <- nrow(rf_model$err.rate)
  return (rf_model$err.rate[ntrees, "OOB"])
}

model_df <- model_df %>%
  mutate(orig_oob = map_dbl(rf_orig_fit, extract_oob_err_rate),
         red_oob = map_dbl(rf_reduced_fit, extract_oob_err_rate),
         num_selected = map_int(selected_attributes, length))

results_df <- model_df %>%
  select(molecule, num_selected, orig_oob, red_oob)

results_df
## # A tibble: 23 x 4
##    molecule        num_selected orig_oob red_oob
##    <chr>                  <int>    <dbl>   <dbl>
##  1 ATP                       35   0.0833  0.0530
##  2 Arginine                  18   0.220   0.146 
##  3 Codeine                    9   0.234   0.170 
##  4 Dopamine                  17   0.296   0.176 
##  5 Elastase                  12   0.240   0.144 
##  6 FAD                       17   0.130   0.0761
##  7 HIV1-Tar                  14   0.25    0.163 
##  8 Hematoporphyrin            4   0.309   0.206 
##  9 IDI4                      21   0.173   0.112 
## 10 Isoleucine                88   0.0396  0.0432
## # … with 13 more rows

Interesting! It appears that fitting to the Boruta-selected features has resulted in lower out-of-bag error for many molecules. Let’s compare.

Once again, we care about both decrease in out-of-bag error and reducing the number of features to a more managable number. We therefore want to check the following:

  • What is the median and the IQR of the difference in out-of-bag error across both model types? We look at differences, instead of each model separately, because the difference between “significant” and “not significant” is not itself statistically significant.

  • What is the median number of features within the reduced models? I’m not looking at the percent decrease because I suspect it will be very high for all molecules. Additionally, if we were to put selected subsequences through standard lab tests, a practical number of subsequences to examine must be a relatively low number, independently of the number of sequences we initially considered.

results_df %>% 
  summarize(med_orig = median(orig_oob),
            med_reduced = median(red_oob),
            med_diff = median(orig_oob - red_oob),
            iqr_orig = IQR(orig_oob),
            iqr_reduced = IQR(red_oob),
            med_selected = median(num_selected),
            iqr_selected = IQR(num_selected)) %>%
  print()
## # A tibble: 1 x 7
##   med_orig med_reduced med_diff iqr_orig iqr_reduced med_selected
##      <dbl>       <dbl>    <dbl>    <dbl>       <dbl>        <int>
## 1    0.189       0.133   0.0612    0.160      0.0847           18
## # … with 1 more variable: iqr_selected <dbl>
set.seed(SEED)
wilcox.test(results_df$orig_oob + rnorm(23, 0, 1e-5),  ## Jitter a bit to avoid ties
            results_df$red_oob + rnorm(23, 0, 1e-5), 
            paired = TRUE, conf.int = TRUE, exact = TRUE)
## 
##  Wilcoxon signed rank test
## 
## data:  results_df$orig_oob + rnorm(23, 0, 1e-05) and results_df$red_oob + rnorm(23, 0, 1e-05)
## V = 263, p-value = 2.098e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  0.03591523 0.08653439
## sample estimates:
## (pseudo)median 
##     0.06064958
  • We have a median out-of-bag error difference of 6% with an IQR of 16%. The 6% drop is statistically significant according to the result of the paired Wilcoxon signed rank test.

  • The median number of features selected is 18 with an IQR Of 19.5. That sounds like a reasonable number of molecules to test in a lab setting, especially compared to the 1344 we started with!

Both of these are good signs.

OOB Error Plots

Let’s visualize the difference in OOB-error distributions.

Since we’re interested in the difference in out-of-bag error across both models, we’ll make two plots: one for differences by molecule, and one comparing the distributions of out-of-bag error for both feature sets. We would like to see a drop in out-of-bag error, or at least error that is similar in both models.

Note that we combine plots via the + operator from the patchwork package, which allows arbitrary plot layouts. Additionally, we use tidyr::gather to combine the two out-of-bag error columns for boxplots.

diff_oob_plot <- results_df %>%
  mutate(error_diff = orig_oob - red_oob,
         error_dropped = case_when(error_diff > 0 ~ "Dropped",error_diff < 0 ~ "Increased", TRUE ~ "No Change"),
         molecule = fct_reorder(molecule, error_diff)) %>%
  ggplot(aes(x = error_diff, y = molecule, group = molecule)) +
  geom_point(aes(color = error_dropped), size = 2, show.legend = FALSE) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(labels = scales::percent) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +  ## Remove vertical gridlines
  ggtitle("Differences in out-of-bag error for random forest k-mer models",
          subtitle = 'Positive means "the model on the reduced feature set has lower out-of-bag error"') +
  xlab("(Full features model OOB error) - (Reduced features model OOB error)") +
  ylab("Molecule")

dists_oob_plot <- results_df %>%
  gather(oob_type, oob_val, orig_oob, red_oob) %>%
  mutate(oob_type = if_else(oob_type == "orig_oob", "Original", "Reduced")) %>%
  ggplot(aes(x = oob_type, y = oob_val)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::percent) +
  expand_limits(y = 0) +
  xlab("Feature set") +
  ylab("Out-of-bag error") +
  ggtitle("Distribution of\nOut-of-Bag Error\nby Feature Set")

diff_oob_plot +
  dists_oob_plot

Of the 23 molecules, 19 (82.6%) had decreasing error on the model built on the reduced feature set, and four had increased error. Additionally, the distribution of out-of-bag error is lower overall for the the reduced feature set. This is an indication that in many cases, Boruta does a good job selecting relevant features

Next Steps

In a non-demo situation, biology researchers would take the k-mers selected for each model and would use more definitive, rigorous lab techniques to determine if they are indicative of an aptamer sequence. The authors of the Boruta paper state that the sequences identified by Boruta are consistent with experimental evidence. “…MSA analysis and NMR data suggest that GAAGA motif is responsible for binding adenine in adenine related aptamers. Boruta has found this motif as important in all classes of aptamers binding to three adenine-containing targets that were analysed (adenosine triphosphate, flavin adenine dinucleotide and S-adenosyl methionine).”

If I had more time, I would also have compared the number of features selected to those selected in more typically-seen algorithms, such as LASSO, elastic net, and MARS regression.

Takeaways

I hope that you learned at some of the following from this tutorial:

  • The complexities of small-\(n\) large-\(p\) problems

  • How to process data that arrived in a non-standard format

  • The difficulty of standard feature selection

  • The power of R’s tidyverse programming environment for data processing

Feel free to contact me on Twitter or LinkedIn if you have any feedback, corrections, or general commentary. Thank you!