
Using bit.names and bits.to.features arguments to makeFeatSelWrapper (mlr) to perform wrapper selection over groups of features

I would like to perform wrapper-based feature selection on the iris data set using the mlr package, but I only want to consider groups of features: those associated with Petal and those associated with Sepal. So instead of evaluating combinations of the 4 individual features, the wrapper routine would evaluate combinations of the two feature groups.

The mlr documentation states this can be done with two arguments, bit.names and bits.to.features:

bit.names [character] Names of bits encoding the solutions. Also defines the total number of bits in the encoding. Per default these are the feature names of the task.

bits.to.features [function(x, task)] Function which transforms an integer-0-1 vector into a character vector of selected features. Per default a value of 1 in the ith bit selects the ith feature to be in the candidate solution.

I could not find any examples of usage of these two arguments in mlr tutorials or elsewhere.

I will use the example provided in ?mlr::selectFeatures.

First, operating on all the features:

library(mlr)
rdesc <- makeResampleDesc("Holdout")
ctrl <- makeFeatSelControlSequential(method = "sfs",
                                    maxit = NA)
res <- selectFeatures("classif.rpart",
                     iris.task,
                     rdesc,
                     control = ctrl)
analyzeFeatSelResult(res)

This works as expected.

In order to run over groups of features I design a 0/1 matrix to map features to groups (I am not sure if this is the way to go, it just seemed logical):

mati <- rbind(
  c(0,0,1,1),
  c(1,1,0,0))

rownames(mati) <- c("Petal", "Sepal")
colnames(mati) <- getTaskFeatureNames(iris.task)

The matrix looks like:

      Sepal.Length Sepal.Width Petal.Length Petal.Width
Petal            0           0            1           1
Sepal            1           1            0           0

And now I run:

res <- selectFeatures("classif.rpart",
                     iris.task,
                     rdesc,
                     control = ctrl,
                     bit.names = c("Petal", "Sepal"),
                     bits.to.features = function(x = mati, task) mlr:::binaryToFeatures(x, getTaskFeatureNames(task)))

analyzeFeatSelResult(res)
#output
Features         : 1
Performance      : mmce.test.mean=0.0200000
Sepal

Path to optimum:
- Features:    0  Init   :                       Perf = 0.66  Diff: NA  *
- Features:    1  Add    : Sepal                 Perf = 0.02  Diff: 0.64  *

Stopped, because no improving feature was found.

This appears to do what I need, but I am not sure I defined the bits.to.features argument correctly.

But when I try to use the same approach in a wrapper:

outer <- makeResampleDesc("CV", iters = 2L)
inner <- makeResampleDesc("Holdout")
ctrl <- makeFeatSelControlSequential(method = "sfs",
                                     maxit = NA)


lrn <- makeFeatSelWrapper("classif.rpart",
                          resampling = inner,
                          control = ctrl,
                          bit.names = c("Petal", "Sepal"),
                          bits.to.features = function(x = mati, task) mlr:::binaryToFeatures(x, getTaskFeatureNames(task)))


r <- resample(lrn, iris.task, outer, extract = getFeatSelResult)

I receive an error:

Resampling: cross-validation
Measures:             mmce      
[FeatSel] Started selecting features for learner 'classif.rpart'
With control class: FeatSelControlSequential
Imputation value: 1
[FeatSel-x] 1: 00 (0 bits)
[FeatSel-y] 1: mmce.test.mean=0.7200000; time: 0.0 min
[FeatSel-x] 2: 10 (1 bits)
[FeatSel-y] 2: mmce.test.mean=0.0800000; time: 0.0 min
[FeatSel-x] 2: 01 (1 bits)
[FeatSel-y] 2: mmce.test.mean=0.0000000; time: 0.0 min
[FeatSel-x] 3: 11 (2 bits)
[FeatSel-y] 3: mmce.test.mean=0.0800000; time: 0.0 min
[FeatSel] Result: Sepal (1 bits)
Error in `[.data.frame`(df, , j, drop = drop) : 
  undefined columns selected

What am I doing wrong, and what is the correct usage of the bit.names and bits.to.features arguments?

Thanks

EDIT: I posted an issue on mlr github: https://github.com/mlr-org/mlr/issues/2468

missuse asked Oct 17 '22 11:10



1 Answer

I think you found two bugs. The first is that your code runs at all; the second is that this approach fails inside a feature selection wrapper (nested resampling).

Bug 1: Your code should not run

First of all, mati has no effect. Writing x = mati only defines a default value for x, and mlr passes x explicitly on every internal call of bits.to.features, so the default is never used.
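The point about default arguments can be checked in plain R without mlr. This is a minimal sketch: show_x is a hypothetical stand-in with the same signature as the function in the question, and the second call imitates a caller that always supplies x.

```r
mati <- rbind(Petal = c(0, 0, 1, 1), Sepal = c(1, 1, 0, 0))

# Hypothetical stand-in with the same shape as the function in the question:
# `mati` is only the default value of x.
show_x <- function(x = mati, task = NULL) x

default_used <- show_x()         # no x supplied: the default (mati) is used
bits_used    <- show_x(c(1, 0))  # x supplied by the caller: mati is ignored
```

Only the first call ever sees the matrix; any caller that passes a bit vector replaces it.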

When you defined the bit.names "Petal" and "Sepal", you basically just told mlr to use two bits, so the feature selection works with the vectors 00, 01, 10 and 11. Unfortunately, R silently recycles these vectors to length 4, so 10 becomes 1010:

mlr:::binaryToFeatures(c(1,0), getTaskFeatureNames(iris.task))
# [1] "Sepal.Length" "Petal.Length"

That is the first bug: mlr should not allow the vector to be recycled here.
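The recycling itself is ordinary R behaviour for logical indexing. The sketch below assumes that binaryToFeatures boils down to logical indexing of the feature names (an assumption about mlr internals, but consistent with the output above). strict_bits_to_features is a hypothetical guard, not part of mlr, showing one way such recycling could be refused.

```r
feature_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
bits <- c(1, 0)

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE when indexing 4 elements
recycled <- feature_names[as.logical(bits)]
# recycled is c("Sepal.Length", "Petal.Length")

# Hypothetical guard (not mlr code): fail loudly instead of recycling.
strict_bits_to_features <- function(x, all_names) {
  if (length(x) != length(all_names)) {
    stop("bit vector length must match the number of feature names")
  }
  all_names[as.logical(x)]
}
```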

To make the code run as intended, you could define the function bits.to.features like this:

bitnames = c("Sepal", "Petal")
btf = function(x, task) {
  # one entry per bit, in the same order as bitnames
  sets = list(
    c("Sepal.Length", "Sepal.Width"),
    c("Petal.Length", "Petal.Width")
  )
  # keep the groups whose bit is 1; unlist() of an empty list is NULL
  res = unlist(sets[as.logical(x)])
  if (is.null(res)) {
    return(character(0L))
  } else {
    return(res)
  }
}

res <- selectFeatures("classif.rpart", iris.task, rdesc, 
  control = ctrl, bits.to.features = btf, bit.names = bitnames)

Explanation of btf

Quoting the help page of selectFeatures:

[function(x, task)] Function which transforms an integer-0-1 vector into a character vector of selected features. Per default a value of 1 in the ith bit selects the ith feature to be in the candidate solution.

So x is a vector containing 0s and 1s (e.g. c(0,0,1,0)). With the default function, that vector would select the third feature (e.g. "Petal.Length" for iris). The vector x will always have the same length as the defined bit.names. The resulting character vector, however, can be of any length; it just has to contain valid feature names for the task.
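As an illustration of that contract (x has one entry per bit name, the result can have any length), here is a sketch that drives the mapping from a 0/1 group matrix like the one in the question. The names groups and btf_matrix are hypothetical, and the feature names are written out so the block runs without mlr.

```r
feature_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# One row per bit (group), one column per feature.
groups <- rbind(
  Sepal = c(1, 1, 0, 0),
  Petal = c(0, 0, 1, 1)
)
colnames(groups) <- feature_names

# Hypothetical bits.to.features function; task is accepted but not used here.
btf_matrix <- function(x, task) {
  stopifnot(length(x) == nrow(groups))  # one bit per group, no recycling
  selected <- colSums(groups[as.logical(x), , drop = FALSE]) > 0
  colnames(groups)[selected]
}

btf_matrix(c(1, 0))  # "Sepal.Length" "Sepal.Width"
btf_matrix(c(0, 0))  # character(0)
```

With this layout, rownames(groups) would serve as bit.names and btf_matrix as bits.to.features in the selectFeatures call.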

In the example I hard-coded the feature names into the function btf. This is bad practice if you want to apply the function to many different tasks. That is why mlr passes the task object to the function: through getTaskFeatureNames(task) you can generate the feature names programmatically instead of hard-coding them.
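One possible programmatic version is sketched below. It assumes that a feature belongs to the group named by the part of its name before the first dot, which holds for iris but not for tasks in general. select_groups and btf_generic are hypothetical names; only getTaskFeatureNames comes from mlr.

```r
# Hypothetical helper: map a bit vector over groups to feature names, where a
# feature belongs to the group named by the part before its first dot.
select_groups <- function(x, feature_names, group_names) {
  stopifnot(length(x) == length(group_names))        # refuse silent recycling
  feature_group <- sub("\\..*$", "", feature_names)  # "Sepal.Length" -> "Sepal"
  feature_names[feature_group %in% group_names[as.logical(x)]]
}

bitnames <- c("Sepal", "Petal")

# Candidate for bits.to.features; calling it requires library(mlr).
btf_generic <- function(x, task) {
  select_groups(x, getTaskFeatureNames(task), bitnames)
}

# The helper can be checked without mlr:
iris_features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
select_groups(c(0, 1), iris_features, bitnames)  # "Petal.Length" "Petal.Width"
select_groups(c(0, 0), iris_features, bitnames)  # character(0)
```

Keeping the grouping logic in a helper that takes plain character vectors makes it easy to test separately from the task object.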

Bug 2: The bit.names have to be feature names

The feature selection returns the bit names as its result. mlr then tries to select columns with these names in the data set, but they do not exist there because (in your case) the bit names are not feature names. That is where the "undefined columns selected" error comes from. This bug is now resolved in the GitHub version of mlr.
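The failure can be reproduced in base R. This sketch only imitates the column selection and is not the actual mlr code path: selecting by a group name fails, while selecting by the real feature names behind that group works.

```r
# "Sepal" is a bit name, not a column of iris, so the selection fails.
msg <- tryCatch(
  iris[, "Sepal", drop = FALSE],
  error = function(e) conditionMessage(e)
)
# msg holds the error text ("undefined columns selected" in an English locale)

# Translating the winning bit back to real feature names first avoids the error.
cols <- c("Sepal.Length", "Sepal.Width")
sub_iris <- iris[, cols, drop = FALSE]
```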

jakob-r answered Oct 20 '22 15:10
