I would like to perform feature selection with a wrapper method on the iris data set using the mlr package. However, I would like to look only at groups of features associated with Petal and/or Sepal. So instead of looking at 4 features in different combinations, the wrapper routine would look at two groups of features in different combinations.
The mlr documentation states this can be performed using two arguments, bit.names and bits.to.features:
bit.names [character] Names of bits encoding the solutions. Also defines the total number of bits in the encoding. Per default these are the feature names of the task.
bits.to.features [function(x, task)] Function which transforms an integer-0-1 vector into a character vector of selected features. Per default a value of 1 in the ith bit selects the ith feature to be in the candidate solution.
I could not find any examples of usage of these two arguments in mlr tutorials or elsewhere.
I will use the example provided in ?mlr::selectFeatures.
First operating on all the features
library(mlr)
rdesc <- makeResampleDesc("Holdout")
ctrl <- makeFeatSelControlSequential(method = "sfs",
                                     maxit = NA)
res <- selectFeatures("classif.rpart",
                      iris.task,
                      rdesc,
                      control = ctrl)
analyzeFeatSelResult(res)
This works as expected
In order to run over groups of features I design a 0/1 matrix to map features to groups (I am not sure if this is the way to go, it just seemed logical):
mati <- rbind(
c(0,0,1,1),
c(1,1,0,0))
rownames(mati) <- c("Petal", "Sepal")
colnames(mati) <- getTaskFeatureNames(iris.task)
the matrix looks like:
      Sepal.Length Sepal.Width Petal.Length Petal.Width
Petal            0           0            1           1
Sepal            1           1            0           0
and now I run:
res <- selectFeatures("classif.rpart",
iris.task,
rdesc,
control = ctrl,
bit.names = c("Petal", "Sepal"),
bits.to.features = function(x = mati, task) mlr:::binaryToFeatures(x, getTaskFeatureNames(task)))
analyzeFeatSelResult(res)
#output
Features : 1
Performance : mmce.test.mean=0.0200000
Sepal
Path to optimum:
- Features: 0 Init : Perf = 0.66 Diff: NA *
- Features: 1 Add : Sepal Perf = 0.02 Diff: 0.64 *
Stopped, because no improving feature was found.
This appears to do what I need, but I am not sure I defined the bits.to.features argument correctly.
But when I try to use the same approach in a wrapper:
outer <- makeResampleDesc("CV", iters = 2L)
inner <- makeResampleDesc("Holdout")
ctrl <- makeFeatSelControlSequential(method = "sfs",
maxit = NA)
lrn <- makeFeatSelWrapper("classif.rpart",
resampling = inner,
control = ctrl,
bit.names = c("Petal", "Sepal"),
bits.to.features = function(x = mati, task) mlr:::binaryToFeatures(x, getTaskFeatureNames(task)))
r <- resample(lrn, iris.task, outer, extract = getFeatSelResult)
I receive an error:
Resampling: cross-validation
Measures: mmce
[FeatSel] Started selecting features for learner 'classif.rpart'
With control class: FeatSelControlSequential
Imputation value: 1
[FeatSel-x] 1: 00 (0 bits)
[FeatSel-y] 1: mmce.test.mean=0.7200000; time: 0.0 min
[FeatSel-x] 2: 10 (1 bits)
[FeatSel-y] 2: mmce.test.mean=0.0800000; time: 0.0 min
[FeatSel-x] 2: 01 (1 bits)
[FeatSel-y] 2: mmce.test.mean=0.0000000; time: 0.0 min
[FeatSel-x] 3: 11 (2 bits)
[FeatSel-y] 3: mmce.test.mean=0.0800000; time: 0.0 min
[FeatSel] Result: Sepal (1 bits)
Error in `[.data.frame`(df, , j, drop = drop) :
undefined columns selected
What am I doing wrong, and what is the correct usage of the bit.names and bits.to.features arguments?
Thanks
EDIT: I posted an issue on mlr github: https://github.com/mlr-org/mlr/issues/2468
I guess you found two bugs. The first is that your code even runs and the second one is that this won't work with nested resampling.
First of all, mati does not have any effect, because every internal call of bits.to.features supplies its own x, which overrides it. After all, you only defined a default argument.
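The point about default arguments can be seen in a small base-R sketch. Here f is a hypothetical stand-in for the function passed as bits.to.features, not anything from mlr:

```r
# Hypothetical stand-in for a function passed as bits.to.features.
f <- function(x = "default value", task = NULL) x

no.arg   <- f()          # nothing supplied, so the default is used
supplied <- f(c(1, 0))   # the caller supplies x, so the default is ignored

print(no.arg)
print(supplied)
```

Since mlr always passes the current bit vector as x, a default such as x = mati is never evaluated.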
When you defined the bit.names "Petal" and "Sepal", you basically just told mlr to use two bits.
So the feature selection will work with the vectors 00, 01, 10, 11.
Unfortunately R now automatically recycles these vectors to the length of 4 so 10 becomes 1010:
mlr:::binaryToFeatures(c(1,0), getTaskFeatureNames(iris.task))
# [1] "Sepal.Length" "Petal.Length"
There we have our first bug: mlr should avoid the vector recycling here.
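The recycling itself is plain R behaviour and can be reproduced without mlr. The sketch below hard-codes the iris feature names, and strictBitsToFeatures is a hypothetical stricter mapping shown only to illustrate what a length check could look like; it is not mlr's implementation:

```r
feature.names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# A logical index shorter than the vector is recycled:
# c(TRUE, FALSE) is treated as c(TRUE, FALSE, TRUE, FALSE).
recycled <- feature.names[as.logical(c(1, 0))]
print(recycled)
# [1] "Sepal.Length" "Petal.Length"

# Hypothetical stricter mapping that refuses a length mismatch.
strictBitsToFeatures <- function(x, all.vars) {
  stopifnot(length(x) == length(all.vars))
  all.vars[as.logical(x)]
}

# With four bits the mapping behaves as documented.
strict <- strictBitsToFeatures(c(0, 0, 1, 0), feature.names)
print(strict)
# [1] "Petal.Length"
```

Calling strictBitsToFeatures(c(1, 0), feature.names) would stop with an error instead of silently selecting every second feature.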
To make the code run as intended, you could define the function bits.to.features like this:
bitnames = c("Sepal", "Petal")
btf = function(x, task) {
  sets = list(
    c("Sepal.Length", "Sepal.Width"),
    c("Petal.Length", "Petal.Width")
  )
  res = unlist(sets[as.logical(x)])
  if (is.null(res)) {
    return(character(0L))
  } else {
    return(res)
  }
}
res <- selectFeatures("classif.rpart", iris.task, rdesc,
control = ctrl, bits.to.features = btf, bit.names = bitnames)
bits.to.features
Quoting the help page of selectFeatures:
[function(x, task)]
Function which transforms an integer-0-1 vector into a character vector of selected features. Per default a value of 1 in the ith bit selects the ith feature to be in the candidate solution.
So x is a vector containing 0s and 1s (e.g. c(0,0,1,0)).
If you don't change that function, it would return the name of the third feature (e.g. "Petal.Length" for iris). The vector x will always be of the same length as the defined bit.names. The resulting character vector, however, can be of any length. It just has to contain valid feature names for the task.
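One way to convince yourself that a custom mapping meets these requirements is to try it on every possible bit vector. This is only a sketch in base R: checkBitsToFeatures is a hypothetical helper, the mapping is a copy of the hard-coded one for iris, and task is passed as NULL because that mapping never uses it:

```r
# Copy of a hard-coded group mapping for iris; `task` is unused here.
btf <- function(x, task) {
  sets <- list(
    c("Sepal.Length", "Sepal.Width"),
    c("Petal.Length", "Petal.Width")
  )
  res <- unlist(sets[as.logical(x)])
  if (is.null(res)) character(0L) else res
}

# Hypothetical checker: call `fun` on every 0/1 vector of length n.bits and
# make sure it always returns a character vector of known feature names.
checkBitsToFeatures <- function(fun, n.bits, feature.names, task = NULL) {
  grid <- as.matrix(expand.grid(rep(list(0:1), n.bits)))
  for (i in seq_len(nrow(grid))) {
    feats <- fun(grid[i, ], task)
    stopifnot(is.character(feats), all(feats %in% feature.names))
  }
  invisible(TRUE)
}

ok <- checkBitsToFeatures(btf, 2L, names(iris)[1:4])
```

Note that this only checks that the names are valid. It would not detect the recycling problem, because the recycled result also consists of valid feature names.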
In the example I hardcoded the feature names into the function btf. This is bad practice if you want to apply the function to many different tasks.
Therefore mlr gives you access to the task object, and thus to the feature names through getTaskFeatureNames(task), so you can generate the feature names programmatically instead of hard-coding them.
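A minimal sketch of such a programmatic mapping, assuming that each group can be identified by a name prefix (true for iris, not necessarily for other tasks). groupBitsToFeatures is a hypothetical helper; only getTaskFeatureNames comes from mlr:

```r
# Hypothetical helper: map a 0/1 vector over group prefixes to the
# feature names that start with one of the selected prefixes.
groupBitsToFeatures <- function(x, prefixes, feature.names) {
  stopifnot(length(x) == length(prefixes))
  selected <- prefixes[as.logical(x)]
  hits <- vapply(feature.names,
                 function(f) any(startsWith(f, selected)),
                 logical(1))
  unname(feature.names[hits])
}

# Wrapper with the (x, task) signature that bits.to.features expects.
# mlr is only needed when this function is actually called.
btf <- function(x, task) {
  groupBitsToFeatures(x, c("Sepal", "Petal"), mlr::getTaskFeatureNames(task))
}

# Trying the helper directly on the iris column names (base R only).
iris.features <- names(iris)[1:4]
petal.only <- groupBitsToFeatures(c(0, 1), c("Sepal", "Petal"), iris.features)
print(petal.only)
# [1] "Petal.Length" "Petal.Width"
```

The features are returned in the order in which they appear in the task, and an all-zero vector yields character(0), so no special case for the empty selection is needed.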
bit.names have to be feature names
The feature selection returns the bit names as its result. mlr then tries to select these bit names as columns in the dataset, but they are not present there because they are unrelated to the feature names (in your case).
This bug is now resolved in the GitHub version of mlr.