
R: how to use random forests to predict binary outcome using string variables?

Consider the following dataframe

outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')

df <- data.frame(outcome, string)


> df
  outcome        string
1       1  I love pasta
2       0   hello world
3       0       1+1 = 2
4       1 pasta madness
5       1 pizza madness

Here I would like to use random forests to understand which words in the sentences contained in the string variable are strong predictors of the outcome variable.

Is there a (simple) way to do that in R?

ℕʘʘḆḽḘ asked Oct 17 '25 06:10


1 Answer

What you want are the variable importance measures produced by randomForest, which you obtain with the importance function. Here is some code that should get you started:

outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')

Step 1: Make outcome a factor so that randomForest performs classification rather than regression, and keep string as a character vector.

df <- data.frame(outcome=factor(outcome,levels=c(0,1)),string, stringsAsFactors=FALSE)

Step 2: Tokenize the string column into words. Here, I'm using dplyr and tidyr just for convenience. The key is to end up with one word token per row, which then serves as your predictor variable.

library(dplyr)
library(tidyr)
# Split each sentence on spaces, then expand to one row per word
inp <- df %>%
  mutate(string = strsplit(string, split = " ")) %>%
  unnest(string)
##   outcome  string
##1        1       I
##2        1    love
##3        1   pasta
##4        0   hello
##5        0   world
##6        0     1+1
##7        0       =
##8        0       2
##9        1   pasta
##10       1 madness
##11       1   pizza
##12       1 madness
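
If you would rather not depend on dplyr and tidyr, the same one-token-per-row table can be built in base R. This is only a minimal sketch and not part of the original approach; the names tokens and inp_base are made up for illustration.

```r
# Same toy data as above
outcome <- c(1, 0, 0, 1, 1)
string <- c("I love pasta", "hello world", "1+1 = 2", "pasta madness", "pizza madness")
df <- data.frame(outcome = factor(outcome, levels = c(0, 1)), string,
                 stringsAsFactors = FALSE)

# Split each sentence on single spaces; fixed = TRUE treats " " literally
tokens <- strsplit(df$string, split = " ", fixed = TRUE)

# Repeat each outcome once per word in its sentence and flatten the word list
inp_base <- data.frame(
  outcome = rep(df$outcome, lengths(tokens)),
  string = unlist(tokens),
  stringsAsFactors = FALSE
)
```

inp_base should then be usable in place of inp in Step 3.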

Step 3: Construct a model matrix and feed it to randomForest:

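library(randomForest)
# Dummy-code the word tokens: 0/1 indicator columns for the words, plus an intercept column
mm <- model.matrix(outcome ~ string, inp)
# importance=TRUE also computes the permutation-based accuracy measures
rf <- randomForest(mm, inp$outcome, importance = TRUE)
imp <- importance(rf)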
##                     0        1 MeanDecreaseAccuracy MeanDecreaseGini
##(Intercept)   0.000000 0.000000             0.000000        0.0000000
##string1+1     0.000000 0.000000             0.000000        0.3802400
##string2       0.000000 0.000000             0.000000        0.4514319
##stringhello   0.000000 0.000000             0.000000        0.4152465
##stringI       0.000000 0.000000             0.000000        0.2947108
##stringlove    0.000000 0.000000             0.000000        0.2944955
##stringmadness 4.811252 5.449195             5.610477        0.5733814
##stringpasta   4.759957 5.281133             5.368852        0.6651675
##stringpizza   0.000000 0.000000             0.000000        0.3025495
##stringworld   0.000000 0.000000             0.000000        0.4183821

As you can see, only madness and pasta have a non-zero MeanDecreaseAccuracy, so they are the key words for predicting the outcome.
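
With more than a handful of words, scanning the table by eye gets tedious, so you may want to sort it. The sketch below ranks rows by MeanDecreaseAccuracy. To keep it runnable on its own, imp_shown is a hand-copied subset of the output above (a stand-in, not a fresh fit); on a real model you would sort importance(rf) instead, and the numbers will differ from run to run because the forest is random.

```r
# Stand-in for importance(rf): four rows copied from the output shown above
imp_shown <- matrix(
  c(5.368852, 0.6651675,
    5.610477, 0.5733814,
    0.000000, 0.3025495,
    0.000000, 0.4152465),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("stringpasta", "stringmadness", "stringpizza", "stringhello"),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")
  )
)

# Sort rows from most to least important; drop = FALSE keeps the matrix shape
ranked <- imp_shown[order(imp_shown[, "MeanDecreaseAccuracy"], decreasing = TRUE), ,
                    drop = FALSE]

# Keep only words with a positive score and strip the "string" prefix
# that model.matrix prepends to each level name
top_words <- sub("^string", "", rownames(ranked)[ranked[, "MeanDecreaseAccuracy"] > 0])
```

For a quick visual check, randomForest also provides varImpPlot(rf), which plots the same measures.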

Please note: randomForest has many parameters that become relevant when you tackle the real problem at scale. This is by no means a complete solution to your problem; it is only meant to illustrate how the importance function can answer your question. You may want to ask on Cross Validated about the details of using randomForest.
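
One design choice you might revisit on real data: the code above treats every word as its own row, so each observation carries exactly one predictor. A common alternative, sketched here as an assumption rather than as the method used above, is one row per sentence with a 0/1 column per word. The names vocab, dtm, rf_doc and imp_doc are made up for illustration, and the seed and ntree value are arbitrary.

```r
outcome <- factor(c(1, 0, 0, 1, 1), levels = c(0, 1))
string <- c("I love pasta", "hello world", "1+1 = 2", "pasta madness", "pizza madness")

tokens <- strsplit(string, split = " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per sentence, one column per word: 1 if the word occurs, else 0
dtm <- t(vapply(tokens,
                function(words) as.numeric(vocab %in% words),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Fit only if the randomForest package is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)  # arbitrary seed, only so that repeated runs agree
  rf_doc <- randomForest::randomForest(x = dtm, y = outcome,
                                       ntree = 500, importance = TRUE)
  imp_doc <- randomForest::importance(rf_doc)
}
```

With only five sentences the resulting importances are not meaningful; the point is the shape of the input matrix.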

aichao answered Oct 19 '25 23:10


