
What is the way to represent factor variables in scikit-learn while using Random Forests?

I am solving a classification problem using Random Forests, and I have decided to use the Python library scikit-learn. I am new to both the Random Forest algorithm and this tool. My data contains many factor (categorical) variables. From what I have read, it is not right to assign numerical values to factor variables, as one might do in linear regression, because the model will treat them as continuous variables and give wrong results. However, I could not find anything about how to handle factor variables in scikit-learn. What are my options, or is there documentation that covers this?

Prince Kumar asked Dec 12 '22 15:12



1 Answer

If you are using a pandas DataFrame, you can easily use the get_dummies function to accomplish this. Here's an example:

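import pandas as pd

# Toy data with two categorical columns
my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])
# One 0/1 indicator column per level of var1; dtype=int keeps the 0/1 output
# on newer pandas versions, where the default is boolean
dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_', dtype=int)
print(dummy_ranks)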

   var1__a  var1__b  var1__c  var1__d
0        1        0        0        0
1        0        1        0        0
2        0        0        1        0
3        0        0        0        1
4        1        0        0        0

[5 rows x 4 columns]
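
To connect this back to Random Forests, here is a minimal sketch of passing the dummy-encoded frame to scikit-learn's RandomForestClassifier. The labels in y, the new rows, and the n_estimators and random_state values are made up for illustration and are not part of the original example. It also assumes that new data is aligned to the training columns with reindex, which is one common approach rather than the only one.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same toy frame as above; the labels in y are invented for illustration.
my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])
y = [0, 1, 0, 1, 0]

# Encode both categorical columns at once; each level becomes a 0/1 column.
X = pd.get_dummies(df, columns=['var1', 'var2'], dtype=int)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# Hypothetical new rows: encode them the same way, then align them to the
# training columns so that levels missing from the new data become all-zero.
new_rows = pd.DataFrame([['a', 'c'], ['d', 'b']], columns=['var1', 'var2'])
X_new = pd.get_dummies(new_rows, columns=['var1', 'var2'], dtype=int)
X_new = X_new.reindex(columns=X.columns, fill_value=0)

predictions = clf.predict(X_new)
print(predictions)
```

The reindex step matters because get_dummies only creates columns for the levels it sees, so a new batch of data can otherwise end up with a different set of columns than the model was trained on.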
jay s answered Dec 27 '22 01:12
