What is the way to represent factor variables in scikit-learn while using Random Forests?

Question

I am solving a classification problem using Random Forests, and I have decided to use the Python library scikit-learn. I am new to both the Random Forest algorithm and this tool. My data contains many factor (categorical) variables. From what I have read, it is not right to give factor variables numerical values the way we do in linear regression, because the model will treat them as continuous variables and give wrong results. But I could not find anything about how to deal with factor variables in scikit-learn. Please tell me what options are available, or point me to documentation that covers this.

jay s · Accepted Answer

If you are using a pandas DataFrame, you can use the get_dummies function to one-hot encode the factor variables. Here's an example:

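import pandas as pd

# Two categorical (factor) columns
my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])

# One-hot encode var1; dtype=int keeps 0/1 output on newer pandas,
# which would otherwise return True/False
dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_', dtype=int)
print(dummy_ranks)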

   var1__a  var1__b  var1__c  var1__d
0        1        0        0        0
1        0        1        0        0
2        0        0        1        0
3        0        0        0        1
4        1        0        0        0

[5 rows x 4 columns]
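To connect this back to the Random Forest part of the question, here is a minimal sketch of feeding the encoded columns to scikit-learn. It reuses the toy data from the answer; the class labels in y, the variable names, and the choice of n_estimators are assumptions made up for illustration and are not part of the original post.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the answer; the labels in `y` are hypothetical.
my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])
y = [0, 1, 0, 1, 0]  # made-up class labels, for illustration only

# One-hot encode every categorical column in a single call.
X = pd.get_dummies(df, columns=['var1', 'var2'], dtype=int)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# New data has to be encoded the same way and aligned to the training
# columns; categories missing from the new data become all-zero columns.
new_df = pd.DataFrame([['a', 'c']], columns=['var1', 'var2'])
X_new = pd.get_dummies(new_df, columns=['var1', 'var2'], dtype=int)
X_new = X_new.reindex(columns=X.columns, fill_value=0)

pred = clf.predict(X_new)
print(pred)
```

The reindex step matters because get_dummies only creates columns for the categories it sees, so a new batch of data can otherwise end up with a different set of columns than the model was trained on.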

Tags: text-mining, scikit-learn, random-forest

Prince Kumar
