I am solving a classification problem using Random Forests, and I have decided to use the Python library scikit-learn for it. I am new to both the Random Forest algorithm and this tool. My data contains many factor (categorical) variables. From searching, I found that it is not right to assign numerical values to factor variables as we do in linear regression, because the model will treat them as continuous variables and give wrong results. However, I could not find anything about how to deal with factor variables in scikit-learn. Please tell me what options I have, or point me to some documentation that covers this.
If you are using a pandas DataFrame, you can easily use the get_dummies function to accomplish this. Here's an example:
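import pandas as pd

my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])
# One-hot encode var1. The prefix 'var1_' plus the default '_' separator
# yields names like 'var1__a'. dtype=int keeps 0/1 values on newer pandas
# versions, which return booleans by default.
dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_', dtype=int)
print(dummy_ranks)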
   var1__a  var1__b  var1__c  var1__d
0        1        0        0        0
1        0        1        0        0
2        0        0        1        0
3        0        0        0        1
4        1        0        0        0
[5 rows x 4 columns]
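To connect this back to the Random Forest question, here is a minimal sketch of how the dummy columns might be fed into scikit-learn. It reuses the toy data from the example, but the labels in y, the new rows in new_df, and the parameter values (n_estimators=10, random_state=0) are made up for illustration and do not come from the original question. It assumes a pandas version that supports the dtype argument of get_dummies.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the example, plus a hypothetical target.
my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])
y = [0, 1, 0, 1, 0]  # hypothetical class labels, for illustration only

# Passing the whole frame one-hot encodes every string column at once.
X = pd.get_dummies(df, dtype=int)

# Parameter values here are arbitrary choices for the sketch.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# New rows have to be encoded the same way and aligned to the training
# columns; categories absent from the new data become all-zero columns.
new_df = pd.DataFrame([['a', 'c'], ['b', 'b']], columns=['var1', 'var2'])
X_new = pd.get_dummies(new_df, dtype=int).reindex(columns=X.columns, fill_value=0)
pred = clf.predict(X_new)
print(pred)
```

The reindex step matters because get_dummies only creates columns for the categories it sees, so data encoded separately can otherwise end up with a different set of columns than the model was trained on.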