I have a pandas DataFrame with two categorical variables, an ID variable, and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder, which results in a sparse matrix.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers as OneHotEncoder needs integers as input
... remapping code ...
ohe.fit(df[['col_a', 'col_b']])
ohe.transform(df[['col_a', 'col_b']])
But I have no clue how to use this sparse matrix in a DecisionTreeClassifier, especially if I want to add other non-categorical variables to my DataFrame later on. Thanks!
EDIT: In reply to miraculixx's comment, I also tried the DataFrameMapper from sklearn-pandas:
mapper = DataFrameMapper([
    ('id_col', None),
    ('target_col', None),
    (['col_a'], OneHotEncoder()),
    (['col_b'], OneHotEncoder())
])
t = mapper.fit_transform(df)
But then I get this error:
TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).
OneHotEncoder encodes categorical integer features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True, otherwise it returns a 2-D array.
Encode categorical features as a one-hot numeric array. By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
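As a minimal sketch of how the sparse output can go straight into a tree: scikit-learn's DecisionTreeClassifier accepts SciPy sparse matrices, and scipy.sparse.hstack can append further numeric columns without densifying. The toy values and the num_col column below are made up for illustration; only col_a, col_b and target_col come from the question.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names follow the question, values are invented.
df = pd.DataFrame({
    "col_a": ["x", "y", "x", "z"],
    "col_b": ["p", "p", "q", "q"],
    "num_col": [0.5, 1.5, 2.5, 3.5],  # hypothetical non-categorical feature
    "target_col": [0, 1, 0, 1],
})

# With default settings the encoder returns a SciPy sparse matrix.
ohe = OneHotEncoder()
X_cat = ohe.fit_transform(df[["col_a", "col_b"]])

# Append the numeric column to the sparse block without converting to dense.
X_num = sparse.csr_matrix(df[["num_col"]].to_numpy())
X = sparse.hstack([X_cat, X_num], format="csr")

# Fit on the encoded matrix X, not on the original DataFrame.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, df["target_col"])
pred = clf.predict(X)
```

The same fitted ohe should be reused with transform (not fit_transform) on any new data so that the column layout stays identical.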
I see you are already using pandas, so why not use its get_dummies function?
import pandas as pd
df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])
Result:
   name age-group
0  rick     young
1  phil       old
2  john  teenager
Now encode it with get_dummies:
pd.get_dummies(df)
Result:
   name_john  name_phil  name_rick  age-group_old  age-group_teenager  age-group_young
0          0          0          1              0                   0                1
1          0          1          0              1                   0                0
2          1          0          0              0                   1                0
You can then pass the new pandas DataFrame directly to scikit-learn's DecisionTreeClassifier.
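A short sketch of that last step, reusing the example frame from above. The labels in y are hypothetical and exist only so the classifier has something to fit against; they are not part of the original example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame(
    [["rick", "young"], ["phil", "old"], ["john", "teenager"]],
    columns=["name", "age-group"],
)
# Hypothetical target labels, invented for this illustration.
y = [1, 0, 1]

# get_dummies returns a dense DataFrame of indicator columns.
X = pd.get_dummies(df)

# The indicator DataFrame can be passed to fit/predict as-is.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)
```

One thing to watch with this approach: get_dummies builds its columns from whatever values are present, so new data with different categories would need to be reindexed to the training columns before calling predict.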
Look at this example from scikit-learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py
The problem is that you are not passing the sparse matrix to xx.fit(); you are passing the original data.
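One way to make that mistake hard to repeat is to wrap the encoder and the classifier in a single scikit-learn Pipeline, so fit always sees the encoded features. This is a different technique from the one in the question (ColumnTransformer instead of calling OneHotEncoder by hand), sketched here under the assumption of a reasonably recent scikit-learn; the data and the num_col column are again hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names follow the question, values are invented.
df = pd.DataFrame({
    "col_a": ["x", "y", "x", "z"],
    "col_b": ["p", "p", "q", "q"],
    "num_col": [0.5, 1.5, 2.5, 3.5],  # hypothetical non-categorical feature
    "target_col": [0, 1, 0, 1],
})
features = ["col_a", "col_b", "num_col"]

model = Pipeline([
    # One-hot encode the categorical columns, pass the others through unchanged.
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["col_a", "col_b"])],
        remainder="passthrough",
    )),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# The pipeline encodes internally, so the raw feature columns go in here.
model.fit(df[features], df["target_col"])
pred = model.predict(df[features])
```

Leaving the ID and target columns out of features keeps them from being treated as predictors.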