Say I have a dataframe like the following:
A B
0 bar one
1 bar three
2 flux six
3 bar three
4 foo five
5 flux one
6 foo two
I would like to apply a dummy-coding contrast to it so that I get:
A B
0 0 0
1 0 2
2 1 1
3 0 2
4 2 3
5 1 0
6 2 4
(i.e. mapping every unique value to a different integer, per column).
I have tried using scikit-learn's DictVectorizer, but I get:
> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer = DV(sparse=False)
> dict_to_vectorize = df.T.to_dict().values()
> df_vec = vectorizer.fit_transform(dict_to_vectorize)
> df_vec
array([[ 1., 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 1.]])
This is because scikit-learn's DictVectorizer is designed to output a one-of-K encoding. What I want instead is a simple integer encoding (one column per variable).
How can I do this with scikit-learn and/or pandas? Aside from that, are there any other Python packages that help with general contrasting methods?
You could use pd.factorize, which numbers each column's unique values in order of first appearance:
In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]:
A B
0 0 0
1 0 1
2 1 2
3 0 1
4 2 3
5 1 0
6 2 4
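If you also need to know which integer stands for which label, here is a minimal sketch that keeps the second return value of pd.factorize (the unique values). The names codes, mappings, decoded and sorted_codes are made up for illustration, and the dataframe is rebuilt from the question so the snippet runs on its own. The last line shows an alternative, assuming you would rather have codes that follow the sorted order of the labels than their order of first appearance.

```python
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

codes = pd.DataFrame(index=df.index)  # integer-coded copy of df
mappings = {}                         # per column: {code: original label}

for col in df.columns:
    # factorize returns (integer codes, unique values in first-seen order)
    col_codes, uniques = pd.factorize(df[col])
    codes[col] = col_codes
    mappings[col] = dict(enumerate(uniques))

# Decode back to the original labels using the stored mappings.
decoded = codes.apply(lambda s: s.map(mappings[s.name]))

# Alternative: categorical codes, numbered by sorted label order.
sorted_codes = df.apply(lambda s: s.astype("category").cat.codes)

print(codes)
print(mappings)
```

Which integer a label gets is arbitrary in both variants; what matters is that it is consistent within a column. If the data later goes into a scikit-learn pipeline, sklearn.preprocessing.OrdinalEncoder does a similar per-column integer encoding.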