Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorizing / Contrasting a Dataframe with Categorical Variables

Say I have a dataframe like the following:

      A      B
0   bar    one
1   bar  three
2  flux    six
3   bar  three
4   foo   five
5  flux    one
6   foo    two

I would like to apply dummy-coding contrasting on it so that I get:

    A    B
0   0    0
1   0    2
2   1    1
3   0    2
4   2    3
5   1    0
6   2    4

(i.e. mapping every unique value to a different integer, per column).

I have tried using scikit-learn's DictVectorizer, but I get:

> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer        = DV( sparse = False )
> dict_to_vectorize = df.T.to_dict().values()
> df_vec            = vectorizer.fit_transform(dict_to_vectorize )
> df_vec
array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

This is because scikit-learn's DictVectorizer is designed to output one-of-K encoding. What I want is a simple-encoding instead (one column per variable).

How can I do this with scikit-learn and/or pandas? Aside from that, are there any other Python packages that help with general contrasting methods?

like image 620
Amelio Vazquez-Reina Avatar asked Dec 15 '22 19:12

Amelio Vazquez-Reina


1 Answers

You could use pd.factorize:

In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]: 
   A  B
0  0  0
1  0  1
2  1  2
3  0  1
4  2  3
5  1  0
6  2  4
like image 132
unutbu Avatar answered Apr 09 '23 07:04

unutbu