Vectorizing / Contrasting a Dataframe with Categorical Variables

Question

Say I have a dataframe like the following:

      A      B
0   bar    one
1   bar  three
2  flux    six
3   bar  three
4   foo   five
5  flux    one
6   foo    two

I would like to apply a contrast coding to it that maps every unique value to a different integer, per column.

I have tried using scikit-learn's DictVectorizer, but I get:

> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer = DV(sparse=False)
> dict_to_vectorize = df.T.to_dict().values()
> df_vec = vectorizer.fit_transform(dict_to_vectorize)
> df_vec
array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

This is because scikit-learn's DictVectorizer is designed to output a one-of-K (one-hot) encoding. What I want instead is a simple integer encoding, with one column per variable.

How can I do this with scikit-learn and/or pandas? Aside from that, are there any other Python packages that help with general contrasting methods?

unutbu · Accepted Answer

You could use pd.factorize, which returns a tuple of integer codes and the unique values; the [0] below selects the codes:

In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]: 
   A  B
0  0  0
1  0  1
2  1  2
3  0  1
4  2  3
5  1  0
6  2  4
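
If the codes need to be mapped back to the original labels later, the unique values that pd.factorize returns can be kept alongside the codes. The following is a minimal sketch, assuming the dataframe from the question; the names codes, uniques, encoded, decoded and sorted_codes are illustrative and not part of the original answer.

```python
import pandas as pd

# The dataframe from the question, rebuilt by hand.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

# Factorize each column separately, keeping the uniques for decoding.
codes = {}
uniques = {}
for col in df.columns:
    codes[col], uniques[col] = pd.factorize(df[col])

# Integer-coded frame: codes are assigned in order of first appearance.
encoded = pd.DataFrame(codes, index=df.index)

# Decode by indexing each column's uniques with its integer codes.
decoded = pd.DataFrame(
    {col: uniques[col].to_numpy()[encoded[col].to_numpy()] for col in df.columns},
    index=df.index,
)

# Alternative: categorical codes, which follow the sorted order of the
# labels rather than their order of appearance.
sorted_codes = df.apply(lambda s: s.astype("category").cat.codes)
```

With pd.factorize the integers depend on the order in which values first appear, so the same label can receive a different code on another dataframe. Reusing the stored uniques, or a fixed categorical dtype, is one way to keep the mapping stable.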

Tags: python, pandas, scikit-learn, statsmodels

Amelio Vazquez-Reina
