
pandas equivalent of Stata's encode

I'm looking for a way to replicate the behaviour of Stata's encode, which converts a categorical string column into a numeric column.

x = pd.DataFrame({'cat':['A','A','B'], 'val':[10,20,30]})
x = x.set_index('cat')

Which results in:

     val
cat     
A     10
A     20
B     30

I'd like to convert the cat column from strings to integers, mapping each unique string to an (arbitrary) integer 1-to-1. It would result in:

     val
cat     
1     10
1     20
2     30

Or, just as good:

  cat  val
0   1   10
1   1   20
2   2   30

Any suggestions?

Many thanks as always, Rob

LondonRob asked Dec 16 '13 20:12


2 Answers

You could use pd.factorize:

import pandas as pd

x = pd.DataFrame({'cat':('A','A','B'), 'val':(10,20,30)})
labels, levels = pd.factorize(x['cat'])
x['cat'] = labels
x = x.set_index('cat')
print(x)

yields

     val
cat     
0     10
0     20
1     30

You could add 1 to labels if you wish to replicate Stata's behaviour:

x['cat'] = labels+1
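
A minimal sketch of the same idea that also keeps a lookup from each integer back to its original string. It assumes Stata's encode numbers the values in alphabetical order, which is why sort=True is passed; without it, pd.factorize numbers values in order of first appearance. The names cat_code and code_to_label are made up for this illustration.

```python
import pandas as pd

# 'B' appears first here so the effect of sort=True is visible.
x = pd.DataFrame({'cat': ['B', 'A', 'A'], 'val': [10, 20, 30]})

# sort=True numbers the unique values in sorted order rather than
# in order of first appearance.
labels, levels = pd.factorize(x['cat'], sort=True)
x['cat_code'] = labels + 1  # shift from 0-based to 1-based codes

# Hypothetical lookup table: integer code -> original string.
code_to_label = {i + 1: level for i, level in enumerate(levels)}

print(x)
print(code_to_label)
```

Keeping levels around matters because the integer column on its own no longer records which string each code stood for.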
unutbu answered Sep 17 '22 12:09


Stata's encode command starts with a string variable and creates a new integer variable whose value labels map back to the original strings. The direct analog in pandas is now the categorical data type, which became a full-fledged part of pandas in version 0.15 (released after this question was originally asked and answered).

See the pandas documentation on categorical data.

To demonstrate for this example, the Stata command would be something like:

encode cat, generate(cat2)

whereas the pandas command would be:

x['cat2'] = x['cat'].astype('category')

  cat  val cat2
0   A   10    A
1   A   20    A
2   B   30    B

Just as Stata does with encode, the data are stored as integers, but display as strings in the default output.

You can verify this by using the categorical accessor cat to see the underlying integers. (For that reason, you probably don't want to use 'cat' as a column name.)

x['cat2'].cat.codes

0    0
1    0
2    1
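
If 1-based integers are wanted, as in the question, the codes can be shifted and the categories listed alongside them. This is only a sketch: cat2_code and value_labels are names invented for the example, and the comparison to Stata's value labels is a loose analogy.

```python
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x['cat2'] = x['cat'].astype('category')

# 1-based integer codes. pandas uses -1 for missing values,
# so a missing entry would become 0 after the shift.
x['cat2_code'] = x['cat2'].cat.codes + 1

# Rough counterpart of a value-label listing: code -> original string.
value_labels = dict(enumerate(x['cat2'].cat.categories, start=1))

print(x)
print(value_labels)
```

The categorical column itself can stay in the frame, so the strings are still shown by default while the integer codes remain available for numeric work.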
JohnE answered Sep 16 '22 12:09