I'm looking for a way to replicate the encode behaviour in Stata, which will convert a categorical string column into a number column.
import pandas as pd
x = pd.DataFrame({'cat':['A','A','B'], 'val':[10,20,30]})
x = x.set_index('cat')
Which results in:
     val
cat
A     10
A     20
B     30
I'd like to convert the cat column from strings to integers, mapping each unique string to an (arbitrary) integer 1-to-1. It would result in:
     val
cat
1     10
1     20
2     30
Or, just as good:
   cat  val
0    1   10
1    1   20
2    2   30
Any suggestions?
Many thanks as always, Rob
You could use pd.factorize:
import pandas as pd
x = pd.DataFrame({'cat':('A','A','B'), 'val':(10,20,30)})
labels, levels = pd.factorize(x['cat'])
x['cat'] = labels
x = x.set_index('cat')
print(x)
yields
     val
cat
0     10
0     20
1     30
You could add 1 to labels if you wish to replicate Stata's behaviour:
x['cat'] = labels+1
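One more difference worth knowing: pd.factorize numbers values in order of first appearance, whereas Stata's encode assigns codes in alphabetical order of the strings. A minimal sketch of how to get closer to that, using the sort=True argument of pd.factorize. The column name cat_code and the value_labels dict are just names made up for this illustration:

```python
import pandas as pd

# Rows deliberately out of alphabetical order to show the effect of sort=True
x = pd.DataFrame({'cat': ['B', 'A', 'A'], 'val': [30, 10, 20]})

# sort=True sorts the unique values before numbering them,
# so 'A' gets code 0 and 'B' gets code 1 regardless of row order
codes, uniques = pd.factorize(x['cat'], sort=True)

# Shift to 1-based codes, as Stata's encode does
x['cat_code'] = codes + 1

# Hypothetical lookup table from integer code back to the original string,
# playing the role of Stata's value labels
value_labels = {i + 1: label for i, label in enumerate(uniques)}

print(x)
print(value_labels)
```

Keeping uniques around (or a dict built from it, as here) is what lets you translate the integers back to the original strings later.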
Stata's encode command starts with a string variable and creates a new integer variable whose value labels map back to the original strings. The direct analog in pandas is now the categorical dtype, which became a full-fledged part of pandas in version 0.15 (released after this question was originally asked and answered).
See the pandas documentation on categorical data.
To demonstrate for this example, the Stata command would be something like:
encode cat, generate(cat2)
whereas the pandas command would be:
x['cat2'] = x['cat'].astype('category')
  cat  val cat2
0   A   10    A
1   A   20    A
2   B   30    B
Just as Stata does with encode, the data are stored as integers, but display as strings in the default output.
You can verify this by using the categorical accessor cat to see the underlying integers. (And for that reason you probably don't want to use 'cat' as a column name.)
x['cat2'].cat.codes
0    0
1    0
2    1
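If you want to see which string each code stands for, or need 1-based integers like Stata's, something along these lines should work. It is only a sketch: the names code_to_label and cat2_code are invented for the example, and it assumes the column has no missing values (pandas uses code -1 for those).

```python
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x['cat2'] = x['cat'].astype('category')

# .cat.categories holds the distinct labels; a label's position in it
# is the integer code that pandas stores for that label
code_to_label = dict(enumerate(x['cat2'].cat.categories))

# .cat.codes are 0-based; add 1 for Stata-style numbering
# (assumes no missing values, which pandas encodes as -1)
x['cat2_code'] = x['cat2'].cat.codes + 1

print(code_to_label)
print(x)
```

The categorical column itself keeps both pieces together, so the separate integer column is only needed if something downstream wants plain numbers.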