pandas.factorize
encodes input values as an enumerated type or categorical variable.
But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step?
Example: This data frame contains columns with string values such as "type 2" which I would like to convert to numerical values - and possibly translate them back later.
factorize() method helps to get the numeric representation of an array by identifying distinct values.
The values property is used to get a Numpy representation of the DataFrame. Only the values in the DataFrame will be returned, the axes labels will be removed. The values of the DataFrame. A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
A data frame consists of data, which is arranged in rows and columns, and row and column labels. You can easily select, slice or take a subset of the data in several different ways, for example by using labels, by index location, by value and so on. Here we demonstrate some of these operations using a sample DataFrame.
You can use apply
if you need to factorize
each column separately:
df = pd.DataFrame({'A':['type1','type2','type2'], 'B':['type1','type2','type3'], 'C':['type1','type3','type3']}) print (df) A B C 0 type1 type1 type1 1 type2 type2 type3 2 type2 type3 type3 print (df.apply(lambda x: pd.factorize(x)[0])) A B C 0 0 0 0 1 1 1 1 2 1 2 1
If you need for the same string value the same numeric one:
print (df.stack().rank(method='dense').unstack()) A B C 0 1.0 1.0 1.0 1 2.0 2.0 3.0 2 2.0 3.0 3.0
If you need to apply the function only for some columns, use a subset:
df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack() print (df) A B C 0 type1 1.0 1.0 1 type2 2.0 3.0 2 type2 3.0 3.0
Solution with factorize
:
stacked = df[['B','C']].stack() df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack() print (df) A B C 0 type1 0 0 1 type2 1 2 2 type2 2 2
Translate them back is possible via map
by dict
, where you need to remove duplicates by drop_duplicates
:
vals = df.stack().drop_duplicates().values b = [x for x in df.stack().drop_duplicates().rank(method='dense')] d1 = dict(zip(b, vals)) print (d1) {1.0: 'type1', 2.0: 'type2', 3.0: 'type3'} df1 = df.stack().rank(method='dense').unstack() print (df1) A B C 0 1.0 1.0 1.0 1 2.0 2.0 3.0 2 2.0 3.0 3.0 print (df1.stack().map(d1).unstack()) A B C 0 type1 type1 type1 1 type2 type2 type3 2 type2 type3 type3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With