
pandas.factorize on an entire data frame

pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step?

Example: This data frame contains columns with string values such as "type 2" which I would like to convert to numerical values - and possibly translate them back later.

(Image of the example data frame; not reproduced here.)

asked by clstaudt, Sep 08 '16 at 11:09


1 Answer

You can use apply if you need to factorize each column separately:

    import pandas as pd

    df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                       'B': ['type1', 'type2', 'type3'],
                       'C': ['type1', 'type3', 'type3']})
    print(df)
           A      B      C
    0  type1  type1  type1
    1  type2  type2  type3
    2  type2  type3  type3

    # factorize every column independently; [0] keeps only the codes
    print(df.apply(lambda x: pd.factorize(x)[0]))
       A  B  C
    0  0  0  0
    1  1  1  1
    2  1  2  1
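Note that `pd.factorize(x)[0]` throws away the unique values, so the codes above cannot be translated back on their own. A minimal sketch of keeping them around for the reverse step, assuming the same `df` as above (the names `codes`, `uniques`, `encoded` and `decoded` are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Factorize each column on its own and remember its unique values.
codes = {}
uniques = {}  # hypothetical helper: column name -> unique values of that column
for col in df.columns:
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes, index=df.index)

# Reverse step: each code is a position in that column's unique values.
decoded = pd.DataFrame(
    {col: uniques[col].to_numpy()[encoded[col].to_numpy()]
     for col in encoded.columns},
    index=encoded.index)
```

Here the same code can stand for different strings in different columns (1 is 'type2' in A but 'type3' in C), so the per-column `uniques` are needed to decode. Missing values get the code -1 from `factorize` by default, which this sketch does not handle.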

If the same string value should get the same number in every column:

    print(df.stack().rank(method='dense').unstack())
         A    B    C
    0  1.0  1.0  1.0
    1  2.0  2.0  3.0
    2  2.0  3.0  3.0

If you need to apply the function to only some of the columns, select a subset:

    df[['B', 'C']] = df[['B', 'C']].stack().rank(method='dense').unstack()
    print(df)
           A    B    C
    0  type1  1.0  1.0
    1  type2  2.0  3.0
    2  type2  3.0  3.0

Solution with factorize, which gives integer codes starting at 0:

    stacked = df[['B', 'C']].stack()
    df[['B', 'C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
    print(df)
           A  B  C
    0  type1  0  0
    1  type2  1  2
    2  type2  2  2

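Translating them back is possible with map and a dict. To build the dict you need each value only once, so remove duplicates with drop_duplicates (this starts again from the original df, where all columns hold strings):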

    vals = df.stack().drop_duplicates().values
    b = [x for x in df.stack().drop_duplicates().rank(method='dense')]

    d1 = dict(zip(b, vals))
    print(d1)
    {1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

    df1 = df.stack().rank(method='dense').unstack()
    print(df1)
         A    B    C
    0  1.0  1.0  1.0
    1  2.0  2.0  3.0
    2  2.0  3.0  3.0

    print(df1.stack().map(d1).unstack())
           A      B      C
    0  type1  type1  type1
    1  type2  type2  type3
    2  type2  type3  type3
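The same round trip can probably also be done with factorize instead of rank, because `factorize` returns the unique values together with the codes. A rough sketch, assuming the original all-string `df` and that only columns B and C should be encoded (`cols`, `mapping`, `encoded` and `decoded` are names invented for this example):

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

cols = ['B', 'C']  # assumed subset of columns to encode
stacked = df[cols].stack()

# One shared factorization, so equal strings get equal codes in both columns.
codes, uniques = stacked.factorize()

encoded = df.copy()
encoded[cols] = pd.Series(codes, index=stacked.index).unstack()

# Reverse mapping: code -> original string.
mapping = dict(enumerate(uniques))

decoded = encoded.copy()
decoded[cols] = encoded[cols].apply(lambda s: s.map(mapping))
```

Compared with the rank approach, the codes stay integers and follow the order of first appearance rather than the sort order of the strings.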
answered by jezrael, Oct 18 '22 at 18:10