<code>pandas.factorize</code> encodes input values as an enumerated type or categorical variable. But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step? Example: This data frame contains columns with string values such as "type 2" which I would like to convert to numerical values - and possibly translate them back later. <img src="https://i.stack.imgur.com/MLATh.png" alt="enter image description here">

You can use <code>apply</code> if you need to <code>factorize</code> each column separately:
<pre class="prettyprint"><code>import pandas as pd

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print(df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

# each column gets its own, independent code table
print(df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
</code></pre>
If the same string value must get the same number in every column, stack the frame into one Series first:
<pre class="prettyprint"><code>print(df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0
</code></pre>
<hr>
If you need to apply the function to only some columns, use a subset:
<pre class="prettyprint"><code>df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print(df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0
</code></pre>
The same with <code>factorize</code>, starting again from the original <code>df</code>:
<pre class="prettyprint"><code>stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print(df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
</code></pre>
Translating the numbers back is possible with <code>map</code> and a <code>dict</code>. To build the dict, first remove duplicates with <code>drop_duplicates</code> (again starting from the original, all-string <code>df</code>):
<pre class="prettyprint"><code># unique strings and their dense ranks, in the same order
vals = df.stack().drop_duplicates().values
b = [x for x in df.stack().drop_duplicates().rank(method='dense')]

d1 = dict(zip(b, vals))
print(d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()
print(df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

# map each rank back to its string
print(df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
</code></pre>
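As a complement to the answer above, here is a minimal sketch of a round trip that relies on the second value returned by `pd.factorize` (the unique values) instead of building the dict from `rank`. The names `encoded`, `code_to_value`, `decoded`, `per_column`, `per_column_uniques` and `restored` are made up for this illustration, and the sketch assumes the frame contains no missing values (`factorize` codes those as -1, which would need separate handling).

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Shared mapping: stack the frame into one Series and factorize it once,
# so the same string gets the same code in every column.
stacked = df.stack()
codes, uniques = pd.factorize(stacked)
encoded = pd.Series(codes, index=stacked.index).unstack()

# Reverse step: each code is a position in `uniques`.
code_to_value = dict(enumerate(uniques))
decoded = encoded.apply(lambda col: col.map(code_to_value))

# Per-column mapping: factorize every column on its own and keep
# that column's uniques so it can be decoded later.
per_column = pd.DataFrame(index=df.index)
per_column_uniques = {}
for name in df.columns:
    col_codes, col_uniques = pd.factorize(df[name])
    per_column[name] = col_codes
    per_column_uniques[name] = col_uniques.to_numpy()

# Decode by indexing each column's uniques with its codes.
restored = pd.DataFrame(
    {name: per_column_uniques[name][per_column[name].to_numpy()]
     for name in per_column.columns},
    index=df.index,
)

print(encoded)
print(decoded)
print(restored)
```

Keeping the uniques next to the encoded frame avoids recomputing the mapping with `drop_duplicates`, and the codes stay integers rather than the floats that `rank` produces.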

pandas.factorize on an entire data frame


answered Oct 18 '22 18:10

jezrael


clstaudt
