
pandas dataframe character columns to integer

Tags: python, pandas

I have a DataFrame like the one below:

+--------------+--------------+----+-----+-------+
|      x1      |      x2      | km | gmm | class |
+--------------+--------------+----+-----+-------+
|  180.9863129 | -0.266379416 | 24 |  19 | T     |
|  52.20132828 |  28.93587875 | 16 |  14 | I     |
| -17.17127419 |  29.97013283 | 17 |  16 | D     |
|  37.28710938 | -69.96691132 |  3 |   6 | N     |
| -132.2395782 |  27.02541733 | 15 |  18 | G     |
| -12.52811623 | -87.90951538 | 22 |   5 | S     |
+--------------+--------------+----+-----+-------+

The class values are single letters (A to Z). I want to convert them to integers so that A=1, B=2, ..., Z=26.

For a plain Python list, I can convert each character c with ord(c.lower()) - ord('a') + 1

How can I do the same for a DataFrame column?

Jadu Sen asked Mar 06 '23 13:03


1 Answer

Option 1
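Assuming your column contains only single uppercase characters, you can view the characters as their integer code points and do a little arithmetic on that view (ord('A') is 65, so subtracting 64 gives A=1):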

import numpy as np
df['class'] = df['class'].values.astype('<U1').view(np.uint32) - 64

df
           x1         x2  km  gmm  class
0  180.986313  -0.266379  24   19     20
1   52.201328  28.935879  16   14      9
2  -17.171274  29.970133  17   16      4
3   37.287109 -69.966911   3    6     14
4 -132.239578  27.025417  15   18      7
5  -12.528116 -87.909515  22    5     19

This is the fastest method I can think of for large data.
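To make the view trick easier to follow, here is a minimal step-by-step sketch. The three-row sample frame and the intermediate names chars and codes are made up for illustration, and the final cast to int64 is an extra step not in the one-liner above (it just avoids carrying an unsigned dtype around). Like the one-liner, it assumes a little-endian machine.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; it mirrors the first rows of the question's data.
df = pd.DataFrame({"km": [24, 16, 17], "class": ["T", "I", "D"]})

# Fixed-width unicode array: every element takes exactly 4 bytes (one code point).
chars = df["class"].to_numpy(dtype=object).astype("<U1")

# Reinterpret those 4 bytes as an unsigned 32-bit integer, i.e. the code point.
codes = chars.view(np.uint32)

# ord('A') == 65, so subtracting 64 maps A -> 1, ..., Z -> 26.
df["class"] = codes.astype(np.int64) - 64

print(df)
```

No Python-level loop runs over the rows here; the conversion is a dtype cast plus a reinterpretation of the same memory, which is where the speed on large data comes from.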

If the column may contain inconsistent data (lowercase letters or strings longer than one character), consider a preprocessing step like this:

df['class'] = df['class'].str.upper().str[0]
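As a rough sketch of how the cleanup and the conversion fit together, the following runs both on a deliberately messy Series. The sample values and the names s, clean, and codes are hypothetical, and the sketch assumes there are no missing values (a NaN would need to be dropped or filled first).

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: lowercase letters and a multi-character string.
s = pd.Series(["t", "I", "dx", "N"])

# Uppercase everything, then keep only the first character of each value.
clean = s.str.upper().str[0]

# Same arithmetic as Option 1, applied to the cleaned values.
chars = clean.to_numpy(dtype=object).astype("<U1")
codes = chars.view(np.uint32).astype(np.int64) - 64

print(codes)
```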

Option 2
Use ord on each character:

df['class'] = [ord(c) - 64 for c in df['class']]

Or,

df['class'] = df['class'].apply(ord) - 64

df
           x1         x2  km  gmm  class
0  180.986313  -0.266379  24   19     20
1   52.201328  28.935879  16   14      9
2  -17.171274  29.970133  17   16      4
3   37.287109 -69.966911   3    6     14
4 -132.239578  27.025417  15   18      7
5  -12.528116 -87.909515  22    5     19
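If you would rather have unexpected values show up as missing instead of silently producing an out-of-range number, one alternative (not one of the two options above) is an explicit lookup table with Series.map. This is only a sketch; the name letter_to_num and the sample values are made up for illustration.

```python
import string

import pandas as pd

# Lookup table: {'A': 1, 'B': 2, ..., 'Z': 26}
letter_to_num = {ch: i for i, ch in enumerate(string.ascii_uppercase, start=1)}

# Illustrative input with one value that is not an uppercase letter.
s = pd.Series(["T", "I", "D", "?"])

# Values missing from the table become NaN, so the result has a float dtype.
mapped = s.map(letter_to_num)

print(mapped)
```

This will generally be slower than the view trick in Option 1, but invalid input is easy to spot afterwards with mapped.isna().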
cs95 answered Mar 27 '23 04:03