
Pandas convert string to int

Tags: python, pandas

I have a large dataframe with ID numbers:

ID.head()
Out[64]: 
0    4806105017087
1    4806105017087
2    4806105017087
3    4901295030089
4    4901295030089

These are all strings at the moment.

I want to convert them to int without using loops; for this I use ID.astype(int).

The problem is that some of my rows contain dirty data that cannot be converted to int, e.g.

ID[154382]
Out[58]: 'CN414149'

How can I (without using loops) remove these kinds of occurrences so that I can use astype with peace of mind?

gmarais asked Mar 10 '17 13:03



1 Answer

You need to pass the parameter errors='coerce' to the function to_numeric:

ID = pd.to_numeric(ID, errors='coerce')

If ID is a column:

df.ID = pd.to_numeric(df.ID, errors='coerce')

However, non-numeric values are converted to NaN, so the whole column becomes float.

To get int, you need to replace NaN with some value, e.g. 0, and then cast to int (this requires import numpy as np):

df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})
print(df)
              ID
0  4806105017087
1  4806105017087
2       CN414149

print (pd.to_numeric(df.ID, errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: ID, dtype: float64

df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
print (df)
              ID
0  4806105017087
1  4806105017087
2              0
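
The question asked about removing the dirty rows rather than replacing them with 0. A minimal sketch of that variant, assuming the same sample data as above (the names numeric_id and clean are only illustrative), could look like this:

```python
import pandas as pd

# Sample data mirroring the example above (illustrative only)
df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})

# Unparseable strings become NaN instead of raising an error
numeric_id = pd.to_numeric(df['ID'], errors='coerce')

# Keep only the rows that converted, then cast the survivors to int64
clean = df.loc[numeric_id.notna()].copy()
clean['ID'] = numeric_id[numeric_id.notna()].astype('int64')

print(clean)
```

Because the conversion passes through float64, this is exact only for integers up to 2**53; the 13-digit IDs shown here are well within that range.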

EDIT: If you use pandas 0.25+, it is possible to use the nullable integer dtype (Int64), which keeps missing values without falling back to float:

df.ID = pd.to_numeric(df.ID, errors='coerce').astype('Int64')
print (df)
              ID
0  4806105017087
1  4806105017087
2            NaN
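
Before discarding or filling the failed values, it can help to see which original strings did not convert. The following is a rough sketch under the same sample data; converted, bad_rows and nullable_id are hypothetical names, not part of the original answer:

```python
import pandas as pd

# Sample data mirroring the example above (illustrative only)
df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})

converted = pd.to_numeric(df['ID'], errors='coerce')

# Original values that were present but could not be parsed as numbers
bad_rows = df.loc[converted.isna() & df['ID'].notna(), 'ID']
print(bad_rows.tolist())

# Nullable integer column: valid IDs stay integers, failures stay missing
nullable_id = converted.astype('Int64')
print(nullable_id.isna().sum())
```

In recent pandas versions the missing entry of an Int64 column is displayed as <NA> rather than NaN.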
jezrael answered Oct 06 '22 23:10