Consider the following situation:
In [2]: a = pd.Series([1,2,3,4,'.'])
In [3]: a
Out[3]:
0 1
1 2
2 3
3 4
4 .
dtype: object
In [8]: a.astype('float64', raise_on_error = False)
Out[8]:
0 1
1 2
2 3
3 4
4 .
dtype: object
I would have expected an option that allows conversion while turning erroneous values (such as that .
) to NaN
s. Is there a way to achieve this?
pandas Convert String to FloatUse pandas DataFrame. astype() function to convert column from string/int to float, you can apply this on a specific column or on an entire DataFrame. To cast the data type to 54-bit signed float, you can use numpy. float64 , numpy.
Using astype() astype() method is used to cast a pandas column to the specified dtype.
Change data type of a series in PandasUse a numpy. dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy. dtype or Python type to cast one or more of the DataFrame's columns to column-specific types.
To convert a pandas Series to a list, simply call the tolist() method on the series which you wish to convert.
pd.to_numeric
with errors='coerce'
# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s
0 1
1 2
2 3
3 4
4 .
dtype: object
pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
dtype: float64
If you need the NaN
s filled in, use Series.fillna
.
pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')
0 1
1 2
2 3
3 4
4 0
dtype: float64
Note, downcast='infer'
will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.
From v0.24+, pandas introduces a Nullable Integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use
pd.__version__ # '0.24.1' pd.to_numeric(s, errors='coerce').astype('Int32') 0 1 1 2 2 3 3 4 4 NaN dtype: Int32
There are other options to choose from as well, read the docs for more.
DataFrames
If you need to extend this to DataFrames, you will need to apply it to each row. You can do this using DataFrame.apply
.
# Setup.
np.random.seed(0)
df = pd.DataFrame({
'A' : np.random.choice(10, 5),
'C' : np.random.choice(10, 5),
'B' : ['1', '###', '...', 50, '234'],
'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df
A B C D
0 5 1 9 23
1 0 ### 3 1
2 3 ... 5 ...
3 3 50 2 268
4 7 234 4 $$
df.dtypes
A int64
B object
C int64
D object
dtype: object
df2 = df.apply(pd.to_numeric, errors='coerce')
df2
A B C D
0 5 1.0 9 23.0
1 0 NaN 3 1.0
2 3 NaN 5 NaN
3 3 50.0 2 268.0
4 7 234.0 4 NaN
df2.dtypes
A int64
B float64
C int64
D float64
dtype: object
You can also do this with DataFrame.transform
; although my tests indicate this is marginally slower:
df.transform(pd.to_numeric, errors='coerce')
A B C D
0 5 1.0 9 23.0
1 0 NaN 3 1.0
2 3 NaN 5 NaN
3 3 50.0 2 268.0
4 7 234.0 4 NaN
If you have many columns (numeric; non-numeric), you can make this a little more performant by applying pd.to_numeric
on the non-numeric columns only.
df.dtypes.eq(object)
A False
B True
C False
D True
dtype: bool
cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
# df[c] = pd.to_numeric(df[c], errors='coerce')
df
A B C D
0 5 1.0 9 23.0
1 0 NaN 3 1.0
2 3 NaN 5 NaN
3 3 50.0 2 268.0
4 7 234.0 4 NaN
Applying pd.to_numeric
along the columns (i.e., axis=0
, the default) should be slightly faster for long DataFrames.
In [30]: pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
Out[30]:
0 1
1 2
2 3
3 4
4 NaN
dtype: float64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With