My question is very similar to this one, but I need to convert my entire dataframe instead of just a series. The to_numeric function only works on one series at a time and is not a good replacement for the deprecated convert_objects command. Is there a way to get results similar to convert_objects(convert_numeric=True) in the new pandas release?
Thank you Mike Müller for your example. df.apply(pd.to_numeric) works very well if the values can all be converted to integers. What if my dataframe had strings that could not be converted into integers? Example:
    df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
    df.dtypes
    Out[59]:
    Words    object
    ints     object
    dtype: object
Then I could run the deprecated function and get:
    df = df.convert_objects(convert_numeric=True)
    df.dtypes
    Out[60]:
    Words    object
    ints      int64
    dtype: object
Running the apply command gives me errors, even with try and except handling.
You can apply the function to all columns:
df.apply(pd.to_numeric)
Example:
    >>> df = pd.DataFrame({'a': ['1', '2'], 'b': ['45.8', '73.9'], 'c': [10.5, 3.7]})
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2 entries, 0 to 1
    Data columns (total 3 columns):
    a    2 non-null object
    b    2 non-null object
    c    2 non-null float64
    dtypes: float64(1), object(2)
    memory usage: 64.0+ bytes
    >>> df.apply(pd.to_numeric).info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2 entries, 0 to 1
    Data columns (total 3 columns):
    a    2 non-null int64
    b    2 non-null float64
    c    2 non-null float64
    dtypes: float64(2), int64(1)
    memory usage: 64.0 bytes
pd.to_numeric has the keyword argument errors:
    Signature: pd.to_numeric(arg, errors='raise')
    Docstring:
    Convert argument to a numeric type.

    Parameters
    ----------
    arg : list, tuple or array of objects, or Series
    errors : {'ignore', 'raise', 'coerce'}, default 'raise'
        - If 'raise', then invalid parsing will raise an exception
        - If 'coerce', then invalid parsing will be set as NaN
        - If 'ignore', then invalid parsing will return the input
Setting it to ignore will return the column unchanged if it cannot be converted into a numeric type.
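To make the difference between the other two modes concrete, here is a minimal sketch on a single Series. The sample values and the variable names (s, coerced, raised) are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical sample: two numeric strings and one word.
s = pd.Series(['3', '5', 'Kobe'])

# errors='coerce': values that cannot be parsed become NaN,
# so the result is a float column.
coerced = pd.to_numeric(s, errors='coerce')

# errors='raise' (the default): the first unparseable value raises.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

print(coerced)
print(raised)
```

Note that 'coerce' works value by value, while 'ignore' gives up on the whole column, so the two are not interchangeable for a mixed column like this one.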
As pointed out by Anton Protopopov, the most elegant way is to supply errors='ignore' as a keyword argument to apply():
    >>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
    >>> df.apply(pd.to_numeric, errors='ignore').info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2 entries, 0 to 1
    Data columns (total 2 columns):
    Words    2 non-null object
    ints     2 non-null int64
    dtypes: int64(1), object(1)
    memory usage: 48.0+ bytes
My previously suggested way, using partial from the functools module, is more verbose:
    >>> from functools import partial
    >>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
    >>> df.apply(partial(pd.to_numeric, errors='ignore')).info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2 entries, 0 to 1
    Data columns (total 2 columns):
    Words    2 non-null object
    ints     2 non-null int64
    dtypes: int64(1), object(1)
    memory usage: 48.0+ bytes
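As far as I know, newer pandas releases (2.2 and later) deprecate errors='ignore', so the calls above may warn or fail depending on your version. A minimal sketch of the same behaviour with an explicit try/except follows; the helper name to_numeric_or_keep is hypothetical and not part of pandas:

```python
import pandas as pd

def to_numeric_or_keep(column):
    """Return the column converted to a numeric dtype, or unchanged if that fails.

    Hypothetical helper that mimics what errors='ignore' does per column.
    """
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

# apply() calls the helper once per column.
converted = df.apply(to_numeric_or_keep)
print(converted.dtypes)
```

The try/except lives inside the function passed to apply(), so a failure in one column does not abort the conversion of the others.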
The accepted answer with pd.to_numeric() converts to float as soon as that is needed. Read closely, the question is about converting any numeric column to integer. That is why the accepted answer would still need a loop over all columns to convert the numbers to int at the end.
Just for completeness, this is even possible without pd.to_numeric(); of course, this is not recommended:
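    import pandas as pd

    df = pd.DataFrame({'a': ['1', '2'], 'b': ['45.8', '73.9'], 'c': [10.5, 3.7]})
    for i in df.columns:
        try:
            # Go through float first so strings like '45.8' parse, then truncate to int.
            df[[i]] = df[[i]].astype(float).astype(int)
        except ValueError:
            # Leave columns that cannot be parsed as numbers unchanged.
            pass
    print(df.dtypes)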
Out:
    a    int32
    b    int32
    c    int32
    dtype: object
EDITED: Mind that this not-recommended solution is unnecessarily complicated; pd.to_numeric() can simply take the keyword argument downcast='integer' to return an integer dtype wherever the values allow it. Thank you for the comment. This is still missing from the accepted answer, though.
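A short sketch of what downcast='integer' does, using the same sample frame as above. One caveat worth checking on your own data: as far as I can tell, a column with fractional values such as '45.8' is not truncated but stays float, so this is not a drop-in replacement for astype(int). The variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'], 'b': ['45.8', '73.9'], 'c': [10.5, 3.7]})

# downcast is forwarded by apply() to pd.to_numeric for every column.
downcasted = df.apply(pd.to_numeric, downcast='integer')

# Whole numbers stored as floats can be downcast to an integer dtype.
whole = pd.to_numeric(pd.Series([1.0, 2.0]), downcast='integer')

print(downcasted.dtypes)
print(whole.dtype)
```

If truncation of fractional values is really what you want, an explicit astype(int) after the numeric conversion is still needed for those columns.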