Applying pandas.to_numeric
to a dataframe column which contains strings that represent numbers (and possibly other unparsable strings) results in an error message like this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-07383316d7b6> in <module>()
1 for column in shouldBeNumericColumns:
----> 2 trainData[column] = pandas.to_numeric(trainData[column])
/usr/local/lib/python3.5/site-packages/pandas/tools/util.py in to_numeric(arg, errors)
113 try:
114 values = lib.maybe_convert_numeric(values, set(),
--> 115 coerce_numeric=coerce_numeric)
116 except:
117 if errors == 'raise':
pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:53558)()
pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:53344)()
ValueError: Unable to parse string
Wouldn't it be helpful to see which value failed to parse?
errors: It can have three values that are 'ignore', 'raise', 'coerce'. The default value is 'raise'. If it is 'raise', then invalid parsing will set an exception.
Using “contains” to Find a Substring in a Pandas DataFrame The contains method in Pandas allows you to search a column for a specific substring. The contains method returns boolean values for the Series with True for if the original Series value contains the substring and False if not.
to_numeric. Convert argument to a numeric type. The default return dtype is float64 or int64 depending on the data supplied.
I think you can add parameter errors='coerce'
for convert bad non numeric values to NaN
, then check this values by isnull
and use boolean indexing
:
print (df[pd.to_numeric(df.col, errors='coerce').isnull()])
Sample:
df = pd.DataFrame({'B':['a','7','8'],
'C':[7,8,9]})
print (df)
B C
0 a 7
1 7 8
2 8 9
print (df[pd.to_numeric(df.B, errors='coerce').isnull()])
B C
0 a 7
Or if need find all string in mixed column - numerice with string values check type
of values if is string
:
df = pd.DataFrame({'B':['a',7, 8],
'C':[7,8,9]})
print (df)
B C
0 a 7
1 7 8
2 8 9
print (df[df.B.apply(lambda x: isinstance(x, str))])
B C
0 a 7
I have thought the very same thing, and I don't know if there's a better way, but my current workaround is to search for characters which aren't numbers or periods. This usually turns up the problem. There are cases where multiple periods can cause a problem, but I've found those are rare.
import pandas as pd
import re
non_numeric = re.compile(r'[^\d.]+')
df = pd.DataFrame({'a': [3,2,'NA']})
df.loc[df['a'].str.contains(non_numeric)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With