You could use np.isreal
to check the type of each element (applymap applies a function to each element in the DataFrame):
In [11]: df.applymap(np.isreal)
Out[11]:
a b
item
a True True
b True True
c True True
d False True
e True True
If all in the row are True then they are all numeric:
In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a True
b True
c True
d False
e True
dtype: bool
So to get the subDataFrame of rouges, (Note: the negation, ~, of the above finds the ones which have at least one rogue non-numeric):
In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
a b
item
d bad 0.4
You could also find the location of the first offender you could use argmin:
In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):
df.applymap(lambda x: isinstance(x, (int, float)))
Already some great answers to this question, however here is a nice snippet that I use regularly to drop rows if they have non-numeric values on some columns:
# Eliminate invalid data from dataframe (see Example below for more context)
num_df = (df.drop(data_columns, axis=1)
.join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
num_df = num_df[num_df[data_columns].notnull().all(axis=1)]
The way this works is we first drop
all the data_columns
from the df
, and then use a join
to put them back in after passing them through pd.to_numeric
(with option 'coerce'
, such that all non-numeric entries are converted to NaN
). The result is saved to num_df
.
On the second line we use a filter that keeps only rows where all values are not null.
Note that pd.to_numeric
is coercing to NaN
everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25'
will be recognized as the numeric value 1.25
.
Disclaimer: pd.to_numeric
was introduced in pandas version 0.17.0
Example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
...: "a": [1,2,3,"bad",5],
...: "b":[0.1,0.2,0.3,0.4,0.5]})
In [3]: df
Out[3]:
a b item
0 1 0.1 a
1 2 0.2 b
2 3 0.3 c
3 bad 0.4 d
4 5 0.5 e
In [4]: data_columns = ['a', 'b']
In [5]: num_df = (df
...: .drop(data_columns, axis=1)
...: .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
In [6]: num_df
Out[6]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
3 d NaN 0.4
4 e 5 0.5
In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
4 e 5 0.5
# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
Convert to numeric using 'coerce' which fills bad values with 'nan'
a = pd.to_numeric(df.a, errors='coerce')
Use isna to return a boolean index:
idx = a.isna()
Apply that index to the data frame:
df[idx]
Returns the row with the bad data in it:
a b
item
d bad 0.4
Sorry about the confusion, this should be the correct approach. Do you want only to capture 'bad'
only, not things like 'good'
; Or just any non-numerical values?
In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
In case you are working with a column with string values, you can use THE VERY USEFUL function series.str.isnumeric() like:
a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])
What i do is to copy that column to new column, and do a str.replace('.','') and str.replace(',','') then i select the numeric values. and:
a = a.str.replace('.','')
a = a.str.replace(',','')
a.str.isnumeric()
Out[15]: 0 False 1 False 2 True 3 True 4 True 5 True 6 True 7 True dtype: bool
Good luck all!
I'm thinking something like, just give an idea, to convert the column to string, and work with string is easier. however this does not work with strings containing numbers, like bad123
. and ~
is taking the complement of selection.
df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)
and using '|'.join([str(i) for i in range(10)])
to generate '0|1|...|8|9'
or using np.isreal()
function, just like the most voted answer
df[~df['a'].apply(lambda x: np.isreal(x))]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With