I have a CSV file with four columns. I read it like this:
df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])
Now, field C contains string values, but some rows hold non-string values (floats or ints) instead. How do I drop those rows? I'm using pandas version 0.18.1.
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df)
A B C D
0 a b c d
1 e f 1.2 g
Notice that you can inspect the individual cell types:
print(type(df.loc[0, 'C']), type(df.loc[1, 'C']))
<class 'str'> <class 'float'>
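If you want a quick survey of the types present across the whole column, a small sketch (my addition using map and value_counts, not part of the original answer):

print(df['C'].map(type).value_counts())  # counts of each Python type found in C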
Mask and slice:
print(df.loc[df.C.apply(type) != float])
A B C D
0 a b c d
More general:
print(df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))])
A B C D
0 a b c d
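If the goal is specifically to keep only rows where C actually holds a string, the same mask idea can be phrased positively (my variant, not from the original answer):

print(df.loc[df.C.apply(lambda x: isinstance(x, str))])

  A  B  C  D
0  a  b  c  d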
You could also attempt the conversion with float itself, to determine whether a value can be parsed as a float:
def try_float(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

print(df.loc[~df.C.apply(try_float)])
A B C D
0 a b c d
The problem with this approach is that you'll exclude strings that can be interpreted as floats.
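For example (an illustration I'm adding, not from the original answer), the string '1.2' converts cleanly to float, so try_float flags it and the row is excluded even though the cell is a str:

df2 = pd.DataFrame([['a', 'b', '1.2', 'd']], columns=list('ABCD'))
print(df2.loc[~df2.C.apply(try_float)])  # empty DataFrame: the str '1.2' was excluded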
Comparing times for the few options I've provided, along with jezrael's solution, on small DataFrames and on a DataFrame with 500,000 rows: checking whether the cell's type is float is the most performant, with the to_numeric approach right behind it. If you need to check for both int and float, I'd go with jezrael's answer. If you can get away with checking for float, use that one.
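A minimal sketch of how such a timing comparison could be reproduced (my setup, assuming a 500,000-row DataFrame built by repeating the toy frame; the original benchmark code is not shown):

import timeit

big = pd.concat([df] * 250000, ignore_index=True)  # 500,000 rows

# type check vs. to_numeric-based mask
print(timeit.timeit(lambda: big.loc[big.C.apply(type) != float], number=10))
print(timeit.timeit(lambda: big.loc[pd.to_numeric(big.C, errors='coerce').isnull()], number=10))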
You can use boolean indexing with a mask created by to_numeric with parameter errors='coerce': you get NaN where the string values are. Then check isnull:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': ['a', 8, 9],
                   'D': [1, 3, 5]})
print (df)
A B C D
0 1 4 a 1
1 2 5 8 3
2 3 6 9 5
print (pd.to_numeric(df.C, errors='coerce'))
0 NaN
1 8.0
2 9.0
Name: C, dtype: float64
print (pd.to_numeric(df.C, errors='coerce').isnull())
0 True
1 False
2 False
Name: C, dtype: bool
print (df[pd.to_numeric(df.C, errors='coerce').isnull()])
A B C D
0 1 4 a 1
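Applied back to the question's four-column DataFrame, dropping the rows where C is numeric would then be (a sketch; note that, as discussed above, numeric-looking strings such as '1.2' would also be dropped):

df = df[pd.to_numeric(df['C'], errors='coerce').isnull()]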