
How to drop rows not containing string type in a column in Pandas?

I have a csv file with four columns. I read it like this:

df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])

Now, column C should contain string values, but some rows hold non-string (float or numeric) values instead. How do I drop those rows? I'm using Pandas version 0.18.1.

asked Jun 29 '16 by Harsh Wardhan


2 Answers

Setup

import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df)

   A  B    C  D
0  a  b    c  d
1  e  f  1.2  g

Notice that you can inspect the type of each individual cell:

print(type(df.loc[0, 'C']), type(df.loc[1, 'C']))

<class 'str'> <class 'float'>

mask and slice

print(df.loc[df.C.apply(type) != float])

   A  B  C  D
0  a  b  c  d

more general

print(df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))])

   A  B  C  D
0  a  b  c  d

You could also attempt a float() conversion to determine whether the value can be parsed as a float:

def try_float(x):
    # return True if x can be parsed as a float, False otherwise
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

print(df.loc[~df.C.apply(try_float)])

   A  B  C  D
0  a  b  c  d

The problem with this approach is that you'll exclude strings that can be interpreted as floats.
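
For example (a minimal sketch reusing try_float from above; df2 is my own illustration, not from the original answer), a string like '1.2' parses as a float and is excluded, even though its actual type is str:

# hypothetical frame whose C value is the *string* '1.2'
df2 = pd.DataFrame([['a', 'b', '1.2', 'd']], columns=list('ABCD'))

print(df2.loc[~df2.C.apply(try_float)])     # empty: '1.2' parses as a float
print(df2.loc[df2.C.apply(type) != float])  # keeps the row: its type is str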

Comparing timings for the options I've provided, plus jezrael's solution, on small dataframes:

[timing comparison plot: small dataframes]

For a dataframe with 500,000 rows:

[timing comparison plot: 500,000 rows]

Checking whether the type is float appears to be the most performant, with the to_numeric check right behind it. If you need to catch both int and float, I'd go with jezrael's answer. If you can get away with checking for float alone, use the type check.
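
If you want to reproduce the comparison, here is a minimal benchmarking sketch using Python's timeit module (the frame size, float mix, and number=10 are my own choices, not the original benchmark setup):

import timeit

import pandas as pd

# build a test frame: mostly strings in column C, with some floats mixed in
n = 500000
c = ['text'] * n
for i in range(0, n, 10):  # make every tenth value a float
    c[i] = 1.2
df = pd.DataFrame({'A': 'a', 'B': 'b', 'C': c, 'D': 'd'})

stmts = {
    'type check': "df.loc[df.C.apply(type) != float]",
    'isinstance': "df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))]",
    'to_numeric': "df[pd.to_numeric(df.C, errors='coerce').isnull()]",
}
for name, stmt in stmts.items():
    print(name, timeit.timeit(stmt, number=10, globals={'df': df, 'pd': pd}))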

answered by piRSquared


You can use boolean indexing with a mask created by to_numeric with the parameter errors='coerce': you get NaN wherever the value is a string. Then check isnull:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print(df)
   A  B  C  D
0  1  4  a  1
1  2  5  8  3
2  3  6  9  5

print(pd.to_numeric(df.C, errors='coerce'))
0    NaN
1    8.0
2    9.0
Name: C, dtype: float64

print(pd.to_numeric(df.C, errors='coerce').isnull())
0     True
1    False
2    False
Name: C, dtype: bool

print(df[pd.to_numeric(df.C, errors='coerce').isnull()])
   A  B  C  D
0  1  4  a  1
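
Applied to the question's dataframe, where column C should keep only string values, this becomes a one-liner (with the caveat noted above that numeric-looking strings are also dropped):

# keep only rows whose C value cannot be parsed as a number
df = df[pd.to_numeric(df.C, errors='coerce').isnull()]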
answered by jezrael