
How to drop rows not containing string type in a column in Pandas?

I have a csv file with four columns. I read it like this:

df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])

Now, column C should contain string values, but some rows hold non-string (float or numeric) values instead. How do I drop those rows? I'm using Pandas version 0.18.1.

asked Jun 29 '16 by Harsh Wardhan


2 Answers

Setup

import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df)

   A  B    C  D
0  a  b    c  d
1  e  f  1.2  g

Notice that you can inspect the type of each individual cell:

print(type(df.loc[0, 'C']), type(df.loc[1, 'C']))

<class 'str'> <class 'float'>

mask and slice

print(df.loc[df.C.apply(type) != float])

   A  B  C  D
0  a  b  c  d

more general

print(df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))])

   A  B  C  D
0  a  b  c  d

You could also attempt a float() conversion to determine whether the value can be parsed as a float:

def try_float(x):
    # return True if x can be parsed as a float, False otherwise
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

print(df.loc[~df.C.apply(try_float)])

   A  B  C  D
0  a  b  c  d

The problem with this approach is that you'll exclude strings that can be interpreted as floats.
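
For example (a minimal sketch reusing try_float from above; df2 is my own illustration, not from the original answer), a string like '1.2' parses as a float and is excluded, even though its actual type is str:

# hypothetical frame whose C value is the *string* '1.2'
df2 = pd.DataFrame([['a', 'b', '1.2', 'd']], columns=list('ABCD'))

print(df2.loc[~df2.C.apply(try_float)])     # empty: '1.2' parses as a float
print(df2.loc[df2.C.apply(type) != float])  # keeps the row: its type is str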

Comparing timings for the options I've provided, plus jezrael's solution, on small dataframes:

[timing comparison plot: small dataframes]

For a dataframe with 500,000 rows:

[timing comparison plot: 500,000 rows]

Checking whether the type is float appears to be the most performant, with the to_numeric check right behind it. If you need to catch both int and float, I'd go with jezrael's answer. If you can get away with checking for float alone, use the type check.
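
If you want to reproduce the comparison, here is a minimal benchmarking sketch using Python's timeit module (the frame size, float mix, and number=10 are my own choices, not the original benchmark setup):

import timeit

import pandas as pd

# build a test frame: mostly strings in column C, with some floats mixed in
n = 500000
c = ['text'] * n
for i in range(0, n, 10):  # make every tenth value a float
    c[i] = 1.2
df = pd.DataFrame({'A': 'a', 'B': 'b', 'C': c, 'D': 'd'})

stmts = {
    'type check': "df.loc[df.C.apply(type) != float]",
    'isinstance': "df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))]",
    'to_numeric': "df[pd.to_numeric(df.C, errors='coerce').isnull()]",
}
for name, stmt in stmts.items():
    print(name, timeit.timeit(stmt, number=10, globals={'df': df, 'pd': pd}))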

answered by piRSquared


You can use boolean indexing with a mask created by to_numeric with the parameter errors='coerce': you get NaN wherever the value is a string. Then check isnull:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print(df)
   A  B  C  D
0  1  4  a  1
1  2  5  8  3
2  3  6  9  5

print(pd.to_numeric(df.C, errors='coerce'))
0    NaN
1    8.0
2    9.0
Name: C, dtype: float64

print(pd.to_numeric(df.C, errors='coerce').isnull())
0     True
1    False
2    False
Name: C, dtype: bool

print(df[pd.to_numeric(df.C, errors='coerce').isnull()])
   A  B  C  D
0  1  4  a  1
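
Applied to the question's dataframe, where column C should keep only string values, this becomes a one-liner (with the caveat noted above that numeric-looking strings are also dropped):

# keep only rows whose C value cannot be parsed as a number
df = df[pd.to_numeric(df.C, errors='coerce').isnull()]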
answered by jezrael