
Pandas: cannot filter based on string equality

Using pandas 0.16.2 on Python 2.7, OS X.

I read a DataFrame from a CSV file like this:

import pandas as pd

data = pd.read_csv("my_csv_file.csv", sep='\t', skiprows=0, header=0)

The output of data.dtypes is:

name       object
weight     float64
ethnicity  object
dtype: object

I was expecting string types for name and ethnicity, but I found explanations here on SO of why they are "object" in newer pandas versions.

Now, I want to select rows based on ethnicity, for example:

data[data['ethnicity']=='Asian']
Out[3]: 
Empty DataFrame
Columns: [name, weight, ethnicity]
Index: []

I get the same result with data[data.ethnicity=='Asian'] or data[data['ethnicity']=="Asian"].

But when I try the following:

data[data['ethnicity'].str.contains('Asian')].head(3)

I get the results I want.

However, I do not want to use "contains"; I would like to check for direct equality.

Please note that data[data['ethnicity'].str=='Asian'] raises an error.

Am I doing something wrong? How do I do this correctly?

asked Jul 08 '15 21:07 by vpk



1 Answer

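There is probably leading or trailing whitespace in your strings. That would explain why str.contains('Asian') matches while == 'Asian' does not. For example,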

data = pd.DataFrame({'ethnicity':[' Asian', '  Asian']})
data.loc[data['ethnicity'].str.contains('Asian'), 'ethnicity'].tolist()
# [' Asian', '  Asian']
print(data[data['ethnicity'].str.contains('Asian')])

yields

  ethnicity
0     Asian
1     Asian

To strip the leading or trailing whitespace off the strings, you could use

data['ethnicity'] = data['ethnicity'].str.strip()

after which,

data.loc[data['ethnicity'] == 'Asian']

yields

  ethnicity
0     Asian
1     Asian
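
If you would rather not overwrite the column, the same idea can be sketched as a check followed by a comparison on the stripped values. The sample frame below is hypothetical and only stands in for the CSV contents; the names raw_values, mask and asian_rows are illustrative.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV contents (an assumption,
# not the asker's real file).
data = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'weight': [60.0, 70.0, 80.0],
    'ethnicity': [' Asian', 'Asian  ', 'Caucasian'],
})

# Listing the raw values shows the quotes, so stray whitespace that the
# default DataFrame display hides becomes visible.
raw_values = data['ethnicity'].unique().tolist()
print(raw_values)  # [' Asian', 'Asian  ', 'Caucasian']

# Compare against the stripped values; the original column is left unchanged.
mask = data['ethnicity'].str.strip() == 'Asian'
asian_rows = data[mask]
print(asian_rows)
```

Here the plain data['ethnicity'] == 'Asian' comparison would match no rows, while the stripped comparison selects the first two.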

answered Oct 26 '22 12:10 by unutbu