Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas filter rows by two column values with case insenstive

Tags:

python

pandas

I have a simple dataframe as follows:

Last Known Date ConfigredValue  ReferenceValue
0   24-Jun-17   False   FALSE
1   25-Jun-17   FALSE   FALSE
2   26-Jun-17   TRUE    FALSE
3   27-Jun-17   FALSE   FALSE
4   28-Jun-17   false   FALSE

If I do the following command

df=df[df['ConfigredValue']!=dfs['ReferenceValue']]

then I get as below

0   24-Jun-17   False   FALSE
2   26-Jun-17   TRUE    FALSE
4   28-Jun-17   false   FALSE

But I want the filter with case insensitive (case=False)

I want following output:

2   26-Jun-17   TRUE    FALSE

Please suggest, how to get filtered case insensitive data(case=false)

like image 832
Python Spark Avatar asked Dec 14 '22 20:12

Python Spark


2 Answers

Option 1: convert to lowercase or to uppercase and compare

The simplest is to convert the two columns to lower (or to upper) before checking for equality:

df=df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]

or

df=df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2    2  26-Jun-17           TRUE          FALSE

Option 2: Compare the lengths

In this particuler case, you can simply compare the lengths of TRUE and True, they are the same wether the string is upper or lower case:

df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2    2  26-Jun-17           TRUE          FALSE

Option 3: Vectorized title

str.title() was also suggested in @0p3n5ourcE answer, here's a vectorized version of it:

df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]

Execution time

Benchmarking the speed shows that str.len() is a bit faster

In [35]: timeit df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]
1000 loops, best of 3: 496 µs per loop

In [36]: timeit df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]
1000 loops, best of 3: 496 µs per loop

In [37]: timeit df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]
1000 loops, best of 3: 495 µs per loop

In [38]: timeit df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]
1000 loops, best of 3: 479 µs per loop
like image 141
Mohamed Ali JAMAOUI Avatar answered Dec 16 '22 08:12

Mohamed Ali JAMAOUI


Better replace existing false with 'FALSE' with case = False parameter ie

df['ConfigredValue'] = df['ConfigredValue'].str.replace('false','FALSE',case=False)

df=df[df['ConfigredValue']!=df['ReferenceValue']]

Output:

   Last Known_Date ConfigredValue ReferenceValue
2     2  26-Jun-17           TRUE          FALSE
like image 33
Bharath Avatar answered Dec 16 '22 08:12

Bharath