Comparing a dataframe on string lengths for different columns

I am trying to get the string lengths for different columns. Seems quite straightforward with:

df['a'].str.len()

But I need to apply it to multiple columns and then get the minimum.

Something like:

df[['a','b','c']].str.len().min

I know the above doesn't work, but hopefully you get the idea. Columns a, b and c all contain names, and I want to retrieve the shortest name.

Also, because the data is huge, I want to avoid creating additional columns to save memory.

asked Sep 29 '16 04:09 by BernardL



1 Answer

I think you need a list comprehension, because the .str string methods work only on a Series (a single column), not on a DataFrame:

print ([df[col].str.len().min() for col in ['a','b','c']])

Another solution with apply:

print ([df[col].apply(len).min() for col in ['a','b','c']])

Sample:

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})

print (df)

     a     b      c  d
0    h    st  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]

Timings:

#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop

In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
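
To reproduce this comparison outside IPython, here is a minimal sketch using the standard-library timeit module. The helper names with_apply and with_str_len are mine, not part of the original timings, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
# Repeat the sample 1000 times to get 3000 rows, as in the timings above.
df = pd.concat([df] * 1000).reset_index(drop=True)
cols = ['a', 'b', 'c']


def with_apply():
    # Python-level len() applied to every element of each column.
    return [df[col].apply(len).min() for col in cols]


def with_str_len():
    # pandas string accessor; returns NaN instead of raising on missing values.
    return [df[col].str.len().min() for col in cols]


runs = 10
# Best of 3 repeats, reported as seconds per single call.
t_apply = min(timeit.repeat(with_apply, number=runs, repeat=3)) / runs
t_str_len = min(timeit.repeat(with_str_len, number=runs, repeat=3)) / runs
print('apply(len): %.2f ms' % (t_apply * 1000))
print('str.len():  %.2f ms' % (t_str_len * 1000))
```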

Conclusion:

apply is slightly faster, but it does not work if a column contains None.

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})


print (df)
     a     b      c  d
0    h  None  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].apply(len).min() for col in ['a','b','c']])

TypeError: object of type 'NoneType' has no len()

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
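
If you want one shortest name across all three columns rather than one minimum per column, the per-column results can be reduced further without adding anything to df. This is only a sketch, assuming each column holds at least one non-missing string; the names per_column_min, candidates and shortest_name are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
cols = ['a', 'b', 'c']

# Shortest length per column. .str.len() gives NaN for None,
# and .min() skips NaN by default.
per_column_min = {col: df[col].str.len().min() for col in cols}

# Shortest length over all three columns.
overall_min = min(per_column_min.values())

# Shortest string in each column (idxmin also skips NaN),
# then the shortest of those three candidates.
candidates = [df.at[df[col].str.len().idxmin(), col] for col in cols]
shortest_name = min(candidates, key=len)

print(overall_min)
print(repr(shortest_name))
```

Only small temporary Series of lengths are created, one column at a time, so no extra columns are stored on the DataFrame.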

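EDIT (in response to a comment), for the minimum length per row:

# fails if any value is None (applymap was renamed to DataFrame.map in pandas 2.1)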
print (df[['a','b','c']].applymap(len).min(axis=1))
0    1
1    0
2    2
dtype: int64

# also works when values are None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0    1
1    0
2    2
dtype: int64
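
A possible alternative for the row-wise case is to compute the lengths one column at a time with .str.len() and then take min(axis=1), which avoids calling a Python lambda for every row. This is a sketch under the assumption that a temporary DataFrame of lengths fits in memory; the names lengths and row_min are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

# One .str.len() call per column; None becomes NaN instead of raising.
lengths = df[['a', 'b', 'c']].apply(lambda col: col.str.len())

# Row-wise minimum; NaN values are skipped by default.
row_min = lengths.min(axis=1)
print(row_min)
```

Because column b contains a missing value, its lengths are stored as floats, so the result here is a float Series rather than an integer one.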

answered Sep 24 '22 16:09 by jezrael