Comparing a dataframe on string lengths for different columns

Tags:

I am trying to get the string lengths for different columns. Seems quite straightforward with:

df['a'].str.len()

But I need to apply it to multiple columns. And then get the minimum on it.

Something like:

df[['a','b','c']].str.len().min

I know the above doesn't work, but hopefully you get the idea. Column a, b, c all contain names and I want to retrieve the shortest name.

Also because of huge data, I am avoiding creating other columns to save on size.

295

asked Sep 29 '16 04:09

BernardL

1 Answers

I think you need list comprehension, because string function works only with Series (column):

print ([df[col].str.len().min() for col in ['a','b','c']])

Another solution with apply:

print ([df[col].apply(len).min() for col in ['a','b','c']])

Sample:

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})

print (df)

     a     b      c  d
0    h    st  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]

Timings:

#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop

In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop

Conclusion:

apply is faster, but not works with None.

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})


print (df)
     a     b      c  d
0    h  None  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].apply(len).min() for col in ['a','b','c']])

TypeError: object of type 'NoneType' has no len()

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]

EDIT by comment:

#fail with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0    1
1    0
2    2
dtype: int64

#working with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0    1
1    0
2    2
dtype: int64

175

answered Sep 24 '22 16:09

jezrael

Related questions
                            
                                Draggable split-pane windows in flexbox can't get past child elements [duplicate]
                            
                                How to delay each ngFor items animation
                            
                                Concourse CI and Build number
                            
                                'No such file or directory' error after submitting a training job
                            
                                Custom UnmodifiableSetMixin Fails in Jackson 2.7+
                            
                                Object does not name a type - C++
                            
                                How to find the kth smallest element in an unsorted array or its segment?
                            
                                How to create a R data frame by row
                            
                                Why does the MIPS stack pointer need to be kept double word aligned?
                            
                                onClick method in fragment never gets called
                            
                                Scrapy (python) TypeError: unhashable type: 'list'
                            
                                Lazily create dask dataframe from generator

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With