Is there a faster way to find the length of the longest string in a Pandas DataFrame than what's shown in the example below?
import numpy as np import pandas as pd x = ['ab', 'bcd', 'dfe', 'efghik'] x = np.repeat(x, 1e7) df = pd.DataFrame(x, columns=['col1']) print df.col1.map(lambda x: len(x)).max() # result --> 6
It takes about 10 seconds to run df.col1.map(lambda x: len(x)).max()
when timing it with IPython's %timeit
.
To find the length of strings in a data frame you have the len method on the dataframes str property. But to do this you need to call this method on the column that contains the string data.
Use Python's built-in max() function with a key argument to find the longest string in a list. Call max(lst, key=len) to return the longest string in lst using the built-in len() function to associate the weight of each string—the longest string will be the maximum.
DSM's suggestion seems to be about the best you're going to get without doing some manual microoptimization:
%timeit -n 100 df.col1.str.len().max() 100 loops, best of 3: 11.7 ms per loop %timeit -n 100 df.col1.map(lambda x: len(x)).max() 100 loops, best of 3: 16.4 ms per loop %timeit -n 100 df.col1.map(len).max() 100 loops, best of 3: 10.1 ms per loop
Note that explicitly using the str.len()
method doesn't seem to be much of an improvement. If you're not familiar with IPython, which is where that very convenient %timeit
syntax comes from, I'd definitely suggest giving it a shot for quick testing of things like this.
Update Added screenshot:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With