I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I have also tried using apply() with a lambda and str.split() to produce the same result. Using %timeit, I'm finding that the apply() version runs faster than the vectorized one.
Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:
id facility
0 3466 abc~24353
1 4853 facility1~3.4.5.6
2 4582 53434_Facility~34432~cde
3 9972 facility2~FACILITY2~343
4 2356 Test~23 ~FAC1
The above dataframe has about 500,000 rows, and I have also tested at around 1 million rows with similar results. Here is some example input and output:
Vectorization
In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Lambda Apply
In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
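For anyone who wants to reproduce this at scale, a frame of the same shape can be built synthetically; the strings below are made up and only mimic the pattern above:

import pandas as pd

# Synthetic stand-in for the real data: ~500,000 rows of "token~token~token".
n = 500_000
df = pd.DataFrame({
    'id': range(n),
    'facility': ['facility%d~%d~extra' % (i, i % 97) for i in range(n)],
})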
Does anyone know why I am getting this behavior?
Thanks!
If your function is I/O bound, meaning that it spends most of its time waiting for data (e.g. making API requests over the internet), then multithreading (or a thread pool) will be the better and faster option.
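A minimal sketch of that idea, assuming a hypothetical fetch() that waits on the network (the URLs and the function are placeholders, not from the question):

from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    # Each call mostly waits on the network, so threads can overlap the waits.
    return requests.get(url, timeout=10).status_code

urls = ['https://example.com'] * 20  # placeholder workload
with ThreadPoolExecutor(max_workers=8) as pool:
    status_codes = list(pool.map(fetch, urls))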
Cython (writing C extensions for pandas): for many use cases, writing pandas code in pure Python and NumPy is sufficient. In some computationally heavy applications, however, it is possible to achieve sizable speed-ups by offloading work to Cython.
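As a minimal sketch of what that looks like in an IPython/Jupyter session (the function name is mine, and a trivial wrapper like this won't gain much; real speed-ups come from pushing more of the loop into C):

%load_ext cython

%%cython
def split_first(str s):
    # Typed argument lets Cython skip some Python-level checks.
    return s.split('~')[0]

%timeit df['facility'].map(split_first)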
Perhaps the most important rule of thumb is to avoid explicit loops in pandas code. Looping over a Series or a DataFrame processes data one item (or one row/column) at a time. Instead, operations should be vectorized, meaning an operation is performed on the entire Series or DataFrame column at once.
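A toy illustration of the difference, reusing the id column from the frame above:

# Loop: touches one row at a time through Python-level machinery (slow).
doubled = [row['id'] * 2 for _, row in df.iterrows()]

# Vectorized: a single NumPy operation over the whole column (fast).
doubled = df['id'] * 2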
Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string problems (especially regex problems) are inherently difficult (impossible?) to parallelize. If you really want speed, you should fall back to pure Python here.
%timeit df['facility'].str.split('~', n=1).str[0]
411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
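One caveat the timings above don't cover: the list comprehension assumes every value is a string. If the column can contain NaN, a guard along these lines is needed:

[x.split('~', 1)[0] if isinstance(x, str) else x for x in df['facility'].tolist()]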
For more information on when loops are faster than pandas functions, take a look at For loops with pandas - When should I care?
As for why apply is faster: I'm of the belief that the function apply is calling (i.e., the plain Python str.split) is a lot more lightweight than the string splitting happening in the bowels of Series.str.split.
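If you want to check where that time goes yourself, IPython's profiler makes the overhead visible (output omitted here; this just names the tool):

%prun -l 10 df['facility'].str.split('~', n=1).str[0]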