I'm comparing two dataframes to determine whether each name in df1 is a prefix of any name in df2. df1 has on the order of a thousand entries; df2 has millions.
This does the job but is rather slow.
df1['name'].map(lambda x: any(df2['name'].str.startswith(x)))
When run on a subset of df1 (10 items), this is the result:
35243 True
39980 False
40641 False
45974 False
53788 False
59895 True
61856 False
81083 True
83054 True
87717 False
Name: name, dtype: bool
Time: 57.8873581886 secs
When I convert df2['name'] to a list, it runs much faster:
df2_list = df2['name'].tolist()
df1['name'].map(lambda x: any(item.startswith(x + ' ') for item in df2_list))
35243 True
39980 False
40641 False
45974 False
53788 False
59895 True
61856 False
81083 True
83054 True
87717 False
Name: name, dtype: bool
Time: 33.0746209621 secs
Why is it quicker to iterate through a list than a Series?
I wrote a short script to benchmark built-in list comprehensions and map against the pandas apply and map methods. Conclusion: built-in map and list comprehensions are much faster than the pandas methods.
NumPy arrays are faster than DataFrames for ordinary mathematical operations.
any() returns early as soon as it gets a True value, so the list version makes fewer startswith() calls than the Series version, where str.startswith() tests every row before any() sees the result.
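The short-circuit effect can be seen by counting calls. The following is a minimal sketch on made-up data (the `names` list and the `counted_startswith` helper are hypothetical, not part of the question); the list comprehension stands in for the vectorized `str.startswith`, which likewise evaluates every element first.

```python
# Hypothetical toy data, not the asker's frames.
names = ["alpha one", "beta two", "alpha three", "gamma four"] * 1000

calls = 0

def counted_startswith(item, prefix):
    """Wrapper around str.startswith that counts how often it is invoked."""
    global calls
    calls += 1
    return item.startswith(prefix)

# Generator expression: any() stops consuming at the first True.
calls = 0
lazy_result = any(counted_startswith(n, "alpha") for n in names)
lazy_calls = calls

# List comprehension: every element is tested before any() sees the results.
calls = 0
eager_result = any([counted_startswith(n, "alpha") for n in names])
eager_calls = calls

print(lazy_result, lazy_calls)
print(eager_result, eager_calls)
```

Both forms give the same answer, but the lazy form stops after the first match while the eager form tests all 4000 strings. When no match exists, both have to scan everything, so the saving depends on how often and how early a prefix matches.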
Here is a method that uses searchsorted():
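import random
import string
import numpy as np
import pandas as pd
def randomword(length):
    # Random lowercase string of the given length.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))
xs = pd.Series([randomword(3) for _ in range(1000)])
ys = pd.Series([randomword(10) for _ in range(10000)])
def is_any_prefix1(xs, ys):
    # Sort ys: if any y starts with x, then the first y >= x does too.
    yo = ys.sort_values().reset_index(drop=True)
    # Clip the positions so an x larger than every y cannot index past the end.
    pos = np.minimum(yo.searchsorted(xs), len(yo) - 1)
    return np.fromiter(map(str.startswith, yo.iloc[pos], xs), dtype=bool)
def is_any_prefix2(xs, ys):
    # Brute force: test each x against every y, stopping at the first match.
    x = xs.tolist()
    y = ys.tolist()
    return np.fromiter((any(yi.startswith(xi) for yi in y) for xi in x), dtype=bool)
res1 = is_any_prefix1(xs, ys)
res2 = is_any_prefix2(xs, ys)
print(np.all(res1 == res2))
# IPython magics; run these two lines in IPython or Jupyter.
%timeit is_any_prefix1(xs, ys)
%timeit is_any_prefix2(xs, ys)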
output:
True
100 loops, best of 3: 17.8 ms per loop
1 loop, best of 3: 2.35 s per loop
That is over 100x faster (17.8 ms vs. 2.35 s).
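The reason one lookup per prefix is enough: in a sorted list, every string that starts with x sorts at or after x, and nothing that does not start with x can sit between x and such a string. So only the first string >= x needs to be checked. Below is a rough pure-Python sketch of the same idea using the standard-library `bisect` module instead of pandas; the function name and the sample data are made up for illustration.

```python
from bisect import bisect_left

def is_any_prefix(prefixes, words):
    """For each prefix, report whether any word starts with it."""
    ordered = sorted(words)
    result = []
    for p in prefixes:
        i = bisect_left(ordered, p)  # position of the first word >= p
        # Past the end means every word sorts before p, so nothing can match.
        result.append(i < len(ordered) and ordered[i].startswith(p))
    return result

# Hypothetical sample data.
words = ["apple pie", "banana split", "cherry tart"]
prefixes = ["apple", "ban", "cat", "zebra"]
print(is_any_prefix(prefixes, words))
```

Sorting costs O(m log m) once for the m words, and each prefix then costs one binary search, compared with up to m startswith() calls per prefix in the brute-force version. To reproduce the `x + ' '` variant from the question, the same function could be called with a space appended to each prefix.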