Pandas Dataframe performance vs list performance

Tags: python, pandas

I'm comparing two dataframes to determine whether each name in df1 is a prefix of any name in df2. df1 has on the order of a thousand entries; df2 has millions.

This does the job but is rather slow.

df1['name'].map(lambda x: any(df2['name'].str.startswith(x)))

When run on a subset of df1 (10 items), this is the result:

35243     True
39980    False
40641    False
45974    False
53788    False
59895     True
61856    False
81083     True
83054     True
87717    False
Name: name, dtype: bool
Time: 57.8873581886 secs

When I convert df2['name'] to a list, it runs much faster:

df2_list = df2['name'].tolist()

df1['name'].map(lambda x: any(item.startswith(x + ' ') for item in df2_list))

35243     True
39980    False
40641    False
45974    False
53788    False
59895     True
61856    False
81083     True
83054     True
87717    False
Name: name, dtype: bool
Time: 33.0746209621 secs

Why is it quicker to iterate through a list than a Series?

asked Sep 28 '16 at 00:09 by marie



1 Answer

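any() returns early as soon as it sees a True value, so the list version makes fewer startswith() calls than the Series version. In the Series version, df2['name'].str.startswith(x) computes a result for every row of df2 before any() runs at all.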
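To make the early-return point concrete, here is a minimal sketch in plain Python. It is an illustration only: the `starts` wrapper and the `names` data are made up, and the list comprehension stands in for the vectorized `.str.startswith`, which likewise produces every result before `any()` sees the first one.

```python
calls = 0

def starts(item, prefix):
    """str.startswith wrapper that counts how often it is called (hypothetical helper)."""
    global calls
    calls += 1
    return item.startswith(prefix)

names = ["alpha one", "beta two", "gamma three", "delta four"]  # made-up data
prefix = "beta"

# Eager: every result is built first, as a vectorized .str.startswith would do.
calls = 0
eager_result = any([starts(n, prefix) for n in names])
eager_calls = calls

# Lazy: the generator lets any() stop at the first match.
calls = 0
lazy_result = any(starts(n, prefix) for n in names)
lazy_calls = calls

print(eager_result, eager_calls)  # True 4
print(lazy_result, lazy_calls)    # True 2
```

Both forms give the same answer, but the lazy one stops after the second item. When no item matches, the two do the same number of calls, so the saving depends on how early a match appears.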

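Here is a method that uses searchsorted() instead. After sorting df2's names, the first name that sorts at or after a given prefix is the only one that needs to be checked: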

import random, string
import pandas as pd
import numpy as np

def randomword(length):
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))


xs = pd.Series([randomword(3) for _ in range(1000)])
ys = pd.Series([randomword(10) for _ in range(10000)])

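def is_any_prefix1(xs, ys):
    # Sort ys: every string starting with prefix p sorts at or after p itself.
    yo = ys.sort_values().reset_index(drop=True)
    # Position of the first sorted value >= each prefix (can equal len(yo)).
    pos = yo.searchsorted(xs)
    # Clip so a prefix larger than every value still indexes a valid row.
    y2 = yo.iloc[np.minimum(pos, len(yo) - 1)]
    return np.fromiter(map(str.startswith, y2, xs), dtype=bool)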

def is_any_prefix2(xs, ys):
    x = xs.tolist()
    y = ys.tolist()
    return np.fromiter((any(yi.startswith(xi) for yi in y) for xi in x), dtype=bool)

res1 = is_any_prefix1(xs, ys)
res2 = is_any_prefix2(xs, ys)
print(np.all(res1 == res2))

%timeit is_any_prefix1(xs, ys)
%timeit is_any_prefix2(xs, ys)

output:

True
100 loops, best of 3: 17.8 ms per loop
1 loop, best of 3: 2.35 s per loop

That is roughly 130x faster on this test data (17.8 ms vs. 2.35 s).
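The reason one lookup per prefix is enough: in sorted order, all strings that start with a given prefix sit together, beginning at the first string that is greater than or equal to the prefix. The sketch below shows the same idea on a tiny made-up list, using the standard library's `bisect_left` in place of pandas' `searchsorted`. The function name `is_any_prefix` and the sample data are illustrative assumptions, not part of the code above.

```python
from bisect import bisect_left

def is_any_prefix(prefixes, names):
    """For each prefix, report whether any name starts with it."""
    ordered = sorted(names)
    result = []
    for p in prefixes:
        i = bisect_left(ordered, p)  # first position with ordered[i] >= p
        # If any name starts with p, the one at position i does.
        # i == len(ordered) means p sorts after every name, so nothing matches.
        result.append(i < len(ordered) and ordered[i].startswith(p))
    return result

names = ["carrot", "apple", "banana", "apricot"]  # made-up data
print(is_any_prefix(["ap", "b", "z", "cat"], names))
# [True, True, False, False]
```

Sorting costs O(m log m) once, and each prefix then needs one binary search and one startswith() call, instead of a scan over all m names.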

answered Oct 13 '22 at 02:10 by HYRY