 

Faster way to look for a value in pandas DataFrame?

I'm trying to "translate" some of my R scripts to Python, but I've noticed that working with data frames in Python is tremendously slower than in R, e.g. extracting cells according to some condition.

I've done a little investigating; this is how long it takes to look for a specific value in Python:

import pandas as pd
from timeit import default_timer as timer

code = 145896

# real df is way bigger
df = pd.DataFrame(data={
    'code1': [145896, 800175, 633974, 774521, 416109],
    'code2': [100, 800, 600, 700, 400],
    'code3': [1, 8, 6, 7, 4]}
    )

start = timer()
for _ in range(100000):
    desired = df.loc[df['code1']==code, 'code2'][0]
print(timer() - start) # 19.866242500000226 (sec)

and in R:

code <- 145896

df <- data.frame("code1" = c(145896, 800175, 633974, 774521, 416109),
           "code2" = c(100, 800, 600, 700, 400),
           "code3" = c(1, 8, 6, 7, 4))

start <- Sys.time()
for (i in 1:100000) {
  desired <- df[df$code1 == code, "code2"]
}
print(Sys.time() - start) # Time difference of 1.140949 secs

I'm relatively new to Python, and I'm probably missing something. Is there some way to speed up this process? Maybe the whole idea of porting this script to Python is pointless? In other operations Python is faster (namely working with strings), and it would be very inconvenient to jump between two or more scripts whenever working with data frames is required. Any help on this, please?

UPDATE: The real script block iterates over the rows of the initial data frame (which is fairly large, 500-1500k rows) and creates a new one whose rows contain the value from the original column "code1", the codes that correspond to it from another data frame, and many other newly created values. I believe this picture clarifies it: [diagram of the intended transformation omitted]

Later in the script I will also need to search for specific values in loops, based on different conditions, so the speed of the search is essential.

asked Dec 31 '22 by leonefamily

2 Answers

Since you are looking to select a single value from a DataFrame, there are a few things you can do to improve performance.

  1. Use .item() instead of [0], which gives a small but decent improvement, especially for smaller DataFrames.
  2. It's wasteful to mask the entire DataFrame just to then select a known Series. Instead, mask only the Series and select the value. Though you might think "oh, this is chained -- the forbidden ][", it's only chained assignment that is worrisome, not chained selection.
  3. Use numpy. pandas has a lot of overhead due to indexing and alignment, but here you just want to select a single value from a rectangular data structure, so dropping down to numpy will be faster. (All three are illustrated on the sample data right after this list.)
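As a quick sketch on the question's own sample data (each line returns 100; nothing here beyond the pandas/numpy calls named above):

import pandas as pd

code = 145896
df = pd.DataFrame(data={
    'code1': [145896, 800175, 633974, 774521, 416109],
    'code2': [100, 800, 600, 700, 400],
    'code3': [1, 8, 6, 7, 4]})

# 1. .item() instead of [0]
df.loc[df['code1'] == code, 'code2'].item()

# 2. mask only the Series, not the whole DataFrame
df['code2'][df['code1'] == code].item()

# 3. drop down to numpy arrays entirely
df['code2'].to_numpy()[df['code1'].to_numpy() == code].item()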

Below are timings for the different ways to select the data, each implemented as its own function in the benchmark below. Using numpy is by far the fastest, especially for a smaller DataFrame like in your sample. For those, it will be more than 20x faster than your original way to select data, which, looking at your initial comparison with R, should make it slightly faster than selecting data in R. As the DataFrames get larger the relative performance of the numpy solution isn't as good, but it's still the fastest method (shown here).

import perfplot
import pandas as pd
import numpy as np

def DataFrame_Slice(df, code=0):
    # original approach: mask the whole DataFrame, then select the column
    return df.loc[df['code1'] == code, 'code2'].iloc[0]

def DataFrame_Slice_Item(df, code=0):
    # same mask, but use .item() instead of positional selection
    return df.loc[df['code1'] == code, 'code2'].item()

def Series_Slice_Item(df, code=0):
    # mask only the 'code2' Series, not the entire DataFrame
    return df['code2'][df['code1'] == code].item()

def with_numpy(df, code=0):
    # drop down to numpy arrays to skip pandas' indexing/alignment overhead
    return df['code2'].to_numpy()[df['code1'].to_numpy() == code].item()


perfplot.show(
    setup=lambda N: pd.DataFrame({'code1': range(N),
                                  'code2': range(50, N+50),
                                  'code3': range(100, N+100)}),
    kernels=[
        lambda df: DataFrame_Slice(df),
        lambda df: DataFrame_Slice_Item(df),
        lambda df: Series_Slice_Item(df),
        lambda df: with_numpy(df)
    ],
    labels=['DataFrame_Slice', 'DataFrame_Slice_Item', 'Series_Slice_Item', 'with_numpy'],
    n_range=[2 ** k for k in range(1, 21)],
    equality_check=np.allclose,  
    relative_to=3,
    xlabel='len(df)'
)

[perfplot timing plot: with_numpy is fastest across all DataFrame sizes]

answered Jan 02 '23 by ALollz

You can cut the time roughly in half just by reusing the filter expression.

In [1]: import pandas as pd

In [2]: code = 145896
   ...: df = pd.DataFrame(data={
   ...:     'code1': [145896, 800175, 633974, 774521, 416109],
   ...:     'code2': [100, 800, 600, 700, 400],
   ...:     'code3': [1, 8, 6, 7, 4]
   ...: })

In [3]: %timeit df.loc[df['code1'] == code, 'code2'][0]
197 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: filter_expr = df['code1'] == code

In [5]: %timeit df.loc[filter_expr, 'code2'][0]
106 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Indexing the dataframe by the key column (assuming the lookups are frequent) should be the way to go, because the dataframe's index is a hash table (see this answer and these slides for more details).

In [6]: df_idx = df.set_index('code1')

In [7]: %timeit df_idx.loc[code]['code2']
72.7 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
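
As a side note (not part of the original measurements): for a label-based scalar lookup on the indexed frame, pandas also offers the dedicated .at accessor, which avoids the chained selection in df_idx.loc[code]['code2']. A minimal sketch, assuming the 'code1' values are unique:

# .at is pandas' scalar accessor; it requires a unique label
# and returns the single value directly
df_idx.at[code, 'code2']  # -> 100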

And depending on your other use cases, keeping the data in a true embedded (in-memory) database, such as SQLite or DuckDB (which can run queries directly on pandas data without ever importing or copying any of it), may also be a solution.
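
For instance, a minimal DuckDB sketch (whether this fits your setup is an assumption on my part; it relies on DuckDB's documented ability to query an in-scope pandas DataFrame by its variable name):

import duckdb
import pandas as pd

df = pd.DataFrame(data={
    'code1': [145896, 800175, 633974, 774521, 416109],
    'code2': [100, 800, 600, 700, 400],
    'code3': [1, 8, 6, 7, 4]})

# DuckDB scans the DataFrame `df` in place; no import or copy step
result = duckdb.query("SELECT code2 FROM df WHERE code1 = 145896").fetchall()
print(result)  # [(100,)]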

answered Jan 02 '23 by saaj