Slicing Pandas rows with string match slow

I basically want to learn a faster way to slice a Pandas dataframe conditionally, based on regex matching. For example, take the following df (there are more than 4 variations in the string columns; these are only for illustrative purposes):

index, string_col1, string_col2, value
0, 'apple', 'this', 10
1, 'pen', 'is', 123
2, 'pineapple', 'sparta', 20
3, 'pen pineapple apple pen', 'this', 234
4, 'apple', 'is', 212
5, 'pen', 'sparta', 50
6, 'pineapple', 'this', 69
7, 'pen pineapple apple pen', 'is',  79
8, 'apple pen', 'sparta again', 78
...
100000, 'pen pineapple apple pen', 'this is sparta', 392

I have to slice the dataframe with a Boolean mask built from regex matches on the string columns, find the indices of the minimum and maximum in the value column, and finally take the difference between those min and max values. I do this with the following method, but it's SUPER SLOW when I have to match many different regex patterns:

import re

pat1 = re.compile('apple')
pat2 = re.compile('sparta')
# boolean mask: rows where string_col1 matches pat1 AND string_col2 matches pat2
mask = (df['string_col1'].str.contains(pat1)) & (df['string_col2'].str.contains(pat2))
max_idx = df[mask].idxmax()
min_idx = df[mask].idxmin()
difference = df['value'].loc[max_idx] - df['value'].loc[min_idx]

I think to get one "difference" answer, I'm slicing the df too many times, but I can't figure out how to do it less. Furthermore, is there a faster way to slice it?

This is an optimization question since I know my code gets me what I need. Any tips will be appreciated!

asked Jul 20 '17 by Heavy Breathing

People also ask

Is Pandas apply faster than Iterrows?

By using apply and specifying axis=1, we can run a function on every row of a dataframe. This approach still loops to get the job done, but apply has been optimized better than iterrows, which results in faster runtimes.
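For illustration, a minimal comparison on a toy dataframe (the column names here are made up for the example):

import pandas as pd

df_toy = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# iterrows: a Python-level loop over (index, row) pairs -- usually the slowest option
total_iterrows = sum(row['a'] + row['b'] for _, row in df_toy.iterrows())

# apply with axis=1: still row-by-row, but with less per-row overhead than iterrows
total_apply = df_toy.apply(lambda row: row['a'] + row['b'], axis=1).sum()

# a fully vectorized expression is normally faster than both
total_vectorized = (df_toy['a'] + df_toy['b']).sum()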

How do you slice rows in Pandas?

Slicing rows and columns by index position: when slicing by index position in Pandas, the start index is included in the output, but the stop index is one step beyond the last row you want to select. So the slice [0:2] returns row 0 and row 1, but does not return row 2. A second slice of [:] indicates that all columns are required.
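A small example of position-based slicing matching that description (the column name is made up):

import pandas as pd

df_toy = pd.DataFrame({'x': [10, 20, 30]})

# the stop index is exclusive: this returns row 0 and row 1, but not row 2
print(df_toy.iloc[0:2])

# the second slice [:] selects all columns
print(df_toy.iloc[0:2, :])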

Why is Python Pandas so slow?

Pandas keeps track of data types and indexes, and performs error checking, all of which are very useful but also slow down calculations. NumPy doesn't do any of that, so it can perform the same calculations significantly faster.
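A rough sketch of the idea, dropping to the underlying NumPy array for a plain numeric reduction (timings will vary):

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

# summing the raw NumPy array skips the index and dtype bookkeeping a Series
# carries, which is often (though not always) measurably faster
pandas_sum = s.sum()
numpy_sum = s.to_numpy().sum()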


2 Answers

I've been trying to profile your example, but I'm actually getting pretty great performance on my synthetic data, so I may need some clarification. (Also, for some reason .idxmax() breaks for me whenever I have a string in my dataframe).

Here's my testing code:

import pandas as pd
import re
import numpy as np
import random
import IPython
from timeit import default_timer as timer

possibilities_col1 = ['apple', 'pen', 'pineapple', 'joseph', 'cauliflower']
possibilities_col2 = ['sparta', 'this', 'is', 'again']
entries = 100000
potential_words_col1 = 4
potential_words_col2 = 3
def create_function_col1():
    # build a random phrase of 1-4 words drawn from possibilities_col1
    result = []
    for x in range(random.randint(1, potential_words_col1)):
        result.append(random.choice(possibilities_col1))
    return " ".join(result)

def create_function_col2():
    # build a random phrase of 1-3 words drawn from possibilities_col2
    result = []
    for x in range(random.randint(1, potential_words_col2)):
        result.append(random.choice(possibilities_col2))
    return " ".join(result)

data = {'string_col1': pd.Series([create_function_col1() for _ in range(entries)]),
        'string_col2': pd.Series([create_function_col2() for _ in range(entries)]),
        'value': pd.Series([random.randint(1, 500) for _ in range(entries)])}


df = pd.DataFrame(data)
pat1 = re.compile('apple')
pat2 = re.compile('sparta')
pat3 = re.compile('pineapple')
pat4 = re.compile('this')
#IPython.embed()
start = timer()
mask = df['string_col1'].str.contains(pat1) & \
       df['string_col1'].str.contains(pat3) & \
       df['string_col2'].str.contains(pat2) & \
       df['string_col2'].str.contains(pat4)
valid = df[mask]
# select the 'value' column first, then take the index labels of its extremes
max_idx = valid['value'].idxmax()
min_idx = valid['value'].idxmin()
difference = df.loc[max_idx, 'value'] - df.loc[min_idx, 'value']
end = timer()
print("Difference: {}".format(difference))
print("# Valid: {}".format(len(valid)))
print("Time Elapsed: {}".format(end-start))

Can you explain how many conditions you're applying? Each regex I add only gives a roughly linear increase in time (i.e. going from 2 to 3 regexes means about a 1.5x increase in run time). I'm also getting linear scaling in the number of entries and in both potential string lengths (the potential_words variables).

For reference, this code evaluates in ~0.15 seconds on my machine (1 million entries takes ~1.5 seconds).
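If the cost really is the stack of separate .str.contains calls, one thing worth profiling is folding the patterns that target the same column into a single regex with lookaheads, so each column is only scanned once. A sketch (whether it actually beats separate masks depends on your patterns and data):

# require both 'apple' and 'pineapple' in string_col1,
# and both 'sparta' and 'this' in string_col2
pat_col1 = re.compile(r'(?=.*apple)(?=.*pineapple)')
pat_col2 = re.compile(r'(?=.*sparta)(?=.*this)')

mask = df['string_col1'].str.contains(pat_col1) & \
       df['string_col2'].str.contains(pat_col2)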

Edit: I'm an idiot and wasn't doing the same thing you were (I was taking the difference between values at the smallest and largest indices in the dataset, not the difference between the smallest and largest values), but fixing it didn't really add much in the way of runtime.

Edit 2: How does idxmax() know which column to select a maximum along in your example code?
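For what it's worth, selecting the value column before calling idxmax avoids that ambiguity (and the string-column breakage I mentioned above); a sketch of what I assume the question intended:

max_idx = df.loc[mask, 'value'].idxmax()
min_idx = df.loc[mask, 'value'].idxmin()
difference = df.loc[max_idx, 'value'] - df.loc[min_idx, 'value']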

answered Oct 26 '22 by Saedeas


Pass each mask to the next, smaller subset of the dataframe, so that each new filter runs on fewer rows than the previous one:

pat1 = re.compile('apple')
pat2 = re.compile('sparta')

# first filter: rows where string_col1 contains 'apple'
mask1 = df['string_col1'].str.contains(pat1)
# second filter, computed only over the rows that survived the first one
mask = df[mask1]['string_col2'].str.contains(pat2)
df1 = df[mask1][mask]

max_idx = df1['value'].idxmax()
min_idx = df1['value'].idxmin()
a, b = df1['value'].loc[max_idx], df1['value'].loc[min_idx]
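If you don't actually need the row indices, you can also skip the lookups and take the spread of the filtered values directly:

# same result as a - b when only the difference matters
difference = df1['value'].max() - df1['value'].min()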
answered Oct 26 '22 by denfromufa