I basically want to learn a faster way to slice a Pandas DataFrame with conditional slicing based on regex. For example, take the following df (there are more than 4 variations in the string columns; these are only for illustrative purposes):
index, string_col1, string_col2, value
0, 'apple', 'this', 10
1, 'pen', 'is', 123
2, 'pineapple', 'sparta', 20
3, 'pen pineapple apple pen', 'this', 234
4, 'apple', 'is', 212
5, 'pen', 'sparta', 50
6, 'pineapple', 'this', 69
7, 'pen pineapple apple pen', 'is', 79
8, 'apple pen', 'sparta again', 78
...
100000, 'pen pineapple apple pen', 'this is sparta', 392
I have to do Boolean conditional slicing on the string columns using regex, find the indices of the minimum and maximum in the value column, and then finally take the difference between the min and max values. I do this with the following method, but it's SUPER SLOW when I have to match many different regex patterns:
pat1 = re.compile('apple')
pat2 = re.compile('sparta')
mask = (df['string_col1'].str.contains(pat1)) & (df['string_col2'].str.contains(pat2))
max_idx = df[mask].idxmax()
min_idx = df[mask].idxmin()
difference = df['value'].loc[max_idx] - df['value'].loc[min_idx]
I think that to get one "difference" answer I'm slicing the df too many times, but I can't figure out how to do it with fewer slices. Furthermore, is there a faster way to slice it?
This is an optimization question since I know my code gets me what I need. Any tips will be appreciated!
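A minimal sketch of what "slicing less" could look like here: build the mask once, slice the value column a single time, and take the spread directly (the toy df below only stands in for the real data; the names are illustrative):
import re
import pandas as pd

# Tiny stand-in for the real frame (illustrative values only).
df = pd.DataFrame({
    'string_col1': ['apple', 'pen', 'pineapple', 'pen pineapple apple pen'],
    'string_col2': ['this', 'is', 'sparta', 'this is sparta'],
    'value': [10, 123, 20, 392],
})

pat1 = re.compile('apple')
pat2 = re.compile('sparta')

# Build the boolean mask once, slice the value column once,
# and compute the spread without the two idxmax/idxmin lookups.
mask = df['string_col1'].str.contains(pat1) & df['string_col2'].str.contains(pat2)
values = df.loc[mask, 'value']
difference = values.max() - values.min()
print(difference)  # 392 - 20 = 372 for this toy frame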
I've been trying to profile your example, but I'm actually getting pretty great performance on my synthetic data, so I may need some clarification. (Also, for some reason .idxmax() breaks for me whenever I have a string in my dataframe).
Here's my testing code:
import pandas as pd
import re
import numpy as np
import random
import IPython
from timeit import default_timer as timer
possibilities_col1 = ['apple', 'pen', 'pineapple', 'joseph', 'cauliflower']
possibilities_col2 = ['sparta', 'this', 'is', 'again']
entries = 100000
potential_words_col1 = 4
potential_words_col2 = 3
def create_function_col1():
    result = []
    for x in range(random.randint(1, potential_words_col1)):
        result.append(random.choice(possibilities_col1))
    return " ".join(result)

def create_function_col2():
    result = []
    for x in range(random.randint(1, potential_words_col2)):
        result.append(random.choice(possibilities_col2))
    return " ".join(result)
data = {'string_col1': pd.Series([create_function_col1() for _ in range(entries)]),
        'string_col2': pd.Series([create_function_col2() for _ in range(entries)]),
        'value': pd.Series([random.randint(1, 500) for _ in range(entries)])}
df = pd.DataFrame(data)
pat1 = re.compile('apple')
pat2 = re.compile('sparta')
pat3 = re.compile('pineapple')
pat4 = re.compile('this')
#IPython.embed()
start = timer()
mask = df['string_col1'].str.contains(pat1) & \
       df['string_col1'].str.contains(pat3) & \
       df['string_col2'].str.contains(pat2) & \
       df['string_col2'].str.contains(pat4)
valid = df[mask]
max_idx = valid['value'].idxmax()
min_idx = valid['value'].idxmin()
difference = df.loc[max_idx, 'value'] - df.loc[min_idx, 'value']
end = timer()
print("Difference: {}".format(difference))
print("# Valid: {}".format(len(valid)))
print("Time Elapsed: {}".format(end-start))
Can you explain how many conditions you're applying? Each regex I add only causes a roughly linear increase in time (i.e. going from 2 to 3 regexes means about a 1.5x increase in run time). I'm also getting linear scaling in the number of entries and in both potential string lengths (the potential_words variables).
For reference, this code evaluates in ~0.15 seconds on my machine (1 million entries takes ~1.5 seconds).
Edit: I'm an idiot and wasn't doing the same thing you were (I was taking the difference between values at the smallest and largest indices in the dataset, not the difference between the smallest and largest values), but fixing it didn't really add much in the way of runtime.
Edit 2: How does idxmax() know which column to select a maximum along in your example code?
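A sketch of a column-explicit variant, assuming the combined mask from the question; reducing only the numeric column sidesteps the string issue mentioned above:
# Sketch: restrict the reduction to the numeric column so idxmax/idxmin
# never see the string columns (df and mask as defined in the question).
subset = df.loc[mask, 'value']
max_idx = subset.idxmax()   # index label of the largest matching value
min_idx = subset.idxmin()   # index label of the smallest matching value
difference = subset.loc[max_idx] - subset.loc[min_idx]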
Pass each mask to the next subset of the DataFrame, so each new filter runs on a smaller subset of the original DataFrame:
import re

pat1 = re.compile('apple')
pat2 = re.compile('sparta')

# First mask is computed over the full frame.
mask1 = df['string_col1'].str.contains(pat1)
# Second mask is computed only on the rows that already matched mask1.
mask = df[mask1]['string_col2'].str.contains(pat2)
df1 = df[mask1][mask]

max_idx = df1['value'].idxmax()
min_idx = df1['value'].idxmin()
a, b = df1['value'].loc[max_idx], df1['value'].loc[min_idx]
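The same idea can be written with .loc so each intermediate subset is explicit and no chained indexing is involved (a sketch; difference is a name added here for illustration):
import re

pat1 = re.compile('apple')
pat2 = re.compile('sparta')

# Filter once on string_col1, then filter the smaller frame on string_col2.
subset1 = df.loc[df['string_col1'].str.contains(pat1)]
df1 = subset1.loc[subset1['string_col2'].str.contains(pat2)]

max_idx = df1['value'].idxmax()
min_idx = df1['value'].idxmin()
difference = df1.loc[max_idx, 'value'] - df1.loc[min_idx, 'value']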