Using pandas/Python, I want to calculate the longest increasing subsequence of (Bid, Ask) tuples for each DTE
group, efficiently, over 13M rows. Right now, using apply/iteration, it takes about 10 hours.
Here's roughly my problem:
DTE | Strike | Bid | Ask |
---|---|---|---|
1 | 100 | 10 | 11 |
1 | 200 | 16 | 17 |
1 | 300 | 17 | 18 |
1 | 400 | 11 | 12 |
1 | 500 | 12 | 13 |
1 | 600 | 13 | 14 |
2 | 100 | 10 | 30 |
2 | 200 | 15 | 20 |
2 | 300 | 16 | 21 |
import pandas as pd

df = pd.DataFrame({
    'DTE': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Strike': [100, 200, 300, 400, 500, 600, 100, 200, 300],
    'Bid': [10, 16, 17, 11, 12, 13, 10, 15, 16],
    'Ask': [11, 17, 18, 12, 13, 14, 30, 20, 21],
})
I would like to group by DTE (here we have two groups, DTE 1 and DTE 2), and then within each group order by Strike, which is unique for each DTE group, so the 200 Strike comes after the 100 Strike.
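For example, a quick sanity check (added here, not in the original post) on the sample frame defined above confirms the ordering assumption:

df = df.sort_values(['DTE', 'Strike'])
assert not df.duplicated(['DTE', 'Strike']).any()  # Strike is unique per DTE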
In this case, the answer would be:
DTE | Strike | Bid | Ask |
---|---|---|---|
1 | 100 | 10 | 11 |
1 | 400 | 11 | 12 |
1 | 500 | 12 | 13 |
1 | 600 | 13 | 14 |
2 | 200 | 15 | 20 |
2 | 300 | 16 | 21 |
Only the LONGEST increasing subsequence is kept for EACH GROUP, not just any increasing subsequence. All other rows are dropped.
Note that the standard O(n log n) Longest Increasing Subsequence algorithm does not work here, because (Bid, Ask) tuples compared element-wise are only partially ordered. See https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved for why. The example group DTE 2 is a concrete failure case for the standard O(n log n) LIS solution. I am currently using the standard O(n^2) solution. There is a more complicated O(n log^2 n) algorithm, but I do not think that is my bottleneck.
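As a concrete illustration (a small check added here, using the DTE 2 rows above): the (Bid, Ask) pairs are lexicographically increasing, so a comparison-based LIS keeps all three rows, even though the first step decreases in Ask and is therefore not a valid chain:

pairs = [(10, 30), (15, 20), (16, 21)]  # DTE 2 (Bid, Ask) rows
print(sorted(pairs) == pairs)  # True: lexicographically increasing
print(all(a[0] <= b[0] and a[1] <= b[1] for a, b in zip(pairs, pairs[1:])))
# False: (10, 30) -> (15, 20) decreases in Ask, so it is not a valid chain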
Since each row must refer to the previous rows' already-computed longest-increasing-subsequence values, it seems you cannot do this in parallel, which means you can't vectorize? Would that mean the only way to speed this up is to use Cython? Or are there other concurrent solutions?
My current solution looks like this:
class LongestLsc:
    # simple mutable holder for the best chain found so far
    # (assumed definition; the original post omitted it)
    lsc_count = 0
    lsc_index = -1


def modify_lsc_row(row, df, longest_lsc):
    lsc_predecessor_count = 0
    lsc_predecessor_index = -1
    # candidate predecessors: already-processed rows dominated in both Bid and Ask
    df_predecessors = df[(df['Bid'] <= row.Bid) &
                         (df['Ask'] <= row.Ask) &
                         (df['lsc_count'] != -1)]
    if len(df_predecessors) > 0:
        # keep only predecessors with the longest chain; break ties by the latest Strike
        df_predecessors = df_predecessors[
            df_predecessors['lsc_count'] == df_predecessors['lsc_count'].max()]
        lsc_predecessor_index = df_predecessors.index.max()
        lsc_predecessor_count = df_predecessors.at[lsc_predecessor_index, 'lsc_count']

    new_predecessor_count = lsc_predecessor_count + 1
    df.at[row.name, 'lsc_count'] = new_predecessor_count
    df.at[row.name, 'prev_index'] = lsc_predecessor_index

    if new_predecessor_count >= longest_lsc.lsc_count:
        longest_lsc.lsc_count = new_predecessor_count
        longest_lsc.lsc_index = row.name


def longest_increasing_bid_ask_subsequence(df):
    original_columns = df.columns
    df.sort_values(['Strike'], ascending=True, inplace=True)
    df.set_index(['Strike'], inplace=True)
    assert df.index.is_unique

    longest_lsc = LongestLsc()
    longest_lsc.lsc_index = df.index.max()
    longest_lsc.lsc_count = 1

    df['lsc_count'] = -1

    # O(n^2): each row scans all previously processed rows
    df.apply(lambda row: modify_lsc_row(row, df, longest_lsc), axis=1)

    # walk the chain backwards from the end of the longest subsequence
    while longest_lsc.lsc_index != -1:
        df.at[longest_lsc.lsc_index, 'keep'] = True
        longest_lsc.lsc_index = df.at[longest_lsc.lsc_index, 'prev_index']

    df.dropna(inplace=True)  # rows never marked 'keep' have NaN there
    return df.reset_index()[original_columns]


df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
df_groups.apply(longest_increasing_bid_ask_subsequence)
Update: https://stackoverflow.com/users/15862569/alexander-volkovsky has pointed out that I can use pandarallel to parallelize over DTE, since each group is independent. That speeds it up by about 5x. However, I would like to speed it up much more (particularly the actual longest-increasing-subsequence computation). Separately, pandarallel doesn't seem to work in PyCharm (a known issue: https://github.com/nalepae/pandarallel/issues/76).
Update: I used the suggestions from https://stackoverflow.com/users/15862569/alexander-volkovsky, namely numba and numpy. Pandarallel actually slowed things down as everything else got faster (probably due to its overhead), so I removed it. 10 hours -> 2.8 minutes. Quite the success. One of the biggest speedups was compiling the O(n^2) routine with numba. I also stopped using pandas groupby+apply, even just to call the numba function: I found that groupby+apply costs about the same as groupby plus pd.concat, and you can avoid the pd.concat entirely by doing what Alexander suggested, selecting the rows you want to keep at the end instead of concatenating all the group DataFrames back together. Tons of other small optimizations were found with the line profiler.
Updated code as follows:
import numpy as np
from numba import njit


@njit
def set_list_indices(bids, asks, indices, indices_to_keep):
    entries = len(indices)
    lis_count = np.full(entries, 0)    # number of chain predecessors of row i
    prev_index = np.full(entries, -1)  # backpointer along the chain
    neg_ones = np.full(entries, -1)    # sentinel: "not a valid predecessor"
    longest_lis_count = -1
    longest_lis_index = -1
    for i in range(entries):
        # rows dominated in both Bid and Ask keep their count; others get -1
        predecessor_counts = np.where((bids <= bids[i]) & (asks <= asks[i]),
                                      lis_count, neg_ones)
        predecessor_counts[i:] = -1  # only earlier rows can be predecessors
        # last occurrence of the max, i.e. the latest best predecessor
        best_predecessor_index = entries - 1 - np.argmax(predecessor_counts[::-1])
        if predecessor_counts[best_predecessor_index] >= 0:
            prev_index[i] = best_predecessor_index
            lis_count[i] = predecessor_counts[best_predecessor_index] + 1
        if lis_count[i] >= longest_lis_count:
            longest_lis_count = lis_count[i]
            longest_lis_index = i
    # walk the chain backwards, marking the kept rows in the global mask
    while longest_lis_index != -1:
        indices_to_keep[indices[longest_lis_index]] = True
        longest_lis_index = prev_index[longest_lis_index]


# sorting by Strike is required by the LIS algorithm; groupby preserves this order
df = df.sort_values(['Strike'], ascending=True)
# reset to a 0..n-1 RangeIndex (rows may have been dropped earlier) so group
# indices can address the boolean mask directly
df = df.reset_index(drop=True)

df_groups = df.groupby(['DTE'])
row_indices_to_keep = np.full(len(df.index), False, dtype=bool)
for name, group in df_groups:
    bids = group['Bid'].to_numpy()
    asks = group['Ask'].to_numpy()
    indices = group.index.to_numpy()
    set_list_indices(bids, asks, indices, row_indices_to_keep)
df = df.iloc[row_indices_to_keep]
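As a quick sanity check (added here, not in the original post), running the above on the sample frame from the top of the question keeps exactly the rows from the expected answer:

print(df.sort_values(['DTE', 'Strike']).to_string(index=False))
#  DTE  Strike  Bid  Ask
#    1     100   10   11
#    1     400   11   12
#    1     500   12   13
#    1     600   13   14
#    2     200   15   20
#    2     300   16   21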
The Longest Increasing Subsequence (LIS) problem is to find the length of the longest subsequence of a given sequence such that all elements of the subsequence are sorted in increasing order. For example, the length of the LIS for {10, 22, 9, 33, 21, 50, 41, 60, 80} is 6, and one such LIS is {10, 22, 33, 50, 60, 80}.
Method 1: Recursion. Optimal substructure: let arr[0..n-1] be the input array and L(i) the length of the LIS ending at index i, with arr[i] as its last element. Then L(i) can be written recursively as L(i) = 1 + max(L(j)) over all 0 ≤ j < i with arr[j] < arr[i], or L(i) = 1 if no such j exists.
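A minimal bottom-up sketch of that recurrence (an illustration, not code from the post):

def lis_length(arr):
    # L[i] = length of the longest increasing subsequence ending at index i
    n = len(arr)
    L = [1] * n
    for i in range(1, n):
        for j in range(i):
            if arr[j] < arr[i] and L[j] + 1 > L[i]:
                L[i] = L[j] + 1
    return max(L) if n else 0

assert lis_length([10, 22, 9, 33, 21, 50, 41, 60, 80]) == 6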
What is the complexity of your algorithm for finding the longest increasing subsequence?
This article provides an algorithm with O(n log n) complexity.
Update: this doesn't work here (see the question's note on why O(n log n) fails for tuples).
You don't even need to modify the code, because in Python comparison already works for tuples: assert (1, 2) < (3, 4).
>>> seq=[(10, 11), (16, 17), (17, 18), (11, 12), (12, 13), (13, 14)]
>>> subsequence(seq)
[(10, 11), (11, 12), (12, 13), (13, 14)]
> Since each row must refer to the previous rows to have already computed the longest increasing subsequence at that point, it seems you cannot do this in parallel?

Yes, but you can calculate the sequence in parallel for every DTE. You could try something like pandarallel for parallel aggregation after the .groupby().
from pandarallel import pandarallel
pandarallel.initialize()
# just an example of usage:
df.groupby("DTE").parallel_apply(subsequence)
Also, try to get rid of pandas (it's pretty slow) and use raw numpy arrays and Python structs. You can calculate the LIS indexes using an O(n^2) algorithm and then select the required rows with df.iloc.
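A minimal sketch of that pattern (with a hypothetical lis_keep_mask standing in for the O(n^2) routine; it assumes df has a default 0..n-1 RangeIndex): compute a boolean keep-mask per group on raw numpy arrays, then select all rows once at the end.

import numpy as np

def lis_keep_mask(bids, asks):
    # hypothetical O(n^2) LIS over (Bid, Ask) pairs; returns a boolean
    # mask over the group's rows marking the longest chain
    n = len(bids)
    count = np.ones(n, dtype=np.int64)
    prev = np.full(n, -1, dtype=np.int64)
    for i in range(n):
        for j in range(i):
            if bids[j] <= bids[i] and asks[j] <= asks[i] and count[j] + 1 > count[i]:
                count[i], prev[i] = count[j] + 1, j
    mask = np.zeros(n, dtype=bool)
    i = int(np.argmax(count))
    while i != -1:
        mask[i] = True
        i = prev[i]
    return mask

keep = np.zeros(len(df), dtype=bool)
for _, group in df.groupby('DTE'):
    keep[group.index.to_numpy()] = lis_keep_mask(
        group['Bid'].to_numpy(), group['Ask'].to_numpy())
df = df.iloc[keep]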