 

Vectorization or efficient way to calculate Longest Increasing subsequence of tuples with Pandas

Using pandas/Python, I want to calculate the longest increasing subsequence of tuples for each DTE group, and do it efficiently with 13M rows. Right now, using apply/iteration, it takes about 10 hours.

Here's roughly my problem:

DTE  Strike  Bid  Ask
  1     100   10   11
  1     200   16   17
  1     300   17   18
  1     400   11   12
  1     500   12   13
  1     600   13   14
  2     100   10   30
  2     200   15   20
  2     300   16   21
import pandas as pd
pd.DataFrame({
    'DTE': [1,1,1,1,1,1,2,2,2],
    'Strike': [100,200,300,400,500,600,100,200,300],
    'Bid': [10,16,17,11,12,13,10,15,16],
    'Ask': [11,17,18,12,13,14,30,20,21],
})

I would like to:

  • group these by DTE. Here we have two groups (DTE 1 and DTE 2). Then within each group...
  • find the longest paired increasing subsequence. Sort order is determined by Strike, which is unique within each DTE group, so the 200 Strike comes after the 100 Strike.
    • thus, both the Bid and the Ask of the 200 Strike must be greater than or equal to (not strictly greater than) the Bid and the Ask of the 100 Strike.
    • any strikes in between that do NOT have both Bid and Ask increasing in value are deleted.

In this case, the answer would be:

DTE  Strike  Bid  Ask
  1     100   10   11
  1     400   11   12
  1     500   12   13
  1     600   13   14
  2     200   15   20
  2     300   16   21

Only the LONGEST increasing subsequence is kept for EACH GROUP, not just any increasing subsequence. All other rows are dropped.
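
To pin down the definition, here is a minimal pure-Python O(n^2) reference sketch (the helper name lis_pairs is mine, for illustration only): it tracks, for each row, the longest chain ending there under the componentwise order, then backtracks from the best row.

def lis_pairs(pairs):
    # pairs: list of (bid, ask) tuples, already in Strike order
    n = len(pairs)
    best_len = [1] * n   # length of the best chain ending at i
    prev = [-1] * n      # predecessor of i along that chain
    for i in range(n):
        for j in range(i):
            if (pairs[j][0] <= pairs[i][0] and pairs[j][1] <= pairs[i][1]
                    and best_len[j] + 1 > best_len[i]):
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(n), key=best_len.__getitem__)  # end of a longest chain
    chain = []
    while i != -1:
        chain.append(pairs[i])
        i = prev[i]
    return chain[::-1]

# DTE 1 group from above:
print(lis_pairs([(10, 11), (16, 17), (17, 18), (11, 12), (12, 13), (13, 14)]))
# -> [(10, 11), (11, 12), (12, 13), (13, 14)]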

Note that the standard O(n log n) LIS algorithm does not work here. See https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved for why: the (Bid, Ask) pairs are only partially ordered, so the standard O(n log n) solution can build an invalid chain, and the DTE 2 group above is exactly such a failure case. I am currently using the standard O(n^2) LIS solution. There is a more complicated O(n log^2 n) algorithm, but I do not think that is my bottleneck.
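
A two-line check makes the failure mode concrete (my own illustration, not from the original post): lexicographic tuple comparison says the DTE 2 pair (10, 30) precedes (15, 20), but componentwise it does not, which is what trips up the standard algorithm.

a, b = (10, 30), (15, 20)
print(a < b)                          # True: lexicographic tuple order
print(a[0] <= b[0] and a[1] <= b[1])  # False: the Ask drops from 30 to 20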

Since each row must refer to the previous rows' already-computed longest increasing subsequence, it seems you cannot do this in parallel, which means you can't vectorize? Would that mean that the only way to speed this up would be to use Cython? Or are there other concurrent solutions?

My current solution looks like this:

# LongestLsc is a simple mutable holder; its definition was not shown in
# the original post, so this minimal stand-in is assumed:
class LongestLsc:
    lsc_index = -1
    lsc_count = 0


def modify_lsc_row(row, df, longest_lsc):
    lsc_predecessor_count = 0
    lsc_predecessor_index = -1
    # only rows already processed (lsc_count != -1) can be predecessors
    df_predecessors = df[(df['Bid'] <= row.Bid) &
                         (df['Ask'] <= row.Ask) &
                         (df['lsc_count'] != -1)]
    if len(df_predecessors) > 0:
        df_predecessors = df_predecessors[df_predecessors['lsc_count'] == df_predecessors['lsc_count'].max()]
        lsc_predecessor_index = df_predecessors.index.max()
        lsc_predecessor_count = df_predecessors.at[lsc_predecessor_index, 'lsc_count']

    new_predecessor_count = lsc_predecessor_count + 1
    df.at[row.name, 'lsc_count'] = new_predecessor_count
    df.at[row.name, 'prev_index'] = lsc_predecessor_index

    if new_predecessor_count >= longest_lsc.lsc_count:
        longest_lsc.lsc_count = new_predecessor_count
        longest_lsc.lsc_index = row.name


def longest_increasing_bid_ask_subsequence(df):
    original_columns = df.columns
    df.sort_values(['Strike'], ascending=True, inplace=True)

    df.set_index(['Strike'], inplace=True)
    assert df.index.is_unique

    longest_lsc = LongestLsc()
    longest_lsc.lsc_index = df.index.max()
    longest_lsc.lsc_count = 1

    df['lsc_count'] = -1

    df.apply(lambda row: modify_lsc_row(row, df, longest_lsc), axis=1)

    # walk the predecessor chain backwards, marking the rows to keep
    while longest_lsc.lsc_index != -1:
        df.at[longest_lsc.lsc_index, 'keep'] = True
        longest_lsc.lsc_index = df.at[longest_lsc.lsc_index, 'prev_index']

    # rows never marked 'keep' still hold NaN there, so dropna removes them
    df.dropna(inplace=True)

    return df.reset_index()[original_columns]


df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
df_groups.apply(longest_increasing_bid_ask_subsequence)

Update: https://stackoverflow.com/users/15862569/alexander-volkovsky mentioned I can use pandarallel to parallelize the DTE groups, since they are independent. That does speed it up by about 5x. However, I would like to speed it up much more (particularly the actual longest-increasing-subsequence computation). Separately, pandarallel doesn't seem to work under PyCharm (a known issue: https://github.com/nalepae/pandarallel/issues/76 )

Update: I used https://stackoverflow.com/users/15862569/alexander-volkovsky 's suggestions, namely numba and numpy. Pandarallel actually slowed things down once everything else got faster (probably due to its overhead), so I removed it. 10 hours -> 2.8 minutes. Quite the success. The biggest speedups came from rewriting the O(n^2) loop with numba, and from not using pandas groupby+apply even just to invoke the numba function: I found that the cost of groupby+apply equals groupby + pd.concat, and you can remove the pd.concat by doing what Alexander said, selecting only the rows you want to keep at the end (instead of concatenating all the group DataFrames together). Tons of other small optimizations, mostly discovered with a line profiler.

Updated code as follows:

import numpy as np
from numba import njit


@njit
def set_list_indices(bids, asks, indices, indices_to_keep):
    entries = len(indices)

    lis_count = np.full(entries, 0)
    prev_index = np.full(entries, -1)

    longest_lis_count = -1
    longest_lis_index = -1

    for i in range(entries):
        # chain lengths of rows that can precede row i under the
        # componentwise order (both bid and ask <=); unprocessed rows
        # still hold 0 and so never win
        predecessor_counts = np.where((bids <= bids[i]) & (asks <= asks[i]), lis_count, 0)
        # position of the last occurrence of the maximum count
        best_predecessor_index = len(predecessor_counts) - np.argmax(predecessor_counts[::-1]) - 1

        if best_predecessor_index < i:
            prev_index[i] = best_predecessor_index

        new_count = predecessor_counts[best_predecessor_index] + 1
        lis_count[i] = new_count

        if new_count >= longest_lis_count:
            longest_lis_count = new_count
            longest_lis_index = i

    # walk the predecessor chain backwards, flagging the rows to keep
    while longest_lis_index != -1:
        indices_to_keep[indices[longest_lis_index]] = True
        longest_lis_index = prev_index[longest_lis_index]


# sorting by Strike is necessary for the LIS algorithm; groupby preserves this order
df = df.sort_values(['Strike'], ascending=True)

# reset the index so row positions are contiguous (earlier processing may have dropped rows)
df = df.reset_index(drop=True)

df_groups = df.groupby(['DTE'])

row_indices_to_keep = np.full(len(df.index), False, dtype=bool)
for name, group in df_groups:
    bids = group['Bid'].to_numpy()
    asks = group['Ask'].to_numpy()
    indices = group.index.to_numpy()
    set_list_indices(bids, asks, indices, row_indices_to_keep)

df = df.iloc[row_indices_to_keep]
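
As a quick sanity check (my own snippet, reusing the sample values from above), running set_list_indices on the DTE 2 group alone flags exactly the 200 and 300 strikes:

bids = np.array([10, 15, 16])
asks = np.array([30, 20, 21])
keep = np.full(3, False, dtype=bool)
set_list_indices(bids, asks, np.arange(3), keep)
print(keep)  # [False  True  True] -> strikes 200 and 300 survive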
Asked May 26 '21 by Erich Lin


1 Answer

What is the complexity of your algorithm of finding the longest increasing subsequence?

This article provides an algorithm with O(n log n) complexity. (Update: as the question's DTE 2 group shows, it doesn't work for this problem.) You don't even need to modify the code, because Python already compares tuples lexicographically: assert (1, 2) < (3, 4)

>>> seq=[(10, 11), (16, 17), (17, 18), (11, 12), (12, 13), (13, 14)]
>>> subsequence(seq)
[(10, 11), (11, 12), (12, 13), (13, 14)]

Since each row must refer to the previous rows' already-computed longest increasing subsequence, it seems you cannot do this in parallel?

Yes, but you can calculate the sequence in parallel for every DTE. You could try something like pandarallel for parallel aggregation after the .groupby().

from pandarallel import pandarallel
pandarallel.initialize()

# just an example of usage:
df.groupby("DTE").parallel_apply(subsequence)

Also try to get rid of pandas in the hot path (it's pretty slow) and use raw numpy arrays and plain Python structures. You can compute the LIS indexes with an O(n^2) algorithm and then select the required rows using df.iloc.

Answered Oct 21 '22 by Alexander Volkovsky