Using pandas/Python, I want to calculate the longest increasing subsequence of (Bid, Ask) tuples for each DTE
group, efficiently, over 13M rows. Right now, using apply/iteration, it takes about 10 hours.
Here's roughly my problem:
DTE | Strike | Bid | Ask |
---|---|---|---|
1 | 100 | 10 | 11 |
1 | 200 | 16 | 17 |
1 | 300 | 17 | 18 |
1 | 400 | 11 | 12 |
1 | 500 | 12 | 13 |
1 | 600 | 13 | 14 |
2 | 100 | 10 | 30 |
2 | 200 | 15 | 20 |
2 | 300 | 16 | 21 |
import pandas as pd

df = pd.DataFrame({
    'DTE': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Strike': [100, 200, 300, 400, 500, 600, 100, 200, 300],
    'Bid': [10, 16, 17, 11, 12, 13, 10, 15, 16],
    'Ask': [11, 17, 18, 12, 13, 14, 30, 20, 21],
})
I would like to group by DTE (here we have two groups, DTE 1 and DTE 2), and then within each group order by Strike, which is unique for each DTE group, so the 200 Strike comes after the 100 Strike.
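For example, a quick sanity check (added here, not in the original post) on the sample frame defined above confirms the ordering assumption:

df = df.sort_values(['DTE', 'Strike'])
assert not df.duplicated(['DTE', 'Strike']).any()  # Strike is unique per DTE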
In this case, the answer would be:
DTE | Strike | Bid | Ask |
---|---|---|---|
1 | 100 | 10 | 11 |
1 | 400 | 11 | 12 |
1 | 500 | 12 | 13 |
1 | 600 | 13 | 14 |
2 | 200 | 15 | 20 |
2 | 300 | 16 | 21 |
Only the LONGEST increasing subsequence is kept for EACH GROUP, not just any increasing subsequence. All other rows are dropped.
Note that the standard O(n log n) Longest Increasing Subsequence algorithm does not work here, because (Bid, Ask) tuples compared element-wise are only partially ordered. See https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved for why. The example group DTE 2 is a concrete failure case for the standard O(n log n) LIS solution. I am currently using the standard O(n^2) solution. There is a more complicated O(n log^2 n) algorithm, but I do not think that is my bottleneck.
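As a concrete illustration (a small check added here, using the DTE 2 rows above): the (Bid, Ask) pairs are lexicographically increasing, so a comparison-based LIS keeps all three rows, even though the first step decreases in Ask and is therefore not a valid chain:

pairs = [(10, 30), (15, 20), (16, 21)]  # DTE 2 (Bid, Ask) rows
print(sorted(pairs) == pairs)  # True: lexicographically increasing
print(all(a[0] <= b[0] and a[1] <= b[1] for a, b in zip(pairs, pairs[1:])))
# False: (10, 30) -> (15, 20) decreases in Ask, so it is not a valid chain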
Since each row must refer to the previous rows' already-computed longest-increasing-subsequence values, it seems you cannot do this in parallel, which means you can't vectorize? Would that mean the only way to speed this up is to use Cython? Or are there other concurrent solutions?
My current solution looks like this:
class LongestLsc:
    # simple mutable holder for the best chain found so far
    # (assumed definition; the original post omitted it)
    lsc_count = 0
    lsc_index = -1


def modify_lsc_row(row, df, longest_lsc):
    lsc_predecessor_count = 0
    lsc_predecessor_index = -1
    # candidate predecessors: already-processed rows dominated in both Bid and Ask
    df_predecessors = df[(df['Bid'] <= row.Bid) &
                         (df['Ask'] <= row.Ask) &
                         (df['lsc_count'] != -1)]
    if len(df_predecessors) > 0:
        # keep only predecessors with the longest chain; break ties by the latest Strike
        df_predecessors = df_predecessors[
            df_predecessors['lsc_count'] == df_predecessors['lsc_count'].max()]
        lsc_predecessor_index = df_predecessors.index.max()
        lsc_predecessor_count = df_predecessors.at[lsc_predecessor_index, 'lsc_count']

    new_predecessor_count = lsc_predecessor_count + 1
    df.at[row.name, 'lsc_count'] = new_predecessor_count
    df.at[row.name, 'prev_index'] = lsc_predecessor_index

    if new_predecessor_count >= longest_lsc.lsc_count:
        longest_lsc.lsc_count = new_predecessor_count
        longest_lsc.lsc_index = row.name


def longest_increasing_bid_ask_subsequence(df):
    original_columns = df.columns
    df.sort_values(['Strike'], ascending=True, inplace=True)
    df.set_index(['Strike'], inplace=True)
    assert df.index.is_unique

    longest_lsc = LongestLsc()
    longest_lsc.lsc_index = df.index.max()
    longest_lsc.lsc_count = 1

    df['lsc_count'] = -1

    # O(n^2): each row scans all previously processed rows
    df.apply(lambda row: modify_lsc_row(row, df, longest_lsc), axis=1)

    # walk the chain backwards from the end of the longest subsequence
    while longest_lsc.lsc_index != -1:
        df.at[longest_lsc.lsc_index, 'keep'] = True
        longest_lsc.lsc_index = df.at[longest_lsc.lsc_index, 'prev_index']

    df.dropna(inplace=True)  # rows never marked 'keep' have NaN there
    return df.reset_index()[original_columns]


df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
df_groups.apply(longest_increasing_bid_ask_subsequence)
Update: https://stackoverflow.com/users/15862569/alexander-volkovsky has pointed out that I can use pandarallel to parallelize over DTE, since each group is independent. That speeds it up by about 5x. However, I would like to speed it up much more (particularly the actual longest-increasing-subsequence computation). Separately, pandarallel doesn't seem to work in PyCharm (a known issue: https://github.com/nalepae/pandarallel/issues/76).
Update: I used the suggestions from https://stackoverflow.com/users/15862569/alexander-volkovsky, namely numba and numpy. Pandarallel actually slowed things down as everything else got faster (probably due to its overhead), so I removed it. 10 hours -> 2.8 minutes. Quite the success. One of the biggest speedups was compiling the O(n^2) routine with numba. I also stopped using pandas groupby+apply, even just to call the numba function: I found that groupby+apply costs about the same as groupby plus pd.concat, and you can avoid the pd.concat entirely by doing what Alexander suggested, selecting the rows you want to keep at the end instead of concatenating all the group DataFrames back together. Tons of other small optimizations were found with the line profiler.
Updated code as follows:
import numpy as np
from numba import njit


@njit
def set_list_indices(bids, asks, indices, indices_to_keep):
    entries = len(indices)
    lis_count = np.full(entries, 0)    # number of chain predecessors of row i
    prev_index = np.full(entries, -1)  # backpointer along the chain
    neg_ones = np.full(entries, -1)    # sentinel: "not a valid predecessor"
    longest_lis_count = -1
    longest_lis_index = -1
    for i in range(entries):
        # rows dominated in both Bid and Ask keep their count; others get -1
        predecessor_counts = np.where((bids <= bids[i]) & (asks <= asks[i]),
                                      lis_count, neg_ones)
        predecessor_counts[i:] = -1  # only earlier rows can be predecessors
        # last occurrence of the max, i.e. the latest best predecessor
        best_predecessor_index = entries - 1 - np.argmax(predecessor_counts[::-1])
        if predecessor_counts[best_predecessor_index] >= 0:
            prev_index[i] = best_predecessor_index
            lis_count[i] = predecessor_counts[best_predecessor_index] + 1
        if lis_count[i] >= longest_lis_count:
            longest_lis_count = lis_count[i]
            longest_lis_index = i
    # walk the chain backwards, marking the kept rows in the global mask
    while longest_lis_index != -1:
        indices_to_keep[indices[longest_lis_index]] = True
        longest_lis_index = prev_index[longest_lis_index]


# sorting by Strike is required by the LIS algorithm; groupby preserves this order
df = df.sort_values(['Strike'], ascending=True)
# reset to a 0..n-1 RangeIndex (rows may have been dropped earlier) so group
# indices can address the boolean mask directly
df = df.reset_index(drop=True)

df_groups = df.groupby(['DTE'])
row_indices_to_keep = np.full(len(df.index), False, dtype=bool)
for name, group in df_groups:
    bids = group['Bid'].to_numpy()
    asks = group['Ask'].to_numpy()
    indices = group.index.to_numpy()
    set_list_indices(bids, asks, indices, row_indices_to_keep)
df = df.iloc[row_indices_to_keep]
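As a quick sanity check (added here, not in the original post), running the above on the sample frame from the top of the question keeps exactly the rows from the expected answer:

print(df.sort_values(['DTE', 'Strike']).to_string(index=False))
#  DTE  Strike  Bid  Ask
#    1     100   10   11
#    1     400   11   12
#    1     500   12   13
#    1     600   13   14
#    2     200   15   20
#    2     300   16   21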
The Longest Increasing Subsequence (LIS) problem is to find the length of the longest subsequence of a given sequence such that all elements of the subsequence are sorted in increasing order. For example, the length of the LIS for {10, 22, 9, 33, 21, 50, 41, 60, 80} is 6, and one such LIS is {10, 22, 33, 50, 60, 80}.
Method 1: Recursion. Optimal substructure: let arr[0..n-1] be the input array and L(i) the length of the LIS ending at index i, with arr[i] as its last element. Then L(i) can be written recursively as L(i) = 1 + max(L(j)) over all 0 ≤ j < i with arr[j] < arr[i], or L(i) = 1 if no such j exists.
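A minimal bottom-up sketch of that recurrence (an illustration, not code from the post):

def lis_length(arr):
    # L[i] = length of the longest increasing subsequence ending at index i
    n = len(arr)
    L = [1] * n
    for i in range(1, n):
        for j in range(i):
            if arr[j] < arr[i] and L[j] + 1 > L[i]:
                L[i] = L[j] + 1
    return max(L) if n else 0

assert lis_length([10, 22, 9, 33, 21, 50, 41, 60, 80]) == 6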
What is the complexity of your algorithm for finding the longest increasing subsequence?
This article provides an algorithm with O(n log n) complexity.
Update: this doesn't work here (see the question's note on why O(n log n) fails for tuples).
You don't even need to modify the code, because in Python comparison already works for tuples: assert (1, 2) < (3, 4).
>>> seq=[(10, 11), (16, 17), (17, 18), (11, 12), (12, 13), (13, 14)]
>>> subsequence(seq)
[(10, 11), (11, 12), (12, 13), (13, 14)]
> Since each row must refer to the previous rows to have already computed the longest increasing subsequence at that point, it seems you cannot do this in parallel?

Yes, but you can calculate the sequence in parallel for every DTE. You could try something like pandarallel for parallel aggregation after the .groupby().
from pandarallel import pandarallel
pandarallel.initialize()
# just an example of usage:
df.groupby("DTE").parallel_apply(subsequence)
Also, try to get rid of pandas (it's pretty slow) and use raw numpy arrays and Python structs. You can calculate the LIS indexes using an O(n^2) algorithm and then select the required rows with df.iloc.
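A minimal sketch of that pattern (with a hypothetical lis_keep_mask standing in for the O(n^2) routine; it assumes df has a default 0..n-1 RangeIndex): compute a boolean keep-mask per group on raw numpy arrays, then select all rows once at the end.

import numpy as np

def lis_keep_mask(bids, asks):
    # hypothetical O(n^2) LIS over (Bid, Ask) pairs; returns a boolean
    # mask over the group's rows marking the longest chain
    n = len(bids)
    count = np.ones(n, dtype=np.int64)
    prev = np.full(n, -1, dtype=np.int64)
    for i in range(n):
        for j in range(i):
            if bids[j] <= bids[i] and asks[j] <= asks[i] and count[j] + 1 > count[i]:
                count[i], prev[i] = count[j] + 1, j
    mask = np.zeros(n, dtype=bool)
    i = int(np.argmax(count))
    while i != -1:
        mask[i] = True
        i = prev[i]
    return mask

keep = np.zeros(len(df), dtype=bool)
for _, group in df.groupby('DTE'):
    keep[group.index.to_numpy()] = lis_keep_mask(
        group['Bid'].to_numpy(), group['Ask'].to_numpy())
df = df.iloc[keep]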