Nested (double) row by row iteration of a Pandas DataFrame

Hi, I am trying to find a vectorized (or more efficient) solution to an iteration problem, where the only solution I found requires row-by-row iteration of a DataFrame with multiple loops. The actual data file is huge, so my current solution is practically infeasible. I included line profiler outputs at the very end, if you'd like to have a look. The real problem is quite complex, so I'll try to explain it with a simple example (took me quite a while to simplify it :)):

Assume we have an airport with two landing strips side by side. Each plane lands (arrival time), taxis on one of the landing strips for a while, then takes off (departure time). Everything is stored in a Pandas DataFrame, which is sorted by arrival time, as follows (see EDIT2 for a bigger dataset for testing):

PLANE   STRIP   ARRIVAL   DEPARTURE
0       1       85.00     86.00
1       1       87.87     92.76
2       2       88.34     89.72
3       1       88.92     90.88
4       2       90.03     92.77
5       2       90.27     91.95
6       2       92.42     93.58
7       2       94.42     95.58

Looking for solutions to two cases:

1. Build a list of events where more than one plane is present on a single strip at a time. Do not include subsets of events (e.g. do not show [3,4] if there is a valid [3,4,5] case). The list should store the indices of the actual DataFrame rows. See function findSingleEvents() for a solution for this case (runs in around 5 ms).

2. Build a list of events where there is at least one plane on each strip at a time. Do not count subsets of an event; only record the event with the maximum number of planes (e.g. do not show [3,4] if there is a [3,4,5] case). Do not count events that occur entirely on a single strip. The list should store the indices of the actual DataFrame rows. See function findMultiEvents() for a solution for this case (runs in around 15 ms).
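For reference, the condition both cases build on is the standard interval-overlap test: two planes are present at the same time exactly when each arrives before the other departs. A minimal sketch (the helper name is illustrative, not part of the code below):

```python
# Two planes overlap in time iff each one arrives before the other departs.
# (Illustrative helper; the column names match the DataFrame above.)
def overlaps(arr_a, dep_a, arr_b, dep_b):
    return arr_a < dep_b and arr_b < dep_a

# Planes 3 (88.92-90.88) and 4 (90.03-92.77) overlap in time:
print(overlaps(88.92, 90.88, 90.03, 92.77))  # True
# Planes 0 (85.00-86.00) and 1 (87.87-92.76) do not:
print(overlaps(85.00, 86.00, 87.87, 92.76))  # False
```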

Working Code:

from __future__ import division  # a __future__ import must precede all other imports
import itertools
import numpy as np
import pandas as pd

data =  [{'PLANE':0, 'STRIP':1, 'ARRIVAL':85.00, 'DEPARTURE':86.00},
         {'PLANE':1, 'STRIP':1, 'ARRIVAL':87.87, 'DEPARTURE':92.76},
         {'PLANE':2, 'STRIP':2, 'ARRIVAL':88.34, 'DEPARTURE':89.72},
         {'PLANE':3, 'STRIP':1, 'ARRIVAL':88.92, 'DEPARTURE':90.88},
         {'PLANE':4, 'STRIP':2, 'ARRIVAL':90.03, 'DEPARTURE':92.77},
         {'PLANE':5, 'STRIP':2, 'ARRIVAL':90.27, 'DEPARTURE':91.95},
         {'PLANE':6, 'STRIP':2, 'ARRIVAL':92.42, 'DEPARTURE':93.58},
         {'PLANE':7, 'STRIP':2, 'ARRIVAL':94.42, 'DEPARTURE':95.58}]

df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])

def findSingleEvents(df):
    events = []
    for row in df.itertuples():
        #Create temporary dataframe for each main iteration
        dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
        if len(dfTemp)>1:
            #convert index values to integers from long
            current_event = [int(v) for v in dfTemp.index.tolist()]
            #loop backwards to remove elements that do not comply
            for i in reversed(current_event):
                if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
                    current_event.remove(i)
            events.append(current_event)
    #remove duplicate events
    events = map(list, set(map(tuple, events)))
    return events

def findMultiEvents(df):
    events = []
    for row in df.itertuples():
        #Create temporary dataframe for each main iteration
        dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
        if len(dfTemp)>1:
            #convert index values to integers from long
            current_event = [int(v) for v in dfTemp.index.tolist()]
            #loop backwards to remove elements that do not comply
            for i in reversed(current_event):
                if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
                    current_event.remove(i)
            #remove elements only on 1 strip
            if len(df.iloc[current_event].STRIP.unique()) > 1:
                events.append(current_event)
    #remove duplicate events
    events = map(list, set(map(tuple, events)))
    return events

print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)

Verified Output:

[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]

Obviously, these are neither efficient nor elegant solutions. With the huge DataFrame I have, running this will probably take hours. I thought about a vectorized approach for quite a while, but could not come up with anything solid. Any pointers/help would be welcome! I am also open to Numpy/Cython/Numba based approaches.
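One direction worth exploring for a vectorized version (a rough sketch, not a full solution to either case, since it does not recover the index groups): treat each arrival as +1 and each departure as -1, sort the merged event times once, and take a cumulative sum. The running count tells you how many planes are present at any instant, which is the usual sweep-line building block for interval-overlap problems:

```python
import numpy as np

# Sweep-line sketch: count how many planes are present at once.
# Values taken from the question's sample data (sorted by arrival).
arrivals = np.array([85.00, 87.87, 88.34, 88.92, 90.03, 90.27, 92.42, 94.42])
departures = np.array([86.00, 92.76, 89.72, 90.88, 92.77, 91.95, 93.58, 95.58])

# Merge all time points, tagging arrivals +1 and departures -1.
times = np.concatenate([arrivals, departures])
deltas = np.concatenate([np.ones(len(arrivals)), -np.ones(len(departures))])
order = np.argsort(times, kind='stable')

# Running count of planes present after each event; ties between an
# arrival and a departure at the same instant would need a tie-break rule.
concurrency = np.cumsum(deltas[order]).astype(int)
print(concurrency.max())  # 4 planes are present at the peak
```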

Thanks!

PS: If you wonder what I will do with the lists: I will assign an EVENT number to each event, and build a separate database by merging the data above with the EVENT numbers as a separate column, to be used for something else. For Case 1, it will look something like this:

EVENT    PLANE   STRIP   ARRIVAL   DEPARTURE
0        4       2       90.03     92.77
0        5       2       90.27     91.95
1        4       2       90.03     92.77
1        6       2       92.42     93.58

EDIT: Revised the code and the test data set.

EDIT2: Use the code below to generate a 1000-row (or longer) DataFrame for testing purposes (per @ImportanceOfBeingErnest's recommendation).

import random
import pandas as pd
import numpy as np

data =  []
for i in range(1000):
    arrival = random.uniform(0,1000)
    departure = arrival + random.uniform(2.0, 10.0)
    data.append({'PLANE':i, 'STRIP':random.randint(1, 2),'ARRIVAL':arrival,'DEPARTURE':departure})

df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])
df = df.sort_values(by=['ARRIVAL'])
df = df.reset_index(drop=True)
df.PLANE  = df.index
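One optional tweak (not in the original): seeding the generator makes repeated benchmark runs comparable, since every implementation is then timed on identical data.

```python
import random

random.seed(0)                       # fix the seed ...
first = [random.uniform(0, 1000) for _ in range(3)]
random.seed(0)                       # ... re-seeding replays the same stream
second = [random.uniform(0, 1000) for _ in range(3)]
print(first == second)  # True: the generated values are reproducible
```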

EDIT3:

A modified version of the accepted answer. The accepted answer did not remove all subsets of events; this version satisfies the rule "(e.g. do not show [3,4] if there is a valid [3,4,5] case)".

def maximal_subsets_modified(sets):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets:
        if not (s > s0) and len(s0) > 1:
            not_in_list = True
            for x in maximal_sets:
                if set(x).issubset(set(s0)):
                    maximal_sets.remove(x)
                if set(s0).issubset(set(x)):
                    not_in_list = False
            if not_in_list:
                maximal_sets.append(list(s0))
        s0 = s
    if len(s0) > 1:
        not_in_list = True
        for x in maximal_sets:
            if set(x).issubset(set(s0)):
                maximal_sets.remove(x)
            if set(s0).issubset(set(x)):
                not_in_list = False
        if not_in_list:
            maximal_sets.append(list(s0))
    return maximal_sets

def maximal_subsets_2_modified(sets, d):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets:
        if not (s > s0) and len(s0) > 1 and d.loc[list(s0), 'STRIP'].nunique() == 2:
            not_in_list = True
            for x in maximal_sets:
                if set(x).issubset(set(s0)):
                    maximal_sets.remove(x)
                if set(s0).issubset(set(x)):
                    not_in_list = False
            if not_in_list:
                maximal_sets.append(list(s0))
        s0 = s
    if len(s0) > 1 and d.loc[list(s0), 'STRIP'].nunique() == 2:
        not_in_list = True
        for x in maximal_sets:
            if set(x).issubset(set(s0)):
                maximal_sets.remove(x)
            if set(s0).issubset(set(x)):
                not_in_list = False
        if not_in_list:
            maximal_sets.append(list(s0))
    return maximal_sets

# single

def hal_3_modified(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets_modified(sets)

# multi

def hal_5_modified(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets_2_modified(sets, d)
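To make the "no subsets" rule these functions enforce concrete, here is a tiny standalone illustration (a hypothetical helper, deliberately quadratic): a candidate set survives only if it is not a proper subset of another candidate.

```python
def maximal_only(sets):
    """Keep only sets that are not proper subsets of another set (O(n^2))."""
    return [s for s in sets if not any(s < other for other in sets)]

candidates = [frozenset({3, 4}), frozenset({3, 4, 5}), frozenset({1, 3})]
result = maximal_only(candidates)
print(result)  # {3, 4} is dropped: it is a proper subset of {3, 4, 5}
```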
asked May 30 '18 by marillion


1 Answer

I rewrote the solution using DataFrame.apply instead of iterating, and as an optimization used numpy arrays wherever possible. I used frozenset because frozensets are immutable and hashable, so Series.unique works properly; Series.unique fails on elements of type set.

Also, I found d.loc[list(x), 'STRIP'].nunique() to be slightly faster than d.loc[list(x)].STRIP.nunique(). I'm not sure why, but I used the faster statement in the solution below.

The algorithm in plain English:

For each row, build the set of indices less than or equal to the current index whose DEPARTURE is later than the current ARRIVAL. This results in a list of sets.

Return the unique sets that are not subsets of other sets (for the 2nd algorithm, additionally require that a set refers to both STRIPs).
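These two steps can be checked end-to-end on the question's sample data in plain Python (no pandas; this is only to show what the intermediate sets look like):

```python
# Sample data from the question, already sorted by arrival.
arr = [85.00, 87.87, 88.34, 88.92, 90.03, 90.27, 92.42, 94.42]
dep = [86.00, 92.76, 89.72, 90.88, 92.77, 91.95, 93.58, 95.58]

# Step 1: for each row i, the set of rows j <= i whose departure
# is after row i's arrival (i.e. planes still present at that arrival).
sets = [frozenset(j for j in range(i + 1) if dep[j] > arr[i])
        for i in range(len(arr))]

# Step 2: keep unique sets with more than one plane that are not
# proper subsets of any other set.
events = sorted(sorted(s) for s in set(sets)
                if len(s) > 1 and not any(s < t for t in sets))
print(events)  # [[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
```

The result matches the multi-strip output shown in the Outputs section below.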

(Update) 2nd Improvement:

One small improvement comes from dropping down to the numpy layer and using np.apply_along_axis instead of df.apply. This is possible because PLANE always equals the DataFrame index, so we can work on the underlying matrix via df.values.

A major improvement came from replacing the list comprehension that returns the maximal subsets:

[list(x) for x in sets if ~np.any(sets > x)]

The above is an O(n^2) operation. On small datasets it is very fast, but on bigger datasets this statement becomes the bottleneck. To optimize it, first sort the sets, then loop through them once more to find the maximal subsets: once sorted, it is sufficient to check that elem[n] is not a subset of elem[n+1] to determine whether elem[n] is maximal. The sort compares two elements with the < operation.
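The sort-then-scan idea can be sketched in isolation (illustrative helper; the len > 1 filter from the real implementation is omitted). Note that < on sets is only a partial order, which is presumably why EDIT3 of the question still found leftover subsets in some cases:

```python
def maximal_by_sorted_scan(sets):
    """Sort (list comparison here uses < on sets), then walk from largest
    to smallest, keeping a set only when the last kept set does not
    already contain it."""
    sets = sorted(sets)
    kept, last = [], frozenset()
    for s in reversed(sets):
        if last > s:          # s is a proper subset of the last kept set
            continue
        kept.append(sorted(s))
        last = s
    return kept

chain = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
print(maximal_by_sorted_scan(chain))  # [[1, 2, 3]]
```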

Timings:

While my original implementation improved performance significantly compared to the OP's attempt, its runtime still grew explosively with input size, as the following chart shows.

I present only the timings for findMultiEvents, hal_2 & hal_5. The relative performance of findSingleEvents, hal_1 & hal_3 is comparable.

[chart: algorithm execution time vs. input size]

See the benchmarking code below.

Note that I stopped benchmarking findMultiEvents & hal_2 early, once it became evident that they were dramatically less efficient.

Improved Implementation:

def maximal_subsets(sets):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets[::-1]:
        if s0 > s or len(s) < 2:
            continue
        maximal_sets.append(list(s))
        s0 = s
    return maximal_sets

def maximal_subsets_2(sets, d):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets[::-1]:
        if s0 > s or len(s) < 2 or d.loc[list(s), 'STRIP'].nunique() < 2:
            continue
        maximal_sets.append(list(s))
        s0 = s
    return maximal_sets

# single
def hal_3(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets(sets)
# multi
def hal_5(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets_2(sets, d)

Original Implementation:

# findSingleEvents
def hal_1(d):
    sets = d.apply(
       lambda x: frozenset(
           d.index.values[(d.index.values <= x.name) & (d.DEPARTURE.values > x.ARRIVAL)]
       ),
       axis=1
    ).unique()
    return [list(x) for x in sets if ~np.any(sets > x) and len(x) > 1]

# findMultiEvents
def hal_2(d):
    sets = d.apply(
        lambda x: frozenset(
            d.index.values[(d.index.values <= x.name) & (d.DEPARTURE.values > x.ARRIVAL)]
        ),
        axis=1
    ).unique()
    return [list(x) for x in sets 
            if ~np.any(sets > x) and
               len(x) > 1 and 
               d.loc[list(x), 'STRIP'].nunique() == 2]

Outputs:

The outputs are identical to the OP's implementation.

hal_1(df[df.STRIP==1])
[[1, 3]]
hal_1(df[df.STRIP==2])
[[4, 5], [4, 6]]
hal_2(df)
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
hal_3(df[df.STRIP==1])
[[1, 3]]
hal_3(df[df.STRIP==2])
[[4, 5], [4, 6]]
hal_5(df)
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]

Test System Details:

OS: Windows 10
Python: 3.6 (Anaconda)
pandas: 0.22.0
numpy: 1.14.3

Benchmarking code:


import random

def mk_random_df(n):
    data =  []
    for i in range(n):
        arrival = random.uniform(0,1000)
        departure = arrival + random.uniform(2.0, 10.0)
        data.append({'PLANE':i, 'STRIP':random.randint(1, 2),'ARRIVAL':arrival,'DEPARTURE':departure})

    df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])
    df = df.sort_values(by=['ARRIVAL'])
    df = df.reset_index(drop=True)
    df.PLANE = df.index
    return df

dfs = {i: mk_random_df(100*(2**i)) for i in range(0, 10)}
times, times_2, times_5 = [], [], []

for i, v in dfs.items():
    if i < 5:
        t = %timeit -o -n 3 -r 3 findMultiEvents(v)
        times.append({'size(pow. of 2)': i, 'timings': t})

for i, v in dfs.items():
    t = %timeit -o -n 3 -r 3 hal_5(v)
    times_5.append({'size(pow. of 2)': i, 'timings': t})

for i, v in dfs.items():
    if i < 9:
        t = %timeit -o -n 3 -r 3 hal_2(v)
        times_2.append({'size(pow. of 2)': i, 'timings': t})
answered Nov 02 '22 by Haleemur Ali