Nested (double) row by row iteration of a Pandas DataFrame

Hi, I am trying to find a vectorized (or more efficient) solution to an iteration problem, where the only solution I found requires row-by-row iteration of a DataFrame with multiple loops. The actual data file is huge, so my current solution is practically infeasible. I included line profiler outputs at the very end, if you'd like to have a look. The real problem is quite complex, so I'll try to explain it with a simple example (took me quite a while to simplify it :)):

Assume we have an airport with two landing strips side by side. Each plane lands (arrival time), taxis on one of the landing strips for a while, then takes off (departure time). Everything is stored in a Pandas DataFrame, which is sorted by arrival time, as follows (see EDIT2 for a bigger dataset for testing):

PLANE   STRIP   ARRIVAL   DEPARTURE
0       1       85.00     86.00
1       1       87.87     92.76
2       2       88.34     89.72
3       1       88.92     90.88
4       2       90.03     92.77
5       2       90.27     91.95
6       2       92.42     93.58
7       2       94.42     95.58

Looking for solutions to two cases:

1. Build a list of events where more than one plane is present on a single strip at a time. Do not include subsets of events (e.g. do not show [3,4] if there is a valid [3,4,5] case). The list should store the indices of the actual DataFrame rows. See function findSingleEvents() for a solution for this case (runs in around 5 ms).

2. Build a list of events where there is at least one plane on each strip at a time. Do not count subsets of an event; only record the event with the maximum number of planes (e.g. do not show [3,4] if there is a [3,4,5] case). Do not count events that occur entirely on a single strip. The list should store the indices of the actual DataFrame rows. See function findMultiEvents() for a solution for this case (runs in around 15 ms).
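For reference, the condition both cases build on is the standard interval-overlap test: two planes are present at the same time exactly when each arrives before the other departs. A minimal sketch (the helper name is illustrative, not part of the code below):

```python
# Two planes overlap in time iff each one arrives before the other departs.
# (Illustrative helper; the column names match the DataFrame above.)
def overlaps(arr_a, dep_a, arr_b, dep_b):
    return arr_a < dep_b and arr_b < dep_a

# Planes 3 (88.92-90.88) and 4 (90.03-92.77) overlap in time:
print(overlaps(88.92, 90.88, 90.03, 92.77))  # True
# Planes 0 (85.00-86.00) and 1 (87.87-92.76) do not:
print(overlaps(85.00, 86.00, 87.87, 92.76))  # False
```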

Working Code:

from __future__ import division  # a __future__ import must precede all other imports
import itertools
import numpy as np
import pandas as pd

data =  [{'PLANE':0, 'STRIP':1, 'ARRIVAL':85.00, 'DEPARTURE':86.00},
         {'PLANE':1, 'STRIP':1, 'ARRIVAL':87.87, 'DEPARTURE':92.76},
         {'PLANE':2, 'STRIP':2, 'ARRIVAL':88.34, 'DEPARTURE':89.72},
         {'PLANE':3, 'STRIP':1, 'ARRIVAL':88.92, 'DEPARTURE':90.88},
         {'PLANE':4, 'STRIP':2, 'ARRIVAL':90.03, 'DEPARTURE':92.77},
         {'PLANE':5, 'STRIP':2, 'ARRIVAL':90.27, 'DEPARTURE':91.95},
         {'PLANE':6, 'STRIP':2, 'ARRIVAL':92.42, 'DEPARTURE':93.58},
         {'PLANE':7, 'STRIP':2, 'ARRIVAL':94.42, 'DEPARTURE':95.58}]

df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])

def findSingleEvents(df):
    events = []
    for row in df.itertuples():
        #Create temporary dataframe for each main iteration
        dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
        if len(dfTemp)>1:
            #convert index values to integers from long
            current_event = [int(v) for v in dfTemp.index.tolist()]
            #loop backwards to remove elements that do not comply
            for i in reversed(current_event):
                if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
                    current_event.remove(i)
            events.append(current_event)
    #remove duplicate events
    events = map(list, set(map(tuple, events)))
    return events

def findMultiEvents(df):
    events = []
    for row in df.itertuples():
        #Create temporary dataframe for each main iteration
        dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
        if len(dfTemp)>1:
            #convert index values to integers from long
            current_event = [int(v) for v in dfTemp.index.tolist()]
            #loop backwards to remove elements that do not comply
            for i in reversed(current_event):
                if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
                    current_event.remove(i)
            #remove elements only on 1 strip
            if len(df.iloc[current_event].STRIP.unique()) > 1:
                events.append(current_event)
    #remove duplicate events
    events = map(list, set(map(tuple, events)))
    return events

print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)

Verified Output:

[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]

Obviously, these are neither efficient nor elegant solutions. With the huge DataFrame I have, running this will probably take hours. I thought about a vectorized approach for quite a while, but could not come up with anything solid. Any pointers/help would be welcome! I am also open to Numpy/Cython/Numba based approaches.
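One direction worth exploring for a vectorized version (a rough sketch, not a full solution to either case, since it does not recover the index groups): treat each arrival as +1 and each departure as -1, sort the merged event times once, and take a cumulative sum. The running count tells you how many planes are present at any instant, which is the usual sweep-line building block for interval-overlap problems:

```python
import numpy as np

# Sweep-line sketch: count how many planes are present at once.
# Values taken from the question's sample data (sorted by arrival).
arrivals = np.array([85.00, 87.87, 88.34, 88.92, 90.03, 90.27, 92.42, 94.42])
departures = np.array([86.00, 92.76, 89.72, 90.88, 92.77, 91.95, 93.58, 95.58])

# Merge all time points, tagging arrivals +1 and departures -1.
times = np.concatenate([arrivals, departures])
deltas = np.concatenate([np.ones(len(arrivals)), -np.ones(len(departures))])
order = np.argsort(times, kind='stable')

# Running count of planes present after each event; ties between an
# arrival and a departure at the same instant would need a tie-break rule.
concurrency = np.cumsum(deltas[order]).astype(int)
print(concurrency.max())  # 4 planes are present at the peak
```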

Thanks!

PS: If you wonder what I will do with the lists: I will assign an EVENT number to each event, and build a separate database by merging the data above with the EVENT numbers as a separate column, to be used for something else. For Case 1, it will look something like this:

EVENT    PLANE   STRIP   ARRIVAL   DEPARTURE
0        4       2       90.03     92.77
0        5       2       90.27     91.95
1        4       2       90.03     92.77
1        6       2       92.42     93.58

EDIT: Revised the code and the test data set.

EDIT2: Use the code below to generate a 1000-row (or longer) DataFrame for testing purposes (per @ImportanceOfBeingErnest's recommendation).

import random
import pandas as pd
import numpy as np

data =  []
for i in range(1000):
    arrival = random.uniform(0,1000)
    departure = arrival + random.uniform(2.0, 10.0)
    data.append({'PLANE':i, 'STRIP':random.randint(1, 2),'ARRIVAL':arrival,'DEPARTURE':departure})

df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])
df = df.sort_values(by=['ARRIVAL'])
df = df.reset_index(drop=True)
df.PLANE  = df.index
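One optional tweak (not in the original): seeding the generator makes repeated benchmark runs comparable, since every implementation is then timed on identical data.

```python
import random

random.seed(0)                       # fix the seed ...
first = [random.uniform(0, 1000) for _ in range(3)]
random.seed(0)                       # ... re-seeding replays the same stream
second = [random.uniform(0, 1000) for _ in range(3)]
print(first == second)  # True: the generated values are reproducible
```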

EDIT3:

A modified version of the accepted answer. The accepted answer did not remove all subsets of events; this version satisfies the rule "(e.g. do not show [3,4] if there is a valid [3,4,5] case)".

def maximal_subsets_modified(sets):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets:
        if not (s > s0) and len(s0) > 1:
            not_in_list = True
            for x in maximal_sets:
                if set(x).issubset(set(s0)):
                    maximal_sets.remove(x)
                if set(s0).issubset(set(x)):
                    not_in_list = False
            if not_in_list:
                maximal_sets.append(list(s0))
        s0 = s
    if len(s0) > 1:
        not_in_list = True
        for x in maximal_sets:
            if set(x).issubset(set(s0)):
                maximal_sets.remove(x)
            if set(s0).issubset(set(x)):
                not_in_list = False
        if not_in_list:
            maximal_sets.append(list(s0))
    return maximal_sets

def maximal_subsets_2_modified(sets, d):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets:
        if not (s > s0) and len(s0) > 1 and d.loc[list(s0), 'STRIP'].nunique() == 2:
            not_in_list = True
            for x in maximal_sets:
                if set(x).issubset(set(s0)):
                    maximal_sets.remove(x)
                if set(s0).issubset(set(x)):
                    not_in_list = False
            if not_in_list:
                maximal_sets.append(list(s0))
        s0 = s
    if len(s0) > 1 and d.loc[list(s0), 'STRIP'].nunique() == 2:
        not_in_list = True
        for x in maximal_sets:
            if set(x).issubset(set(s0)):
                maximal_sets.remove(x)
            if set(s0).issubset(set(x)):
                not_in_list = False
        if not_in_list:
            maximal_sets.append(list(s0))
    return maximal_sets

# single

def hal_3_modified(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets_modified(sets)

# multi

def hal_5_modified(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets_2_modified(sets, d)
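To make the "no subsets" rule these functions enforce concrete, here is a tiny standalone illustration (a hypothetical helper, deliberately quadratic): a candidate set survives only if it is not a proper subset of another candidate.

```python
def maximal_only(sets):
    """Keep only sets that are not proper subsets of another set (O(n^2))."""
    return [s for s in sets if not any(s < other for other in sets)]

candidates = [frozenset({3, 4}), frozenset({3, 4, 5}), frozenset({1, 3})]
result = maximal_only(candidates)
print(result)  # {3, 4} is dropped: it is a proper subset of {3, 4, 5}
```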
asked May 30 '18 by marillion


1 Answer

I rewrote the solution using DataFrame.apply instead of iterating, and as an optimization used numpy arrays wherever possible. I used frozenset because frozensets are immutable and hashable, so Series.unique works properly; Series.unique fails on elements of type set.

Also, I found d.loc[list(x), 'STRIP'].nunique() to be slightly faster than d.loc[list(x)].STRIP.nunique(). I'm not sure why, but I used the faster statement in the solution below.

The algorithm in plain English:

For each row, build the set of indices less than or equal to the current index whose DEPARTURE is later than the current ARRIVAL. This results in a list of sets.

Return the unique sets that are not subsets of other sets (for the 2nd algorithm, additionally require that a set refers to both STRIPs).
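These two steps can be checked end-to-end on the question's sample data in plain Python (no pandas; this is only to show what the intermediate sets look like):

```python
# Sample data from the question, already sorted by arrival.
arr = [85.00, 87.87, 88.34, 88.92, 90.03, 90.27, 92.42, 94.42]
dep = [86.00, 92.76, 89.72, 90.88, 92.77, 91.95, 93.58, 95.58]

# Step 1: for each row i, the set of rows j <= i whose departure
# is after row i's arrival (i.e. planes still present at that arrival).
sets = [frozenset(j for j in range(i + 1) if dep[j] > arr[i])
        for i in range(len(arr))]

# Step 2: keep unique sets with more than one plane that are not
# proper subsets of any other set.
events = sorted(sorted(s) for s in set(sets)
                if len(s) > 1 and not any(s < t for t in sets))
print(events)  # [[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
```

The result matches the multi-strip output shown in the Outputs section below.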

(Update) 2nd Improvement:

One small improvement comes from dropping down to the numpy layer and using np.apply_along_axis instead of df.apply. This is possible because PLANE always equals the DataFrame index, so we can work on the underlying matrix via df.values.

A major improvement came from replacing the list comprehension that returns the maximal subsets:

[list(x) for x in sets if ~np.any(sets > x)]

The above is an O(n^2) operation. On small datasets it is very fast, but on bigger datasets this statement becomes the bottleneck. To optimize it, first sort the sets, then loop through them once more to find the maximal subsets: once sorted, it is sufficient to check that elem[n] is not a subset of elem[n+1] to determine whether elem[n] is maximal. The sort compares two elements with the < operation.
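The sort-then-scan idea can be sketched in isolation (illustrative helper; the len > 1 filter from the real implementation is omitted). Note that < on sets is only a partial order, which is presumably why EDIT3 of the question still found leftover subsets in some cases:

```python
def maximal_by_sorted_scan(sets):
    """Sort (list comparison here uses < on sets), then walk from largest
    to smallest, keeping a set only when the last kept set does not
    already contain it."""
    sets = sorted(sets)
    kept, last = [], frozenset()
    for s in reversed(sets):
        if last > s:          # s is a proper subset of the last kept set
            continue
        kept.append(sorted(s))
        last = s
    return kept

chain = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
print(maximal_by_sorted_scan(chain))  # [[1, 2, 3]]
```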

Timings:

While my original implementation improved performance significantly compared to the OP's attempt, its runtime still grew explosively with input size, as the following chart shows.

I present only the timings for findMultiEvents, hal_2 & hal_5. The relative performance of findSingleEvents, hal_1 & hal_3 is comparable.

[chart: algorithm execution time vs. input size]

See the benchmarking code below.

Note that I stopped benchmarking findMultiEvents & hal_2 early, once it became evident that they were dramatically less efficient.

Improved Implementation:

def maximal_subsets(sets):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets[::-1]:
        if s0 > s or len(s) < 2:
            continue
        maximal_sets.append(list(s))
        s0 = s
    return maximal_sets

def maximal_subsets_2(sets, d):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets[::-1]:
        if s0 > s or len(s) < 2 or d.loc[list(s), 'STRIP'].nunique() < 2:
            continue
        maximal_sets.append(list(s))
        s0 = s
    return maximal_sets

# single
def hal_3(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets(sets)
# multi
def hal_5(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]), 
        1, d.values
    )
    return maximal_subsets_2(sets, d)

Original Implementation:

# findSingleEvents
def hal_1(d):
    sets = d.apply(
       lambda x: frozenset(
           d.index.values[(d.index.values <= x.name) & (d.DEPARTURE.values > x.ARRIVAL)]
       ),
       axis=1
    ).unique()
    return [list(x) for x in sets if ~np.any(sets > x) and len(x) > 1]

# findMultiEvents
def hal_2(d):
    sets = d.apply(
        lambda x: frozenset(
            d.index.values[(d.index.values <= x.name) & (d.DEPARTURE.values > x.ARRIVAL)]
        ),
        axis=1
    ).unique()
    return [list(x) for x in sets 
            if ~np.any(sets > x) and
               len(x) > 1 and 
               d.loc[list(x), 'STRIP'].nunique() == 2]

Outputs:

The outputs are identical to the OP's implementation.

hal_1(df[df.STRIP==1])
[[1, 3]]
hal_1(df[df.STRIP==2])
[[4, 5], [4, 6]]
hal_2(df)
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
hal_3(df[df.STRIP==1])
[[1, 3]]
hal_3(df[df.STRIP==2])
[[4, 5], [4, 6]]
hal_5(df)
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]

Test System Details:

OS: Windows 10
Python: 3.6 (Anaconda)
pandas: 0.22.0
numpy: 1.14.3

Benchmarking code:


import random

def mk_random_df(n):
    data =  []
    for i in range(n):
        arrival = random.uniform(0,1000)
        departure = arrival + random.uniform(2.0, 10.0)
        data.append({'PLANE':i, 'STRIP':random.randint(1, 2),'ARRIVAL':arrival,'DEPARTURE':departure})

    df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])
    df = df.sort_values(by=['ARRIVAL'])
    df = df.reset_index(drop=True)
    df.PLANE = df.index
    return df

dfs = {i: mk_random_df(100*(2**i)) for i in range(0, 10)}
times, times_2, times_5 = [], [], []

for i, v in dfs.items():
    if i < 5:
        t = %timeit -o -n 3 -r 3 findMultiEvents(v)
        times.append({'size(pow. of 2)': i, 'timings': t})

for i, v in dfs.items():
    t = %timeit -o -n 3 -r 3 hal_5(v)
    times_5.append({'size(pow. of 2)': i, 'timings': t})

for i, v in dfs.items():
    if i < 9:
        t = %timeit -o -n 3 -r 3 hal_2(v)
        times_2.append({'size(pow. of 2)': i, 'timings': t})
answered Nov 02 '22 by Haleemur Ali