Hi, I am trying to find a vectorized (or otherwise more efficient) solution to an iteration problem, where the only solution I found requires row-by-row iteration of a DataFrame with multiple loops. The actual data file is huge, so my current solution is practically infeasible. I included line profiler outputs at the very end, if you'd like to have a look. The real problem is quite complex, so I will try to explain it with a simple example (it took me quite a while to simplify it :)):
Assume we have an airport with two landing strips side by side. Each plane lands (arrival time), taxis on one of the landing strips for a while, then takes off (departure time). Everything is stored in a pandas DataFrame, sorted by arrival time, as follows (see EDIT2 below for code that generates a bigger dataset for testing):
PLANE STRIP ARRIVAL DEPARTURE
0 1 85.00 86.00
1 1 87.87 92.76
2 2 88.34 89.72
3 1 88.92 90.88
4 2 90.03 92.77
5 2 90.27 91.95
6 2 92.42 93.58
7 2 94.42 95.58
I am looking for solutions to two cases:
1. Build a list of events where more than one plane is present on a single strip at a time. Do not include subsets of events (e.g. do not show [3,4] if there is a valid [3,4,5] case). The list should store the indices of the actual DataFrame rows. See the function findSingleEvents() for a solution to this case (runs in around 5 ms).
2. Build a list of events where there is at least one plane on each strip at a time. Do not count subsets of an event; only record the event with the maximum number of planes (e.g. do not show [3,4] if there is a [3,4,5] case). Do not count events that occur fully on a single strip. The list should store the indices of the actual DataFrame rows. See the function findMultiEvents() for a solution to this case (runs in around 15 ms).
Working Code:
import numpy as np
import pandas as pd

data = [{'PLANE': 0, 'STRIP': 1, 'ARRIVAL': 85.00, 'DEPARTURE': 86.00},
        {'PLANE': 1, 'STRIP': 1, 'ARRIVAL': 87.87, 'DEPARTURE': 92.76},
        {'PLANE': 2, 'STRIP': 2, 'ARRIVAL': 88.34, 'DEPARTURE': 89.72},
        {'PLANE': 3, 'STRIP': 1, 'ARRIVAL': 88.92, 'DEPARTURE': 90.88},
        {'PLANE': 4, 'STRIP': 2, 'ARRIVAL': 90.03, 'DEPARTURE': 92.77},
        {'PLANE': 5, 'STRIP': 2, 'ARRIVAL': 90.27, 'DEPARTURE': 91.95},
        {'PLANE': 6, 'STRIP': 2, 'ARRIVAL': 92.42, 'DEPARTURE': 93.58},
        {'PLANE': 7, 'STRIP': 2, 'ARRIVAL': 94.42, 'DEPARTURE': 95.58}]
df = pd.DataFrame(data, columns=['PLANE', 'STRIP', 'ARRIVAL', 'DEPARTURE'])
def findSingleEvents(df):
    events = []
    for row in df.itertuples():
        # Create a temporary frame of all rows whose stay overlaps the current row's stay
        dfTemp = df[(row.DEPARTURE > df.ARRIVAL) & (row.ARRIVAL < df.DEPARTURE)]
        if len(dfTemp) > 1:
            # convert index values to plain ints
            current_event = [int(v) for v in dfTemp.index.tolist()]
            # loop backwards to remove rows that do not overlap every other row
            for i in reversed(current_event):
                if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
                    current_event.remove(i)
            events.append(current_event)
    # remove duplicate events
    events = list(map(list, set(map(tuple, events))))
    return events
def findMultiEvents(df):
    events = []
    for row in df.itertuples():
        # Create a temporary frame of all rows whose stay overlaps the current row's stay
        dfTemp = df[(row.DEPARTURE > df.ARRIVAL) & (row.ARRIVAL < df.DEPARTURE)]
        if len(dfTemp) > 1:
            # convert index values to plain ints
            current_event = [int(v) for v in dfTemp.index.tolist()]
            # loop backwards to remove rows that do not overlap every other row
            for i in reversed(current_event):
                if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
                    current_event.remove(i)
            # keep only events that involve both strips
            if len(df.iloc[current_event].STRIP.unique()) > 1:
                events.append(current_event)
    # remove duplicate events
    events = list(map(list, set(map(tuple, events))))
    return events
print(findSingleEvents(df[df.STRIP == 1]))
print(findSingleEvents(df[df.STRIP == 2]))
print(findMultiEvents(df))
Verified Output:
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]
Obviously, these are neither efficient nor elegant solutions. With the huge DataFrame I have, running this will probably take hours. I have thought about a vectorized approach for quite a while, but could not come up with anything solid. Any pointers/help would be welcome! I am also open to NumPy/Cython/Numba based approaches.
Thanks!
PS: If you wonder what I will do with the lists: I will assign an EVENT number to each event, and build a separate database by merging the data above with the EVENT numbers as a separate column, to be used for something else. For Case 1, it will look something like this:
EVENT PLANE STRIP ARRIVAL DEPARTURE
0     4     2     90.03   92.77
0     5     2     90.27   91.95
1     4     2     90.03   92.77
1     6     2     92.42   93.58
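For completeness, here is a minimal sketch of how such a table could be assembled from one of the event lists, using the df defined above (build_event_table is a hypothetical helper name, not part of the code in this post):

def build_event_table(df, events):
    # events: a list of lists of row indices, e.g. the output of findSingleEvents()
    frames = []
    for event_id, rows in enumerate(events):
        sub = df.loc[rows, ['PLANE', 'STRIP', 'ARRIVAL', 'DEPARTURE']].copy()
        sub.insert(0, 'EVENT', event_id)  # tag every row with its event number
        frames.append(sub)
    return pd.concat(frames, ignore_index=True)

# e.g.: build_event_table(df, findSingleEvents(df[df.STRIP == 2]))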
EDIT: Revised the code and the test data set.
EDIT2: Use the code below to generate a 1000-row (or longer) DataFrame for testing purposes (per @ImportanceOfBeingErnest's recommendation):
import random
import pandas as pd
import numpy as np

data = []
for i in range(1000):
    arrival = random.uniform(0, 1000)
    departure = arrival + random.uniform(2.0, 10.0)
    data.append({'PLANE': i, 'STRIP': random.randint(1, 2), 'ARRIVAL': arrival, 'DEPARTURE': departure})
df = pd.DataFrame(data, columns=['PLANE', 'STRIP', 'ARRIVAL', 'DEPARTURE'])
df = df.sort_values(by=['ARRIVAL'])
df = df.reset_index(drop=True)
df.PLANE = df.index
EDIT3: A modified version of the accepted answer. The accepted answer was not able to remove subsets of events; the modified version below satisfies the rule "(e.g. do not show [3,4] if there is a valid [3,4,5] case)".
def maximal_subsets_modified(sets):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets:
        # emit the previous set s0 unless the current set extends it
        if not (s > s0) and len(s0) > 1:
            not_in_list = True
            for x in maximal_sets[:]:  # iterate over a copy, since we may remove items
                if set(x) < s0:  # drop stored sets strictly contained in s0
                    maximal_sets.remove(x)
                if s0 <= set(x):  # s0 is already covered by a stored set
                    not_in_list = False
            if not_in_list:
                maximal_sets.append(list(s0))
        s0 = s
    # handle the final set after the loop
    if len(s0) > 1:
        not_in_list = True
        for x in maximal_sets[:]:
            if set(x) < s0:
                maximal_sets.remove(x)
            if s0 <= set(x):
                not_in_list = False
        if not_in_list:
            maximal_sets.append(list(s0))
    return maximal_sets
def maximal_subsets_2_modified(sets, d):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets:
        # emit the previous set s0 unless the current set extends it;
        # additionally require that s0 touches both strips
        if not (s > s0) and len(s0) > 1 and d.loc[list(s0), 'STRIP'].nunique() == 2:
            not_in_list = True
            for x in maximal_sets[:]:  # iterate over a copy, since we may remove items
                if set(x) < s0:  # drop stored sets strictly contained in s0
                    maximal_sets.remove(x)
                if s0 <= set(x):  # s0 is already covered by a stored set
                    not_in_list = False
            if not_in_list:
                maximal_sets.append(list(s0))
        s0 = s
    # handle the final set after the loop
    if len(s0) > 1 and d.loc[list(s0), 'STRIP'].nunique() == 2:
        not_in_list = True
        for x in maximal_sets[:]:
            if set(x) < s0:
                maximal_sets.remove(x)
            if s0 <= set(x):
                not_in_list = False
        if not_in_list:
            maximal_sets.append(list(s0))
    return maximal_sets
# single
def hal_3_modified(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]),
        1, d.values
    )
    return maximal_subsets_modified(sets)

# multi
def hal_5_modified(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]),
        1, d.values
    )
    return maximal_subsets_2_modified(sets, d)
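The modified functions are called the same way as the originals; on the small sample DataFrame they reproduce the verified output (list order may vary, since frozensets are only partially ordered by <):

print(hal_3_modified(df[df.STRIP == 1]))  # [[1, 3]]
print(hal_3_modified(df[df.STRIP == 2]))  # [[4, 5], [4, 6]]
print(hal_5_modified(df))                 # [[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]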
The accepted answer:

I rewrote the solution using DataFrame.apply instead of iterating, and as an optimization used NumPy arrays wherever possible. I used frozenset because frozensets are immutable and hashable, so Series.unique works properly; Series.unique fails on elements of type set.

Also, I found d.loc[list(x), 'STRIP'].nunique() to be slightly faster than d.loc[list(x)].STRIP.nunique(). I am not sure why, but I used the faster statement in the solution below.
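A quick way to check that micro-claim in IPython, using the sample df from above and an arbitrary list of row indices (x here is just an illustrative value):

x = [1, 3, 4, 5]  # any list of row indices
%timeit df.loc[x, 'STRIP'].nunique()
%timeit df.loc[x].STRIP.nunique()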
The algorithm:

1. For each row, create a set of indices lower than (or equal to) the index of the current row, whose DEPARTURE is greater than the current ARRIVAL. This results in a list of sets.
2. Return the unique sets that are not subsets of any other set (and, for the 2nd algorithm, additionally filter to sets that refer to both STRIPs). A worked illustration follows.
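As a concrete illustration, worked out by hand from the sample DataFrame above (applied to the whole frame, as in the multi-strip case), step 1 produces:

row 0 -> {0}
row 1 -> {1}
row 2 -> {1, 2}
row 3 -> {1, 2, 3}
row 4 -> {1, 3, 4}
row 5 -> {1, 3, 4, 5}
row 6 -> {1, 4, 6}
row 7 -> {7}

Step 2 then keeps {1, 2, 3}, {1, 3, 4, 5} and {1, 4, 6}: every other set is either a singleton or a subset of one of these, and each survivor covers both strips.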
One small improvement is made by dropping down to the NumPy layer and using np.apply_along_axis instead of df.apply. This is possible since PLANE always equals the dataframe index, and we can access the underlying matrix with df.values.
I found a major improvement in the list comprehension that returns the maximal subsets:

[list(x) for x in sets if ~np.any(sets > x)]

The above is an O(n^2) operation. On small datasets it is very fast, but on bigger datasets the statement becomes the bottleneck. To optimize it, first sort the sets, then loop through the elements once more to find the maximal subsets. Once sorted, it is sufficient to check that elem[n] is not a subset of elem[n+1] to determine whether elem[n] is a maximal subset. The sort procedure compares two elements with the < operation.
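For reference, < on frozensets is the proper-subset test, which is what both the sort and the subset check rely on; two sets where neither contains the other compare False in both directions:

frozenset({1}) < frozenset({1, 2})     # True: proper subset
frozenset({1, 2}) < frozenset({1, 2})  # False: equal, so not a *proper* subset
frozenset({1, 3}) < frozenset({2, 3})  # False: incomparable
frozenset({2, 3}) < frozenset({1, 3})  # False: incomparable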
While my original implementation improved performance significantly in comparison to the OP's attempt, the algorithm was still exponential-ordered, as benchmarking over growing input sizes showed.
I present only the timings for findMultiEvents, hal_2 and hal_5; the relative performance of findSingleEvents, hal_1 and hal_3 is comparable. Scroll below to see the benchmarking code. Note that I stopped benchmarking findMultiEvents and hal_2 early, once it became evident that they were less efficient by an exponential factor.
def maximal_subsets(sets):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    # walk the sorted sets from last to first
    for s in sets[::-1]:
        # skip s if it is contained in the previously kept set, or is a singleton
        if s0 > s or len(s) < 2:
            continue
        maximal_sets.append(list(s))
        s0 = s
    return maximal_sets

def maximal_subsets_2(sets, d):
    sets.sort()
    maximal_sets = []
    s0 = frozenset()
    for s in sets[::-1]:
        # additionally require that the set touches both strips
        if s0 > s or len(s) < 2 or d.loc[list(s), 'STRIP'].nunique() < 2:
            continue
        maximal_sets.append(list(s))
        s0 = s
    return maximal_sets
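Applied to the list of sets from the illustration above, maximal_subsets keeps exactly the three maximal events (output order may vary):

sets = [frozenset({0}), frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3}),
        frozenset({1, 3, 4}), frozenset({1, 3, 4, 5}), frozenset({1, 4, 6}), frozenset({7})]
print(maximal_subsets(sets))  # [[1, 4, 6], [1, 3, 4, 5], [1, 2, 3]]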
# single
def hal_3(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]),
        1, d.values
    )
    return maximal_subsets(sets)

# multi
def hal_5(d):
    sets = np.apply_along_axis(
        lambda x: frozenset(d.PLANE.values[(d.PLANE.values <= x[0]) & (d.DEPARTURE.values > x[2])]),
        1, d.values
    )
    return maximal_subsets_2(sets, d)
# findSingleEvents
def hal_1(d):
    sets = d.apply(
        lambda x: frozenset(
            d.index.values[(d.index.values <= x.name) & (d.DEPARTURE.values > x.ARRIVAL)]
        ),
        axis=1
    ).unique()
    return [list(x) for x in sets if ~np.any(sets > x) and len(x) > 1]

# findMultiEvents
def hal_2(d):
    sets = d.apply(
        lambda x: frozenset(
            d.index.values[(d.index.values <= x.name) & (d.DEPARTURE.values > x.ARRIVAL)]
        ),
        axis=1
    ).unique()
    return [list(x) for x in sets
            if ~np.any(sets > x) and
            len(x) > 1 and
            d.loc[list(x), 'STRIP'].nunique() == 2]
The outputs are identical to the OP's implementation:
hal_1(df[df.STRIP==1])
[[1, 3]]
hal_1(df[df.STRIP==2])
[[4, 5], [4, 6]]
hal_2(df)
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
hal_3(df[df.STRIP==1])
[[1, 3]]
hal_3(df[df.STRIP==2])
[[4, 5], [4, 6]]
hal_5(df)
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
Benchmarking setup:
os: windows 10
python: 3.6 (Anaconda)
pandas: 0.22.0
numpy: 1.14.3
import random

def mk_random_df(n):
    data = []
    for i in range(n):
        arrival = random.uniform(0, 1000)
        departure = arrival + random.uniform(2.0, 10.0)
        data.append({'PLANE': i, 'STRIP': random.randint(1, 2), 'ARRIVAL': arrival, 'DEPARTURE': departure})
    df = pd.DataFrame(data, columns=['PLANE', 'STRIP', 'ARRIVAL', 'DEPARTURE'])
    df = df.sort_values(by=['ARRIVAL'])
    df = df.reset_index(drop=True)
    df.PLANE = df.index
    return df

dfs = {i: mk_random_df(100 * (2**i)) for i in range(0, 10)}

times, times_2, times_5 = [], [], []
for i, v in dfs.items():
    if i < 5:
        t = %timeit -o -n 3 -r 3 findMultiEvents(v)
        times.append({'size(pow. of 2)': i, 'timings': t})
for i, v in dfs.items():
    t = %timeit -o -n 3 -r 3 hal_5(v)
    times_5.append({'size(pow. of 2)': i, 'timings': t})
for i, v in dfs.items():
    if i < 9:
        t = %timeit -o -n 3 -r 3 hal_2(v)
        times_2.append({'size(pow. of 2)': i, 'timings': t})
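A minimal sketch of how the timing chart could be rebuilt from those lists, assuming matplotlib is available (the 'timings' entries are IPython TimeitResult objects, which expose an average attribute in recent IPython versions; older versions have best instead):

import matplotlib.pyplot as plt

def plot_times(series, label):
    # each entry holds the size index i and a TimeitResult from %timeit -o
    x = [e['size(pow. of 2)'] for e in series]
    y = [e['timings'].average for e in series]
    plt.plot(x, y, marker='o', label=label)

plot_times(times, 'findMultiEvents')
plot_times(times_2, 'hal_2')
plot_times(times_5, 'hal_5')
plt.yscale('log')
plt.xlabel('dataset size (100 * 2**i rows)')
plt.ylabel('average time per run (s)')
plt.legend()
plt.show()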