I have a dataframe c
with lots of different columns. Also, arr
is a dataframe that corresponds to a subset of c
: arr = c[c['A_D'] == 'A']
.
The main idea of my code is to iterate over all rows in the c
-dataframe and search for all the possible cases (in the arr
dataframe) where some specific conditions should happen:
c['A_D'] == D
and c['Already_linked'] == 0
hour
in the arr
dataframe must be less than the hour_aux
in the c
dataframeAlready_linked
of the arr
dataframe must be zero: arr.Already_linked == 0
Terminal
and the Operator
needs to be the same in the c and arr
dataframeRight now, the conditions are stored using both Boolean indexing and groupby get_group:
arr
dataframe in order to choose the same Operator and Terminal: g = groups.get_group((row.Operator, row.Terminal
))c
dataframe and where Already_linked==0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
For each of the rows in the c
dataframe that verify all conditions, a vb
dataframe is created. Naturally, this dataframe has different lengths in each iteration. After creating the vb
dataframe, my goal is to choose the index of the vb
dataframe that minimises the time between vb.START
and c[x
]. The FightID
that corresponds to this index is then stored in the c
dataframe on column a
. Additionally, since the arrival was linked to a departure, the column Already_linked
in the arr
dataframe is changed from 0 to 1.
It is important to notice that the column Already_linked
of the arr
dataframe may change in every iteration (and arr.Already_linked == 0
is one of the conditions to create the vb
dataframe). Therefore, it is not possible to parallelize this code.
I have already used c.itertuples()
for efficiency, however since c
has millions of rows, this code is still too time consuming.
Other option would also be to use pd.apply
to every row. Nonetheless, this is not really straightforward since in each loop there are values that change in both c
and arr
(also, I believe that even with pd.apply
it would be extremely slow).
Is there any possible way to convert this for loop in a vectorized solution (or decrease the running time by 10X(if possible even more) )?
Initial dataframe:
START END A_D Operator FlightID Terminal TROUND_ID tot
0 2017-03-26 16:55:00 2017-10-28 16:55:00 A QR QR001 4 QR002 70
1 2017-03-26 09:30:00 2017-06-11 09:30:00 D DL DL001 3 " " 84
2 2017-03-27 09:30:00 2017-10-28 09:30:00 D DL DL001 3 " " 78
3 2017-10-08 15:15:00 2017-10-22 15:15:00 D VS VS001 3 " " 45
4 2017-03-26 06:50:00 2017-06-11 06:50:00 A DL DL401 3 " " 9
5 2017-03-27 06:50:00 2017-10-28 06:50:00 A DL DL401 3 " " 19
6 2017-03-29 06:50:00 2017-04-19 06:50:00 A DL DL401 3 " " 3
7 2017-05-03 06:50:00 2017-10-25 06:50:00 A DL DL401 3 " " 32
8 2017-06-25 06:50:00 2017-10-22 06:50:00 A DL DL401 3 " " 95
9 2017-03-26 07:45:00 2017-10-28 07:45:00 A DL DL402 3 " " 58
Desired Output (some of the columns were excluded in the dataframe below. Only the a
and Already_linked
columns are relevant):
START END A_D Operator a Already_linked
0 2017-03-26 16:55:00 2017-10-28 16:55:00 A QR 0 1
1 2017-03-26 09:30:00 2017-06-11 09:30:00 D DL DL402 1
2 2017-03-27 09:30:00 2017-10-28 09:30:00 D DL DL401 1
3 2017-10-08 15:15:00 2017-10-22 15:15:00 D VS No_link_found 0
4 2017-03-26 06:50:00 2017-06-11 06:50:00 A DL 0 0
5 2017-03-27 06:50:00 2017-10-28 06:50:00 A DL 0 1
6 2017-03-29 06:50:00 2017-04-19 06:50:00 A DL 0 0
7 2017-05-03 06:50:00 2017-10-25 06:50:00 A DL 0 0
8 2017-06-25 06:50:00 2017-10-22 06:50:00 A DL 0 0
9 2017-03-26 07:45:00 2017-10-28 07:45:00 A DL 0 1
Code:
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
try:
g = groups.get_group((row.Operator, row.Terminal))
vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
aux = (vb.START - row.x).abs().idxmin()
c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
arr.loc[aux, 'Already_linked'] = 1
continue
except:
continue
c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.Already_linked.loc[arr.Already_linked.index] = arr.Already_linked
c['a'] = np.where((c.Already_linked == 0) & (c.A_D == 'D'),'No_link_found',c['a'])
Code for the initial c
dataframe:
import numpy as np
import pandas as pd
import io
s = '''
A_D Operator FlightID Terminal TROUND_ID tot
A QR QR001 4 QR002 70
D DL DL001 3 " " 84
D DL DL001 3 " " 78
D VS VS001 3 " " 45
A DL DL401 3 " " 9
A DL DL401 3 " " 19
A DL DL401 3 " " 3
A DL DL401 3 " " 32
A DL DL401 3 " " 95
A DL DL402 3 " " 58
'''
data_aux = pd.read_table(io.StringIO(s), delim_whitespace=True)
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot= data_aux.tot.astype(str)
d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00','2017-03-27 09:30:00','2017-10-08 15:15:00',
'2017-03-26 06:50:00','2017-03-27 06:50:00','2017-03-29 06:50:00','2017-05-03 06:50:00',
'2017-06-25 06:50:00','2017-03-26 07:45:00'], 'END': ['2017-10-28 16:55:00' ,'2017-06-11 09:30:00' ,
'2017-10-28 09:30:00' ,'2017-10-22 15:15:00','2017-06-11 06:50:00' ,'2017-10-28 06:50:00',
'2017-04-19 06:50:00' ,'2017-10-25 06:50:00','2017-10-22 06:50:00' ,'2017-10-28 07:45:00']}
aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)
c['hour'] = pd.to_datetime(c['START'], format='%H:%M').dt.time
c['hour_aux'] = pd.to_datetime(c['START'] - pd.Timedelta(15, unit='m'),
format='%H:%M').dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START - pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != " ", 1 ,0)
arr = c[c['A_D'] == 'A']
The answer is YES. You can handle large datasets in python using Pandas with some techniques.
In order to iterate over rows, we apply a function itertuples() this function return a tuple for each row in the DataFrame. The first element of the tuple will be the row's corresponding index value, while the remaining values are the row values.
To loop through a set of code a specified number of times, we can use the range() function, The range() function returns a sequence of numbers, starting from 0 by default, and increments by 1 (by default), and ends at a specified number.
Iterating over the rows of a DataFrame You can do so using either iterrows() or itertuples() built-in methods.
While this is not a vecterized solution, it should speed things up rapidly if your sample data set mimics your true data set. Currently, you are wasting time looping over every row, but you only care about looping over rows where ['A_D'] == 'D'
and ['Already_linked'] ==0
. Instead remove the if's and loop over the truncated dataframe which is only 30% of the initial dataframe
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux)].copy().query(row.query_string)
try:
aux = (vb.START - row.x).abs().idxmin()
print(row.x)
c.loc[row.Index, 'a'] = vb.loc[aux,'FlightID']
arr.loc[aux, 'Already_linked'] = 1
continue
except:
continue
Your question was if there is a way to vectorize the for loop, but I think that question hides what you really want which is an easy way to speed your code up. For performance questions, a good starting point is always profiling. However, I have a strong suspicion that the dominant operation in your code is .query(row.query_string)
. Running that for every row is expensive if arr
is large.
For arbitrary queries, that runtime can't really be improved at all without removing dependencies between iterations and parallelizing the expensive step. You might be a bit luckier though. Your query string always checks two different columns to see if they're equal to something you care about. However, for each row that requires going through your entire slice of arr
. Since the slice changes each time, that could cause problems, but here are some ideas:
arr
each time anyway, maintain a view of just the arr.Already_Linked==0
rows so you're iterating over a smaller object.arr
by Terminal
and Operator
. Then, instead of running through all of arr
, first select the group you want and then do your slicing and filtering. This would require rethinking the exact implementation of query_string
a little bit, but the advantage is that if you have a lot of terminals and operators you'll typically be working over a much smaller object than arr
. Moreover, you wouldn't even have to query that object since that was implicitly done by the groupby.aux.hour
typically relates to row.hour_aux
, you might have improvements by sorting aux
at the beginning with respect to hour
. Just using the inequality operator you probably wouldn't see any gains, but you could pair that with a logarithmic search for the cutoff point and then just slice up to that cutoff point.arr
for every row will offer substantially more gains than just switching frameworks or vectorizing bits and pieces.Expanding on some of those points a little bit and adapting @DJK's code a bit, look at what happens when we have the following changes.
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
g = groups.get_group((row.Operator, row.Terminal))
vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
try:
aux = (vb.START - row.x).abs().idxmin()
print(row.x)
c.loc[row.Index, 'a'] = vb.loc[aux,'FlightID']
g.loc[aux, 'Already_linked'] = 1
continue
except:
continue
Part of the reason your query is so slow is because it's searching over all of arr
each time. In contrast, the .groupby()
executes in roughly the same time as one query, but then for every subsequent iteration you can just use .get_group()
to efficiently find the tiny subset of the data you care about.
A helpful (extremely crude) rule of thumb when benchmarking is that a billion things takes a second. If you're seeing much longer times than that for something measured in millions of things, like your millions of rows, that means that for each of those rows you're doing tons of things to get up to billions of operations. That leaves a ton of potential for better algorithms to reduce the number of operations, whereas vectorization really only yields constant factor improvements (and for many string/query operations not even great improvements at that).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With