Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to perform complex search on pandas dataframe

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       DEF       123     456     8000    8000
   DEF       XYZ       456     893     9999    9999
   AAA       BBB       473     917     5555    5555
   BBB       CCC       917     341     5555    5555

After search/sort:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       XYZ       123     893     8000    9999
   AAA       CCC       473     341     5555    5555

In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       DEF       123     456     8000    8000
   XYZ       DEF       893     456     9999    9999

After search/sort:

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       XYZ       123     893     8000    9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

like image 433
MaxB Avatar asked May 28 '19 14:05

MaxB


People also ask

What is the fastest way to iterate over pandas DataFrame?

Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.

Is apply faster than Iterrows?

The results show that apply massively outperforms iterrows . As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply , itertuples is quicker than iterrows , so if looping is required, try implementing itertuples instead.

Why is Itertuples faster than Iterrows?

The reason iterrows() is slower than itertuples() is due to iterrows() doing a lot of type checks in the lifetime of its call.

Is pandas query faster than LOC?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.


2 Answers

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
    x = df.to_numpy()
    # obtain a flat numpy array from both columns
    b = x[:,0:2].ravel()
    _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
    # Index of duplicate values in b
    ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
    # Indices to be used to replace the content in the columns
    replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
    # Col index of where duplicate value is, 0 or 1
    col = (ixs_drop % 2) ^ 1
    # 2d array to index and replace values in the df
    # index to obtain values with which to replace
    keep_cols = np.broadcast_to([3,5],(len(col),2))
    ixs = np.concatenate([col[:,None], keep_cols], 1)
    # translate indices to row indices
    rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
    c = np.empty((len(col), 5), dtype=x.dtype)
    c[:,::2] = x[rows_drop[:,None], ixs]
    c[:,1::2] = x[rows_replace[:,None], [2,4]]
    # update dataframe and drop rows
    df.iloc[rows_replace, 1:] = c
    return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      DEF        XYZ    456      893    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555

remove_middle_dest(df)

    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      893    8000      9999
2      AAA        CCC    473      341    5555      5555

This approach does not assume any particular order in terms of the rows where the duplicate is, and the same applies to the columns (to cover the edge case described in the question). If we use for instance the following dataframe:

    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      XYZ        DEF    893      456    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555

remove_middle_dest(df)

     flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      456    8000      9999
2      AAA        CCC    473      341    5555      5555
like image 25
yatu Avatar answered Sep 21 '22 14:09

yatu


This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d={k: v for d in L for k, v in d.items()}

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
         flightTo flightFrom  toNum  fromNum  toCode  fromCode
flightTo                                                      
0             ABC        XYZ    123      893    8000      9999
1             AAA        CCC    473      341    5555      5555

Installation networkx

  • Pip: pip install networkx
  • Anaconda: conda install -c anaconda networkx
like image 199
BENY Avatar answered Sep 20 '22 14:09

BENY