I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
Before:
flightTo flightFrom toNum fromNum toCode fromCode
ABC      DEF        123   456     8000   8000
DEF      XYZ        456   893     9999   9999
AAA      BBB        473   917     5555   5555
BBB      CCC        917   341     5555   5555
After search/sort:
flightTo flightFrom toNum fromNum toCode fromCode
ABC      XYZ        123   893     8000   9999
AAA      CCC        473   341     5555   5555
In this example I am essentially trying to filter out 'flights' that sit in between the end destinations. I suspect this should be done with some sort of drop-duplicates method, but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated; I'm trying hard to figure this out.
possible edge case:
What if the data is switched up and our end connections are in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC     DEF     123  456  8000  8000
XYZ     DEF     893  456  9999  9999
After search/sort:
flight1 flight2 1Num 2Num 1Code 2Code
ABC     XYZ     123  893  8000  9999
This case logically shouldn't happen. After all, how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ.
Vectorization is always the first and best choice. You can convert the dataframe to a NumPy array or to dictionary format to speed up the iteration workflow. Iterating through the key-value pairs of a dictionary turns out to be the fastest approach, with around a 280x speedup for 20 million records.
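As a rough illustration of that claim, here is a minimal sketch with a made-up two-column dataframe (the 280x figure comes from the benchmark cited above, not from this snippet):

import pandas as pd

df = pd.DataFrame({'a': range(100_000), 'b': range(100_000)})

# slow: iterrows builds a Series object for every row
total = sum(row['a'] + row['b'] for _, row in df.iterrows())

# fast: convert to a list of plain dicts first, then iterate
records = df.to_dict('records')
total = sum(r['a'] + r['b'] for r in records)

# faster still: skip iteration entirely and vectorize on the NumPy array
arr = df.to_numpy()
total = (arr[:, 0] + arr[:, 1]).sum()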
The results show that apply massively outperforms iterrows. As mentioned previously, this is because apply is optimized to loop through dataframe rows much more quickly than iterrows. While slower than apply, itertuples is quicker than iterrows, so if looping is required, try itertuples instead.
The reason iterrows() is slower than itertuples() is that iterrows() does a lot of type checks over the lifetime of its call.
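To make the comparison concrete, here is a minimal sketch of the three looping styles side by side (column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'toNum': [123, 456], 'fromNum': [456, 893]})

# iterrows: yields (index, Series) pairs; constructing a Series per row,
# with its dtype checks, is what makes it slow
for _, row in df.iterrows():
    total = row['toNum'] + row['fromNum']

# itertuples: yields lightweight namedtuples, so it is much quicker
for row in df.itertuples(index=False):
    total = row.toNum + row.fromNum

# apply: pandas drives the loop internally, typically faster than iterrows
sums = df.apply(lambda row: row['toNum'] + row['fromNum'], axis=1)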
Whether query or loc is more efficient depends on the size of the dataframe: query tends to come out ahead on large frames, while on a small frame (e.g. 2K records x 6 columns) loc tends to be much more efficient than query.
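A minimal sketch of the two filtering styles (the crossover point depends on your data, so measure with %timeit before committing to either):

import pandas as pd

df = pd.DataFrame({'fromCode': [8000, 9999, 5555], 'toCode': [8000, 9999, 5555]})

# boolean indexing via loc: little fixed overhead, tends to win on small frames
res_loc = df.loc[df['fromCode'] == 5555]

# query: parses the expression string first, so its fixed overhead
# only pays off on larger frames
res_query = df.query('fromCode == 5555')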
Here's a NumPy solution, which may be convenient if performance is relevant:
import numpy as np

def remove_middle_dest(df):
    x = df.to_numpy()
    # obtain a flat NumPy array from both airport columns
    b = x[:, 0:2].ravel()
    _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
    # indices of duplicate values in b (the intermediate destinations)
    ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
    # flat indices of the first occurrence of each duplicate,
    # i.e. where the content will be replaced
    replace_at = (inv[:, None] == inv[ixs_drop]).argmax(0)
    # column index of where the duplicate value is, 0 or 1
    col = (ixs_drop % 2) ^ 1
    # 2d array to index and replace values in the df;
    # indices used to obtain the values with which to replace
    keep_cols = np.broadcast_to([3, 5], (len(col), 2))
    ixs = np.concatenate([col[:, None], keep_cols], 1)
    # translate flat indices to row indices
    rows_drop, rows_replace = ixs_drop // 2, replace_at // 2
    c = np.empty((len(col), 5), dtype=x.dtype)
    c[:, ::2] = x[rows_drop[:, None], ixs]
    c[:, 1::2] = x[rows_replace[:, None], [2, 4]]
    # update the kept rows and drop the intermediate ones
    df.iloc[rows_replace, 1:] = c
    return df.drop(rows_drop)
Which for the proposed dataframe yields the expected output:
print(df)
  flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      DEF        XYZ    456      893    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555
remove_middle_dest(df)
  flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      893    8000      9999
2      AAA        CCC    473      341    5555      5555
This approach does not assume any particular order for the rows containing the duplicates, and the same applies to the columns (which covers the edge case described in the question). If we use, for instance, the following dataframe:
  flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      XYZ        DEF    893      456    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555
remove_middle_dest(df)
  flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      456    8000      9999
2      AAA        CCC    473      341    5555      5555
This is a network problem, so we can use networkx. Notice that here you can have more than two stops, which means you can have a case like NY-DC-WA-NC:
import networkx as nx

# create the nx graph object from the pandas dataframe
G = nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# get the connected components: airports in the same component
# are linked to each other in the network graph
l = list(nx.connected_components(G))

# build the mapping dict: since every airport in a component is
# connected to the others, we can map each of them to the
# component's index
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]
d = {k: v for d in L for k, v in d.items()}

# dict for groupby: the _to columns take their first item and the
# _from columns their last item
grouppd = dict(zip(df.columns.tolist(), ['first', 'last'] * 3))

# agg with this dict yields the expected output
df.groupby(df.flightTo.map(d)).agg(grouppd)
Out[22]:
         flightTo flightFrom  toNum  fromNum  toCode  fromCode
flightTo
0             ABC        XYZ    123      893    8000      9999
1             AAA        CCC    473      341    5555      5555
Installation of networkx:
pip install networkx
conda install -c anaconda networkx