Longest path finding with condition

Tags:

I'm trying to solve a problem in Python/Pandas which I think is closely related to the longest path algorithm. The DataFrame I'm working with has the following structure:

Click to copy

import numpy as np 
import pandas as pd

data = {
    "cusID": ["001", "001", "001", "001", "001", "001", "002", "002", "002"],
    "start": ["A", "B", "C", "D", "A", "E", "B", "C", "D"],
    "end": ["B", "C", "D", "A", "E", "A", "C", "D", "E"]
}
df = pd.DataFrame(data)
print(df)

  cusID start end
0   001     A   B
1   001     B   C
2   001     C   D
3   001     D   A
4   001     A   E
5   001     E   A
6   002     B   C
7   002     C   D
8   002     D   E

For each customer, I want to find the longest sequence which does not contain A. For instance, for customer 001, the sequence can be viewed as follows:

A -> B -> C -> D -> A -> E -> A

where B -> C -> D is the longest sequence of size 3.

The resulting DataFrame I'm looking for is the following:

Click to copy

  cusID longestSeq
0   001          3
1   002          4

Although I wasn't able to write much code to solve this problem, here are some of my thoughts: First, it's obvious that I need to group the original DataFrame by cusID to analyse each of the two sequences separately. One idea I had was to apply some function to transform the DataFrame into this format:

Click to copy

  cusID                       seq
0   001     [A, B, C, D, A, E, A]
1   002              [B, C, D, E]

and then working on each list separately, and use some sort of counter to take the maximum of the length of the paths that exclude A. My problem is transcribing that logic into code (if correct). Any help would be appreciated.

843

asked Nov 05 '20 11:11

glpsx

1 Answers

First step is to normalize the sequences.

Click to copy

seqs = pd.concat([
    df.drop(columns="end").rename(columns={"start":"node"}),
    df.groupby("cusID").tail(1).drop(columns="start").rename(columns={"end":"node"})
])
seqs = seqs.sort_values("cusID", kind="mergesort").reset_index(drop=True)

>>> seqs
   cusID node
0    001    A
1    001    B
2    001    C
3    001    D
4    001    A
5    001    E
6    001    A
7    002    B
8    002    C
9    002    D
10   002    E

Then, using zero_runs we define:

Click to copy

def longest_non_a(seq):
    eqa = seq == "A"
    runs = zero_runs(eqa)
    return (runs[:,1] - runs[:,0]).max()
    
result = seqs.groupby("cusID")["node"].apply(longest_non_a)

>>> result
cusID
001    3
002    4
Name: node, dtype: int64

answered Oct 17 '22 13:10

orlp

Related questions
                            
                                Split a Python List into Chunks with Maximum Memory Size
                            
                                How can I add an element to a PyTorch tensor along a certain dimension?
                            
                                Renumbering line by line
                            
                                How to display a pandas dataframe within a VBOX using ipywidgets
                            
                                Breaking change for google-api-python-client 1.8.1 - AttributeError: module 'googleapiclient' has no attribute '__version__'
                            
                                Pydantic model for array of jsons
                            
                                Computing `AB⁻¹` with `np.linalg.solve()`
                            
                                Why can I not assign `cls.__hash__ = id`?
                            
                                Tkinter how to bind to shift+tab
                            
                                3D Gridded Data Interpolation in Julia
                            
                                AttributeError: 'tuple' object has no attribute 'rank' when calling fit on a Keras model with custom generator
                            
                                How to get numpy working properly in Anaconda Python 3.7.6
                            
                                How to scrape all topics from twitter
                            
                                What is a good design pattern to combine datasets that are related but stored in different dataframes?
                            
                                Tensorflow-gpu issue (CUDA runtime error: device kernel image is invalid)
                            
                                Prefect how to avoid rerunning a task
                            
                                Keras - no good way to stop and resume training?
                            
                                Pandas DataFrame filter rows using another DataFrame Column
                            
                                PyTorch - RuntimeError: [enforce fail at inline_container.cc:209] . file not found: archive/data.pkl
                            
                                Python(17874,0x111e92dc0) malloc: can't allocate region

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Longest path finding with condition

Tags:

python

algorithm

pandas

dataframe

glpsx

People also ask

1 Answers

orlp

Recent Activity

Donate For Us