I'm trying to solve a problem in Python/Pandas which I think is closely related to the longest path algorithm. The DataFrame I'm working with has the following structure:
import numpy as np
import pandas as pd
data = {
"cusID": ["001", "001", "001", "001", "001", "001", "002", "002", "002"],
"start": ["A", "B", "C", "D", "A", "E", "B", "C", "D"],
"end": ["B", "C", "D", "A", "E", "A", "C", "D", "E"]
}
df = pd.DataFrame(data)
print(df)
cusID start end
0 001 A B
1 001 B C
2 001 C D
3 001 D A
4 001 A E
5 001 E A
6 002 B C
7 002 C D
8 002 D E
For each customer, I want to find the longest sequence which does not contain A. For instance, for customer 001, the sequence can be viewed as follows:
A -> B -> C -> D -> A -> E -> A
where B -> C -> D is the longest sequence of size 3.
The resulting DataFrame I'm looking for is the following:
cusID longestSeq
0 001 3
1 002 4
Although I wasn't able to write much code to solve this problem, here are some of my thoughts: First, it's obvious that I need to group the original DataFrame by cusID to analyse each of the two sequences separately. One idea I had was to apply some function to transform the DataFrame into this format:
cusID seq
0 001 [A, B, C, D, A, E, A]
1 002 [B, C, D, E]
and then working on each list separately, and use some sort of counter to take the maximum of the length of the paths that exclude A. My problem is transcribing that logic into code (if correct). Any help would be appreciated.
Acyclic graphsA longest path between two given vertices s and t in a weighted graph G is the same thing as a shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G.
We can call the DFS function from every node and traverse for all its children. The recursive formula will be: dp[node] = max(dp[node], 1 + max(dp[child1], dp[child2], dp[child3]..)) At the end check for the maximum value in dp[] array, which will be the longest path in the DAG.
We can find the longest path using two BFSs. The idea is based on the following fact: If we start BFS from any node x and find a node with the longest distance from x, it must be an endpoint of the longest path.
First step is to normalize the sequences.
seqs = pd.concat([
df.drop(columns="end").rename(columns={"start":"node"}),
df.groupby("cusID").tail(1).drop(columns="start").rename(columns={"end":"node"})
])
seqs = seqs.sort_values("cusID", kind="mergesort").reset_index(drop=True)
>>> seqs
cusID node
0 001 A
1 001 B
2 001 C
3 001 D
4 001 A
5 001 E
6 001 A
7 002 B
8 002 C
9 002 D
10 002 E
Then, using zero_runs
we define:
def longest_non_a(seq):
eqa = seq == "A"
runs = zero_runs(eqa)
return (runs[:,1] - runs[:,0]).max()
result = seqs.groupby("cusID")["node"].apply(longest_non_a)
>>> result
cusID
001 3
002 4
Name: node, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With