I'm trying to link multiple rows from a DataFrame in order to get all possible paths formed by connecting receiver ids to sender ids.
Here is an example of my DataFrame:
transaction_id sender_id receiver_id amount
0 213234 002 125 10
1 223322 017 354 90
2 343443 125 689 70
3 324433 689 233 5
4 328909 354 456 10
created with:
df = pd.DataFrame(
{'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)
Result of my code should be list of chained ids and the total amount for the transaction chain. For the first two rows in the above example, that's something like:
[('002', '125', '689', '233'), 85]
[('017', '354', '456'), 100]
I already tried to iterate through the rows and convert each row to an instance of a Node
class, and then used methods for traversing a linked list, but I have no idea what the next step is here:
class Node:
def __init__(self,transaction_id,sender,receiver,amount):
self.transac = transaction_id
self.val = sender_id
self.next = receiver_id
self.amount = amount
def traverse(self):
node = self # start from the head node
while node != None:
print (node.val) # access the node value
node = node.next # move on to the next node
for index, row in customerTransactionSqlDf3.iterrows():
index = Node(
row["transaction_id"],
row["sender_id"],
row["receiver_id"],
row["amount"]
)
Additional information:
To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.
Pandas DataFrame sum() Method The sum() method adds all values in each column and returns the sum for each column. By specifying the column axis ( axis='columns' ), the sum() method searches column-wise and returns the sum of each row.
sum() function returns the sum of the values for the requested axis. Parameters: axis : {index (0), columns (1)}
You have a directed graph, with the edges formed by id -> id
connections. You are trying to enumerate all paths through this graph. This is actually much easier by not using linked lists.
Note that your linked list implementation doesn't actually link the nodes; your next
values would have to be referencing other Node
instances, not an id
.
Because your paths can't have loops, the graph is said to be acyclic. Your paths are really simple too, as you say that there is never more than one receiver id per sender id.
Create a new view into your dataframe with the sender_id as the index, together with the receiver id and amount columns; this will make it very easy to find the next path elements. You can then iterate over these columns and traverse their paths and sum their amounts, trivially. The following code uses already found paths to avoid having to traverse those paths again:
# receiver and amount rows, indexed by sender
edges = df[['sender_id', 'receiver_id', 'amount']].set_index('sender_id')
paths = {} # sender -> [sender, receiver, receiver, receiver, ...]
totals = {} # sender -> total amount
for sender, next_, amount in edges.itertuples():
path = paths[sender] = [sender, next_]
totals[sender] = amount
while True:
if next_ in paths:
# re-use already found path
path += paths[next_]
totals[sender] += totals[next_]
break
try:
next_, amount = edges.loc[next_]
except KeyError:
break # path complete
path.append(next_)
totals[sender] += amount
The code could be made more efficient still by updating every sub-path encountered, so when you process the 3rd row for sender id 125
, you already have that path processed because you had to traverse it for the path starting at002
for the first row:
for sender, next_, amount in edges.itertuples():
if sender in paths:
# already handled as part of a longer path
continue
paths[sender], totals[sender] = [sender, next_], amount
senders = [sender] # all sender ids along the path
while True:
if next_ in paths:
# re-use already found path
for sender in senders:
paths[sender] += paths[next_]
totals[sender] += totals[next_]
break
if next_ not in edges.index:
break # path complete
# start a new path from this sender id
paths[next_], totals[next_] = [next_], 0
senders.append(next_)
next_, amount = edges.loc[next_]
for sender in senders:
paths[sender].append(next_)
totals[sender] += amount
Either way, you now have the full paths and totals worked out for all your transactions. You can turn those back into additional columns:
df['path'], df['total'] = df.sender_id.map(paths), df.sender_id.map(totals)
For your input dataframe, that produces:
transaction_id sender_id receiver_id amount path total
0 213234 002 125 10 [002, 125, 689, 233] 85
1 223322 017 354 90 [017, 354, 456] 100
2 343443 125 689 70 [125, 689, 233] 75
3 324433 689 233 5 [689, 233] 5
4 328909 354 456 10 [354, 456] 10
Alternatively, you can pair up the paths and the totals by looping over either dictionary:
for id, path in paths.items():
print(id, path, totals[id])
which, again for your specific example, produces:
002 ['002', '125', '689', '233'] 85
125 ['125', '689', '233'] 75
689 ['689', '233'] 5
017 ['017', '354', '456'] 100
354 ['354', '456'] 10
I have no idea what the next step is here
By using your current implementation, you can connect the two Node
objects by iterating each nodes. You can also add visited
property in the Node
class so that you can identify unique chain as you traverse through the tree i.e. there is not one chain that is a sub-chain of another chain. However, if you want to know the chain for each sender_id
, this may be not necessary.
Edit: I noticed that you mentioned the example of the expected result is for the first two rows. This implies that each sender_id
should have their own chain. Modifying the traverse
method so that it can be used after the nodes are all connected.
Edit: Reimplementing visited
property to get unique chain
df = pd.DataFrame(
{'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)
class Node:
def __init__(self,transaction_id,sender_id,receiver_id,amount):
self.transac = transaction_id
self.sender = sender_id
self.receiver = receiver_id
self.next = None
self.amount = amount
self.visited = False
def traverse(self, chain=None, total=0):
if (self.visited): # undo visited nodes
return
self.visited = True
if chain is None: # this is the beginning of the traversal
chain = [self.sender]
chain += [self.receiver]
total += self.amount
if self.next is not None:
return self.next.traverse(chain, total)
return chain, total
transc = [Node(
row["transaction_id"],
row["sender_id"],
row["receiver_id"],
row["amount"]
) for i, row in df.iterrows()]
# connect the nodes
for i, v in enumerate(transc):
for j, k in enumerate(transc):
# if the receiver v same as the sender from j
if v.receiver == k.sender:
v.next = k
summary = [i.traverse() for i in transc]
summary = [i for i in summary if i is not None] # removing None
print(summary)
The output:
[
(['002', '125', '689', '233'], 85),
(['017', '354', '456'], 100)
]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With