Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summing a transaction chain in a dataframe, rows linked by column values

Tags:

python

pandas

I'm trying to link multiple rows from a DataFrame in order to get all possible paths formed by connecting receiver ids to sender ids.

Here is an example of my DataFrame:

   transaction_id sender_id receiver_id  amount
0          213234       002         125      10
1          223322       017         354      90
2          343443       125         689      70
3          324433       689         233       5
4          328909       354         456      10

created with:

df = pd.DataFrame(
    {'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
     'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
     'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
     'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)

Result of my code should be list of chained ids and the total amount for the transaction chain. For the first two rows in the above example, that's something like:

[('002', '125', '689', '233'), 85]
[('017', '354', '456'), 100]

I already tried to iterate through the rows and convert each row to an instance of a Node class, and then used methods for traversing a linked list, but I have no idea what the next step is here:

class Node:
    def __init__(self,transaction_id,sender,receiver,amount):
        self.transac = transaction_id
        self.val = sender_id
        self.next = receiver_id
        self.amount = amount
    def traverse(self):
        node = self # start from the head node
        while node != None:
            print (node.val) # access the node value
            node = node.next # move on to the next node

for index, row in customerTransactionSqlDf3.iterrows():
    index = Node( 
        row["transaction_id"],
        row["sender_id"],
        row["receiver_id"],
        row["amount"]
    )

Additional information:

  • The sender_id values are unique, for each sender id there is only one possible transaction chain.
  • There are no cycles, there is never a chain where the receiver id points back to a sender id in the same path.
like image 436
Benjamin Lenglet Avatar asked Oct 16 '22 10:10

Benjamin Lenglet


People also ask

How do you sum across rows in a data frame?

To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.

How do you sum rows by specific columns in a Pandas DataFrame in Python?

Pandas DataFrame sum() Method The sum() method adds all values in each column and returns the sum for each column. By specifying the column axis ( axis='columns' ), the sum() method searches column-wise and returns the sum of each row.

How do I sum specific values in pandas?

sum() function returns the sum of the values for the requested axis. Parameters: axis : {index (0), columns (1)}


2 Answers

You have a directed graph, with the edges formed by id -> id connections. You are trying to enumerate all paths through this graph. This is actually much easier by not using linked lists.

Note that your linked list implementation doesn't actually link the nodes; your next values would have to be referencing other Node instances, not an id.

Because your paths can't have loops, the graph is said to be acyclic. Your paths are really simple too, as you say that there is never more than one receiver id per sender id.

Create a new view into your dataframe with the sender_id as the index, together with the receiver id and amount columns; this will make it very easy to find the next path elements. You can then iterate over these columns and traverse their paths and sum their amounts, trivially. The following code uses already found paths to avoid having to traverse those paths again:

# receiver and amount rows, indexed by sender
edges = df[['sender_id', 'receiver_id', 'amount']].set_index('sender_id')
paths = {}   # sender -> [sender, receiver, receiver, receiver, ...]
totals = {}  # sender -> total amount

for sender, next_, amount in edges.itertuples():
    path = paths[sender] = [sender, next_]
    totals[sender] = amount
    while True:
        if next_ in paths:
            # re-use already found path
            path += paths[next_]
            totals[sender] += totals[next_]
            break

        try:
            next_, amount = edges.loc[next_]
        except KeyError:
            break  # path complete

        path.append(next_)
        totals[sender] += amount

The code could be made more efficient still by updating every sub-path encountered, so when you process the 3rd row for sender id 125, you already have that path processed because you had to traverse it for the path starting at002 for the first row:

for sender, next_, amount in edges.itertuples():
    if sender in paths:
        # already handled as part of a longer path
        continue

    paths[sender], totals[sender] = [sender, next_], amount
    senders = [sender]  # all sender ids along the path

    while True:
        if next_ in paths:
            # re-use already found path
            for sender in senders:
                paths[sender] += paths[next_]
                totals[sender] += totals[next_]
            break

        if next_ not in edges.index:
            break  # path complete

        # start a new path from this sender id
        paths[next_], totals[next_] = [next_], 0
        senders.append(next_)

        next_, amount = edges.loc[next_]
        for sender in senders:
            paths[sender].append(next_)
            totals[sender] += amount

Either way, you now have the full paths and totals worked out for all your transactions. You can turn those back into additional columns:

df['path'], df['total'] = df.sender_id.map(paths), df.sender_id.map(totals)

For your input dataframe, that produces:

   transaction_id sender_id receiver_id  amount                  path  total
0          213234       002         125      10  [002, 125, 689, 233]     85
1          223322       017         354      90       [017, 354, 456]    100
2          343443       125         689      70       [125, 689, 233]     75
3          324433       689         233       5            [689, 233]      5
4          328909       354         456      10            [354, 456]     10

Alternatively, you can pair up the paths and the totals by looping over either dictionary:

for id, path in paths.items():
    print(id, path, totals[id])

which, again for your specific example, produces:

002 ['002', '125', '689', '233'] 85
125 ['125', '689', '233'] 75
689 ['689', '233'] 5
017 ['017', '354', '456'] 100
354 ['354', '456'] 10
like image 94
Martijn Pieters Avatar answered Nov 15 '22 09:11

Martijn Pieters


I have no idea what the next step is here

By using your current implementation, you can connect the two Node objects by iterating each nodes. You can also add visited property in the Node class so that you can identify unique chain as you traverse through the tree i.e. there is not one chain that is a sub-chain of another chain. However, if you want to know the chain for each sender_id, this may be not necessary.

Edit: I noticed that you mentioned the example of the expected result is for the first two rows. This implies that each sender_id should have their own chain. Modifying the traverse method so that it can be used after the nodes are all connected.

Edit: Reimplementing visited property to get unique chain

df = pd.DataFrame(
    {'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
     'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
     'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
     'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)

class Node:
    def __init__(self,transaction_id,sender_id,receiver_id,amount):
        self.transac = transaction_id
        self.sender = sender_id
        self.receiver = receiver_id
        self.next = None
        self.amount = amount
        self.visited = False
    def traverse(self, chain=None, total=0):
        if (self.visited): # undo visited nodes
            return
        self.visited = True
        if chain is None: # this is the beginning of the traversal
            chain = [self.sender]
        chain += [self.receiver]
        total += self.amount
        if self.next is not None:
            return self.next.traverse(chain, total)
        return chain, total

transc = [Node( 
        row["transaction_id"],
        row["sender_id"],
        row["receiver_id"],
        row["amount"]
    ) for i, row in df.iterrows()]

# connect the nodes
for i, v in enumerate(transc):
    for j, k in enumerate(transc):
        # if the receiver v same as the sender from j
        if v.receiver == k.sender:
            v.next = k


summary = [i.traverse() for i in transc]
summary = [i for i in summary if i is not None] # removing None

print(summary)

The output:

[
    (['002', '125', '689', '233'], 85), 
    (['017', '354', '456'], 100)
]
like image 41
dee cue Avatar answered Nov 15 '22 08:11

dee cue