 

Find all descendants for points in Python

I need to get all descendant points of links represented as side_a - side_b pairs (in one dataframe), following each side_a until its end_point (in another dataframe) is reached. So:

df1:
side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

df2:
side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s

The goal is to collect, for each side_a value, all points up to the end_point given for that value in df2. If a value has two end_point values (like "k" does), there should be two lists.

I have some code, but it's not written with this approach: it drops all rows from df1 where df1['side_a'] == df2['end_point'], and that causes certain problems. If someone wants me to post the code I will, of course.

The desired output would be something like this:

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

One more thing: if both sides are the same, that point doesn't need to be listed at all; I can append it later, whichever is easier.

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    new_df = new_df.append(temp_df)

The problem with this code is that it only works if I drop the rows where side_a is equal to the end_point from df2. I don't know how to implement the condition that, once an end_point from df2 shows up in the side_b column, the traversal should stop and go no further.

Any help or hint is welcomed here, truly. Thanks in advance.

jovicbg asked Apr 17 '18 20:04


3 Answers

You can use networkx library and graphs:

import networkx as nx
G = nx.from_pandas_edgelist(df1, source='side_a', target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
                     nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)

Output:

  side_a  end_point
0      a     [b, c]
1      b        [c]
2      c         []
3      k     [l, m]
4      k     [l, n]
5      l        [m]
6      l        [n]
7      p  [q, r, s]
8      q     [r, s]
9      r        [s]
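
For readers who want to see what the path lookup does without installing networkx, here is a minimal sketch using a breadth-first search over plain Python data. It is not the code above, only an illustration: `bfs_path`, `edges`, and `pairs` are hypothetical names, and the data is copied from df1 and df2 in the question. One assumption to flag: this sketch follows links only from side_a to side_b, whereas from_pandas_edgelist builds an undirected graph unless create_using is passed.

```python
from collections import deque

# Edges copied from df1 in the question (side_a -> side_b)
edges = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('k', 'l'), ('l', 'm'),
         ('l', 'n'), ('p', 'q'), ('q', 'r'), ('r', 's')]
# (side_a, end_point) pairs copied from df2 in the question
pairs = [('a', 'c'), ('b', 'c'), ('c', 'c'), ('k', 'm'), ('k', 'n'),
         ('l', 'm'), ('l', 'n'), ('p', 's'), ('q', 's'), ('r', 's')]

def bfs_path(edges, start, goal):
    """Return the shortest directed path from start to goal, or None."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in children.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Drop the starting point from each path, as the [1:] slice does above
rows = [(a, bfs_path(edges, a, e)[1:]) for a, e in pairs]
```

Each entry of `rows` pairs a side_a value with the points that follow it, so the list can be passed to pd.DataFrame(rows, columns=['side_a', 'end_point']) if a dataframe is needed.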
Scott Boston answered Nov 01 '22 04:11


Your rules are inconsistent and your definitions are unclear, so you may need to add some constraints here and there. Organizing the data structure to fit the problem and building a more robust traversal function (shown below) makes it easier to add or edit constraints as needed and to solve the problem completely.

Transform the df to a dict to better represent a tree structure

This problem is a lot simpler if you transform the data structure to be more intuitive to the problem, instead of trying to solve the problem in the context of the current structure.

import pandas as pd

## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})

## Instantiate blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii : set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    # add() keeps multi-character names whole; list() on a string would split it
    tree[row['side_a']].add(row['side_b'])
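
As a rough sketch of the same idea without iterrows, the mapping can be built from plain (side_a, side_b) pairs. The names `build_tree` and `edges` are hypothetical, and the pairs are copied from df1 in the question; with a dataframe you would presumably pass zip(df['side_a'], df['side_b']) instead.

```python
# Edges copied from df1 in the question (side_a -> side_b)
edges = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('k', 'l'), ('l', 'm'),
         ('l', 'n'), ('p', 'q'), ('q', 'r'), ('r', 's')]

def build_tree(edges):
    """Map every node to the set of its direct children."""
    tree = {}
    for parent, child in edges:
        tree.setdefault(parent, set()).add(child)
        # Make sure leaf nodes also get an (empty) entry
        tree.setdefault(child, set())
    return tree

tree = build_tree(edges)
```

Leaves such as 'd' or 's' end up with an empty set, which is what the traversal relies on to detect the end of a path.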

Traverse the Tree

This is much more straightforward now that the data structure is intuitive. Any standard depth-first search (DFS) algorithm with path saving will do the trick. I modified the one in the link to work with this example.

Edit: Reading again, it looks like you have a search-termination condition in end_point (you need to be clearer in your question about what is input and what is output). You can change the function to dfs_paths(tree, target, root) and adjust the termination condition to return only the correct paths.

## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition.
            ### I set it to terminate search at the end of each path.
            ### You can edit the termination condition to fit the
            ### constraints of your goal
            if not tree[nextNode]:
                # Wrap names in a list/set so multi-character names are not split
                yield set(path + [nextNode]) - {root}
            else:
                stack.append((nextNode, path + [nextNode]))
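
The termination condition mentioned in the edit above could look roughly like this. It is a sketch under my own assumptions, not a tested drop-in: `dfs_paths_to` is a hypothetical name, the `tree` literal is written out by hand from df1 in the question, and paths are returned as ordered lists (without the root) rather than sets.

```python
# Adjacency mapping written out from df1 in the question
tree = {'a': {'b'}, 'b': {'c'}, 'c': {'d'}, 'd': set(),
        'k': {'l'}, 'l': {'m', 'n'}, 'm': set(), 'n': set(),
        'p': {'q'}, 'q': {'r'}, 'r': {'s'}, 's': set()}

def dfs_paths_to(tree, root, target):
    """Yield every path from root to target, leaving out root itself."""
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        for next_node in tree.get(node, ()):
            if next_node in path:
                continue  # skip nodes already on this path (cycle guard)
            if next_node == target:
                # Stop here: do not walk past the requested end point
                yield path[1:] + [next_node]
            else:
                stack.append((next_node, path + [next_node]))

paths_a = list(dfs_paths_to(tree, 'a', 'c'))
paths_k = list(dfs_paths_to(tree, 'k', 'n'))
```

Calling it once per (side_a, end_point) row of df2 gives one list per row, so "k" produces separate results for "m" and "n". When root and target are the same, as for "c", nothing is yielded.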
        

Build a dataframe from the generators we yielded

If you're not super comfortable with generators, you can structure the DFS traversal so that it outputs a list instead of a generator.

set_a = []
end_points = []
## Run the DFS from every item and record one row per yielded path
for item in all_items:
    for path in dfs_paths(tree, item):
        set_a.append(item)
        end_points.append(path)

## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')

Output

df_2[['set_a','end_points']]


set_a   end_points
a       {b, c, d}
b       {c, d}
c       {d}
k       {n, l}
k       {m, l}
l       {n}
l       {m}
p       {s, r, q}
q       {s, r}
r       {s}
Brendan Frick answered Nov 01 '22 03:11


If you're OK with an extra import, this can be posed as a path problem on a graph and solved in a handful of lines using NetworkX:

import networkx

g = networkx.DiGraph(zip(df1.side_a, df1.side_b))

outdf = df2.apply(lambda row: [row.side_a, 
                               set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point)) - {row.side_a}], 
                  axis=1)    

outdf looks like this. Note that this contains sets instead of lists as in your desired output - this allows all the paths to be combined in a simple way.

  side_a  end_point
0      a     {c, b}
1      b        {c}
2      c         {}
3      k     {l, m}
4      k     {l, n}
5      l        {m}
6      l        {n}
7      p  {r, q, s}
8      q     {r, s}
9      r        {s}
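
To make the set-combining step concrete, here is a small sketch in plain Python, independent of networkx. `nodes_on_paths` is a hypothetical helper, and the "diamond" paths are made-up illustration data that does not come from the question.

```python
def nodes_on_paths(paths, start):
    """Union of all nodes on the given paths, minus the start node."""
    return set().union(*paths) - {start}

# One path, as for k -> m in the question's data
single = nodes_on_paths([['k', 'l', 'm']], 'k')

# Two alternative paths to the same end point (made-up example):
# both routes are merged into a single set
diamond = nodes_on_paths([['a', 'b', 'd'], ['a', 'c', 'd']], 'a')

# No paths at all gives an empty set, which appears to be what
# happens for the c row in the output above
no_paths = nodes_on_paths([], 'c')
```

The trade-off is that a set loses the order of the points along the path, so this suits cases where only membership matters.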
chthonicdaemon answered Nov 01 '22 04:11