I need to get all descendants point of links represented with side_a - side_b (in one dataframe) until reach for each side_a their end_point (in other dataframe). So:
df1:
side_a side_b
a b
b c
c d
k l
l m
l n
p q
q r
r s
df2:
side_a end_point
a c
b c
c c
k m
k n
l m
l n
p s
q s
r s
The point is to get all points for each side_a value until reach end_point from df2 for that value. If it has two end_point values (like "k" does) that it should be two lists.
I have some code but it's not written with this approach, it drops all rows from df1 if df1['side_a'] == df2['end_points']
and that causes certain problems. But if someone wants me to post the code I will, of course.
The desired output would be something like this:
side_a end_point
a [b, c]
b [c]
c [c]
k [l, m]
k [l, n]
l [m]
l [n]
p [q, r, s]
q [r, s]
r [s]
And one more thing, if there is the same both side, that point doesn't need to be listed at all, I can append it later, whatever it's easier.
import pandas as pd
import numpy as np
import itertools
def get_child_list(df, parent_id):
list_of_children = []
list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
for c_, r_ in df[df['side_a'] == parent_id].iterrows():
if r_['side_b'] != parent_id:
list_of_children.append(get_child_list(df, r_['side_b']))
# to flatten the list
list_of_children = [item for sublist in list_of_children for item in sublist]
return list_of_children
new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
temp_df['side_a'] = row['side_a']
new_df = new_df.append(temp_df)
So, the problem with this code is that works if I drop rows where side_a is equal to end_point from df2. I don't know how to implement condition that if catch the df2 in side_b column, then stop, don't go further.
Any help or hint is welcomed here, truly. Thanks in advance.
You can use networkx library and graphs:
import networkx as nx
G = nx.from_pandas_edgelist(df, source='side_a',target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)
Output:
side_a end_point
0 a [b, c]
1 b [c]
2 c []
3 k [l, m]
4 k [l, n]
5 l [m]
6 l [n]
7 p [q, r, s]
8 q [r, s]
9 r [s]
Your rules are inconsistent and your definitions are unclear so you may need to add some constraints here and there because it is unclear exactly what you are asking. By organizing the data-structure to fit the problem and building a more robust function for traversal (shown below) it will be easier to add/edit constraints as needed - and solve the problem completely.
This problem is a lot simpler if you transform the data structure to be more intuitive to the problem, instead of trying to solve the problem in the context of the current structure.
## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})
## Instantiate blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii : set() for ii in all_items}
## Populate the tree with each row
for idx, row in df.iterrows():
tree[row['side_a']] = set(list(tree[row['side_a']]) + list(row['side_b']))
This is much more straightforward now that the data structure is intuitive. Any standard Depth-First-Search algorithm w/ path saving will do the trick. I modified the one in the link to work with this example.
Edit: Reading again it looks you have a condition for search termination in endpoint
(you need to be more clear in your question what is input and what is output). You can adjust dfs_path(tree,**target**, root)
and change the termination condition to return only the correct paths.
## Standard DFS pathfinder
def dfs_paths(tree, root):
stack = [(root, [root])]
while stack:
(node, path) = stack.pop()
for nextNode in tree[node] - set(path):
# Termination condition.
### I set it to terminate search at the end of each path.
### You can edit the termination condition to fit the
### constraints of your goal
if not tree[nextNode]:
yield set(list(path) + list(nextNode)) - set(root)
else:
stack.append((nextNode, path + [nextNode]))
If you're not super comfortable with generators, you can structure the DFS traversal so that it outputs in a list. instead of a generator
set_a = []
end_points = []
gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items]
for gen in gen_dict:
for row in list(gen.values()).pop():
set_a.append(list(gen.keys()).pop())
end_points.append(row)
## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')
df_2[['set_a','end_points']]
set_a end_points
a {b, c, d}
b {c, d}
c {d}
k {n, l}
k {m, l}
l {n}
l {m}
p {s, r, q}
q {s, r}
r {s}
If you're OK with an extra import, this can be posed as a path problem on a graph and solved in a handful of lines using NetworkX:
import networkx
g = networkx.DiGraph(zip(df1.side_a, df1.side_b))
outdf = df2.apply(lambda row: [row.side_a,
set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point)) - {row.side_a}],
axis=1)
outdf
looks like this. Note that this contains sets instead of lists as in your desired output - this allows all the paths to be combined in a simple way.
side_a end_point
0 a {c, b}
1 b {c}
2 c {}
3 k {l, m}
4 k {l, n}
5 l {m}
6 l {n}
7 p {r, q, s}
8 q {r, s}
9 r {s}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With