Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create new columns based on a tree like pattern

I have the following dataframe:

col1      col2               
basic      c                   
c          c++                 
c++        java                
ruby                                     
php                                      
java       python              
python                                   
r                                        
c#                                      

I want to create new columns based on a pattern followed in dataframe.
For example, in the above dataframe basic->c->c++->java->python order can be observed from col1 and col2.

Logic:

col1 value basic has c value in col2, similarly c value in col1 corresponds to c++ in col2, c++ leads to java in col2 and finally java to python in col2.
Remaining values in "col1" which have corresponding blanks in col2 would be left blank in newly created columns as well. (that is, we consider only the values in "col1" which do not have blanks in col2).

So my output dataframe would be:

     col1     col2   new_col1  new_col2   new_col3   new_col4   
0   basic       c          c        c++       java     python
1       c     c++        c++       java     python         
2     c++    java       java     python                  
3    ruby                                              
4     php                                              
5    java  python     python                           
6  python                                             
7       r                                               
8       c               

Thanks !

like image 236
pc_pyr Avatar asked Jun 18 '26 20:06

pc_pyr


1 Answers

This can be solved through graph theory analysis. It looks like you want to obtain all successors starting from each of the nodes in col2. For that we need to first build a directed graph using the columns col1 and col2. We can use networkX for that, and build a nx.DiGraph from the dataframe using nx.from_pandas_edgelist:

import networkx as nx

m = df.ne('').all(1)
G = nx.from_pandas_edgelist(df[m], 
                            source='col1', 
                            target='col2', 
                            create_using=nx.DiGraph())

And then we can iterate over the nodes in col2, and search for all successors starting from that node. For that we can use dfs_tree, which will traverse the graph in search for the successors with a depth-first-search from source:

all_successors = [list(nx.dfs_tree(G, node)) for node in df.loc[m,'col2']]

Now we can assign back the list of longest paths with:

out = (df.assign(
         **pd.DataFrame(all_successors, index=df[m].index)
         .reindex(df.index)
         .fillna('')
         .add_prefix('new_col')))

print(out)

     col1    col2   new_col0   new_col1   new_col2   new_col3
0   basic       c          c        c++       java     python
1       c     c++        c++       java     python         
2     c++    java       java     python                  
3    ruby                                              
4     php                                              
5    java  python     python                           
6  python                                             
7       r                                               
8       c                  

To better explain this approach, consider instead this slightly different network, with an additional component:

enter image description here

As mentioned what we want here is a list of successors for each of the nodes that we have in Col2. For these problems there are several graph searching algorithms, which can be used to explore the branches of a graph starting from a given node. For that we can use the depth first search based functions available in nx.algorithms.traversal. In this case we want nx.dfs_tree, which returns an oriented tree constructed through a depth-first-search starting from a specified node.

Here are some examples:

list(nx.dfs_tree(G, 'c++'))
# ['c++', 'java', 'python', 'julia']

list(nx.dfs_tree(G, 'python'))
# ['python', 'julia']

list(nx.dfs_tree(G, 'basic'))
# ['basic', 'php', 'sql']

Note that this could get quite tricker in the case of having cycles within the graph. Say there is for instance an edge between c++ and scala. In this case it becomes unclear which should be the path to choose. One way could be traversing all respective paths with nx.dfs_tree and keeping the one of interest predefining some logic, such as keeping the longest. Though it doesn't seem like this is the case in this problem.

like image 168
yatu Avatar answered Jun 20 '26 10:06

yatu



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!