I have the following dataframe:
col1 col2
basic c
c c++
c++ java
ruby
php
java python
python
r
c#
I want to create new columns based on a pattern followed in dataframe.
For example, in the above dataframe basic->c->c++->java->python order can be observed from col1 and col2.
Logic:
col1 value basic has c value in col2, similarly c value in col1 corresponds to c++ in col2, c++ leads to java in col2 and finally java to python in col2.
Remaining values in "col1" which have corresponding blanks in col2 would be left blank in newly created columns as well. (that is, we consider only the values in "col1" which do not have blanks in col2).
So my output dataframe would be:
col1 col2 new_col1 new_col2 new_col3 new_col4
0 basic c c c++ java python
1 c c++ c++ java python
2 c++ java java python
3 ruby
4 php
5 java python python
6 python
7 r
8 c
Thanks !
This can be solved through graph theory analysis. It looks like you want to obtain all successors starting from each of the nodes in col2. For that we need to first build a directed graph using the columns col1 and col2. We can use networkX for that, and build a nx.DiGraph from the dataframe using nx.from_pandas_edgelist:
import networkx as nx
m = df.ne('').all(1)
G = nx.from_pandas_edgelist(df[m],
source='col1',
target='col2',
create_using=nx.DiGraph())
And then we can iterate over the nodes in col2, and search for all successors starting from that node. For that we can use dfs_tree, which will traverse the graph in search for the successors with a depth-first-search from source:
all_successors = [list(nx.dfs_tree(G, node)) for node in df.loc[m,'col2']]
Now we can assign back the list of longest paths with:
out = (df.assign(
**pd.DataFrame(all_successors, index=df[m].index)
.reindex(df.index)
.fillna('')
.add_prefix('new_col')))
print(out)
col1 col2 new_col0 new_col1 new_col2 new_col3
0 basic c c c++ java python
1 c c++ c++ java python
2 c++ java java python
3 ruby
4 php
5 java python python
6 python
7 r
8 c
To better explain this approach, consider instead this slightly different network, with an additional component:

As mentioned what we want here is a list of successors for each of the nodes that we have in Col2. For these problems there are several graph searching algorithms, which can be used to explore the branches of a graph starting from a given node. For that we can use the depth first search based functions available in nx.algorithms.traversal. In this case we want nx.dfs_tree, which returns an oriented tree constructed through a depth-first-search starting from a specified node.
Here are some examples:
list(nx.dfs_tree(G, 'c++'))
# ['c++', 'java', 'python', 'julia']
list(nx.dfs_tree(G, 'python'))
# ['python', 'julia']
list(nx.dfs_tree(G, 'basic'))
# ['basic', 'php', 'sql']
Note that this could get quite tricker in the case of having cycles within the graph. Say there is for instance an edge between c++ and scala. In this case it becomes unclear which should be the path to choose. One way could be traversing all respective paths with nx.dfs_tree and keeping the one of interest predefining some logic, such as keeping the longest. Though it doesn't seem like this is the case in this problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With