
Reorder Sankey diagram vertically based on label value

I'm trying to plot patient flows between 3 clusters in a Sankey diagram. I have a pd.DataFrame, counts, with from-to values (see below). To reproduce this DataFrame, here is the counts dict that should be loaded into a pd.DataFrame, which is the input for the visualize_cluster_flow_counts function.

    from    to      value
0   C1_1    C1_2    867
1   C1_1    C2_2    405
2   C1_1    C0_2    2
3   C2_1    C1_2    46
4   C2_1    C2_2    458
... ... ... ...
175 C0_20   C0_21   130
176 C0_20   C2_21   1
177 C2_20   C1_21   12
178 C2_20   C0_21   0
179 C2_20   C2_21   96
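
As a minimal sketch for anyone without the full dict, the five leading rows printed above can be loaded like this. The name counts_sample is hypothetical, and the sample covers only the day 1 to day 2 transitions shown in the printout, not the full 180 rows.

```python
import pandas as pd

# Only the first five rows shown in the printout above; the real
# `counts` DataFrame has 180 rows covering days 1 through 21.
counts_sample = pd.DataFrame({
    "from":  ["C1_1", "C1_1", "C1_1", "C2_1", "C2_1"],
    "to":    ["C1_2", "C2_2", "C0_2", "C1_2", "C2_2"],
    "value": [867, 405, 2, 46, 458],
})

print(counts_sample)
```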

The from and to values in the DataFrame encode the cluster number (0, 1, or 2) and the day number for the x-axis (between 1 and 21). If I plot the Sankey diagram with these values, this is the result:

[image: sankey plot]

Code:

import plotly.graph_objects as go

def visualize_cluster_flow_counts(counts):
    all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))            
    
    froms, tos, vals, labs = [], [], [], []
    for index, row in counts.iterrows():
        froms.append(all_sources.index(row.values[0]))
        tos.append(all_sources.index(row.values[1]))
        vals.append(row[2])
        labs.append(row[3])
                
    fig = go.Figure(data=[go.Sankey(
        arrangement='snap',
        node = dict(
          pad = 15,
          thickness = 5,
          line = dict(color = "black", width = 0.1),
          label = all_sources,
          color = "blue"
        ),
        link = dict(
          source = froms,
          target = tos,
          value = vals,
          label = labs
      ))])

    fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
    fig.show()

visualize_cluster_flow_counts(counts)

However, I would like to order the bars vertically so that the C0's are always on top, the C1's are always in the middle, and the C2's are always at the bottom (or the other way around, it doesn't matter). I know that we can set node.x and node.y to assign the coordinates manually. So I set each x-value to the day number * (1 / number of days), which gives an increment of about 0.048, and I set each y-value based on the cluster: 0, 0.5, or 1. I then obtain the image below. The vertical order is good, but the vertical margins between the bars are obviously way off; they should be similar to the first result.

[image: Sankey diagram with manually assigned node coordinates]

The code to produce this is:

import plotly.graph_objects as go

def find_node_coordinates(sources):
    x_nodes, y_nodes = [], []
    
    for s in sources:
        # Scale the day number by 1/21 (steps of about 0.048)
        x = float(s.split("_")[-1]) * (1/21)
        x_nodes.append(x)
        
        # Choose either 0, 0.5 or 1 for the y-value
        cluster_number = s[1]
        if cluster_number == "0": y = 1
        elif cluster_number == "1": y = 0.5
        else: y = 1e-09
        
        y_nodes.append(y)
                
    return x_nodes, y_nodes


def visualize_cluster_flow_counts(counts):
    all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))    
        
    node_x, node_y = find_node_coordinates(all_sources)
    
    froms, tos, vals, labs = [], [], [], []
    for index, row in counts.iterrows():
        froms.append(all_sources.index(row.values[0]))
        tos.append(all_sources.index(row.values[1]))
        vals.append(row[2])
        labs.append(row[3])
                
    fig = go.Figure(data=[go.Sankey(
        arrangement='snap',
        node = dict(
          pad = 15,
          thickness = 5,
          line = dict(color = "black", width = 0.1),
          label = all_sources,
          color = "blue",
          x = node_x,
          y = node_y,
        ),
        link = dict(
          source = froms,
          target = tos,
          value = vals,
          label = labs
      ))])

    fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
    fig.show()
    
    
visualize_cluster_flow_counts(counts)

Question: how do I fix the margins of the bars, so that the result looks like the first result? So, for clarity: the bars should be pushed to the bottom. Or is there another way that the Sankey diagram can vertically re-order the bars automatically based on the label value?

asked May 14 '21 08:05 by sandertjuh


1 Answer

First, I don't think the currently exposed API offers a smooth way to achieve your goal; you can check the source code here.

Try changing your find_node_coordinates function as follows (note that you should pass the counts DataFrame to it as well, i.e. call find_node_coordinates(all_sources, counts)):

import pandas as pd

# counts_dict is the dict from the question (from/to/value columns)
counts = pd.DataFrame(counts_dict)

def find_node_coordinates(sources, counts):
    x_nodes, y_nodes = [], []

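    flat_on_top = False  # True: stack nodes from y = 0; False: stack them from y = 1
    total_margin_width = 0.15  # share of the y range reserved for gaps between nodes
    y_range = 1 - total_margin_width  # share of the y range left for the nodes themselves
    margin = total_margin_width / 2  # two gaps between the three clusters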
    srcs = counts['from'].values.tolist()
    dsts = counts['to'].values.tolist() 
    values = counts['value'].values.tolist() 
    max_acc = 0

    def _calc_day_flux(d=1):
        _max_acc = 0 
        for i in [0,1,2]:
            # The first ones
            from_source = 'C{}_{}'.format(i,d) 
            indices = [i for i, val in enumerate(srcs) if val == from_source]
            for j in indices: 
                _max_acc += values[j]
        
        return _max_acc

    def _calc_node_io_flux(node_str): 
        c,d = int(node_str.split('_')[0][-1]), int(node_str.split('_')[1])
        _flux_src = 0 
        _flux_dst = 0 

        indices_src = [i for i, val in enumerate(srcs) if val == node_str]
        indices_dst = [j for j, val in enumerate(dsts) if val == node_str]
        for j in indices_src: 
            _flux_src += values[j]
        for j in indices_dst: 
            _flux_dst += values[j]

        return max(_flux_dst, _flux_src) 

    max_acc = _calc_day_flux() 
    graph_unit_per_val = y_range / max_acc
    print("Graph Unit per Acc Val", graph_unit_per_val) 
 
    
    for s in sources:
        # Scale the day number by 1/21 (steps of about 0.048)
        d = int(s.split("_")[-1])
        x = float(d) * (1/21)
        x_nodes.append(x)
        
        print(s, _calc_node_io_flux(s))
        # Compute the y-value of the node center from the fluxes of the nodes stacked before it
        cluster_number = s[1]

        
        # Flat on Top
        if flat_on_top: 
            if cluster_number == "0": 
              y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val + margin +  _calc_node_io_flux('C{}_{}'.format(0, d))*graph_unit_per_val/2
            elif cluster_number == "1": y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin +  _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val/2
            else: y = 1e-09
        # Flat On Bottom
        else: 
            if cluster_number == "0": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val / 2)
            elif cluster_number == "1": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val /2 )
            elif cluster_number == "2": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(2,d)) * graph_unit_per_val /2 )
            
        y_nodes.append(y)
                
    return x_nodes, y_nodes

Sankey diagrams are supposed to weight each link's width by its normalized value, right? I do the same here: first I calculate the flux of each node, then I compute the normalized coordinate of each node's center from those fluxes.
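
To make the arithmetic concrete, here is a minimal standalone sketch of the stacking step for a single day column. The helper name stacked_centers and its parameters are hypothetical, not part of Plotly or of the function above; the defaults mirror y_range = 0.85 and margin = 0.075 from the code. Each node gets a height proportional to its flux, and its center is the running offset plus half that height. Unlike the function above, the sketch also gives the first node a half-height center instead of pinning it to 1e-09.

```python
def stacked_centers(fluxes, y_range=0.85, margin=0.075, total=None):
    """Return normalized y centers for nodes stacked one after another.

    fluxes: node fluxes in stacking order, starting at y = 0.
    total:  the flux that should span y_range; defaults to sum(fluxes).
            The function above uses the total outflow of day 1 instead.
    """
    total = sum(fluxes) if total is None else total
    unit = y_range / total  # graph units per unit of flux
    centers, offset = [], 0.0
    for flux in fluxes:
        height = flux * unit
        centers.append(offset + height / 2)  # center of the current node
        offset += height + margin            # start of the next node
    return centers


# Three nodes with fluxes 1, 1 and 2, using round numbers for readability
centers = stacked_centers([1, 1, 2], y_range=0.8, margin=0.1)
print(centers)
```

Stacking from the other end of the axis, as in the flat_on_top = False branch, amounts to using 1 minus each center with the fluxes listed in the opposite order.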

Here is sample output of your code with the modified function. Note that I tried to adhere to your code as much as possible, so it's a bit unoptimized (for example, one could store the fluxes of the nodes above each source node to avoid recalculating them).

With flag flat_on_top = True:

[image: Sankey diagram with flat_on_top = True]

With flag flat_on_top = False:

[image: Sankey diagram with flat_on_top = False]

There is a slight inconsistency in the flat_on_top = False (flat on bottom) version, which I think is caused by the padding or other internals of Plotly.
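
The optimization mentioned earlier (storing node fluxes instead of recomputing them for every node) could look roughly like the sketch below. It assumes the counts DataFrame has the from, to and value columns shown in the question; the helper name node_fluxes and the tiny demo frame are hypothetical. It computes the same quantity as _calc_node_io_flux, the larger of a node's total inflow and outflow, but for all nodes at once.

```python
import pandas as pd


def node_fluxes(counts):
    """Map each node label to max(total outflow, total inflow)."""
    out_flux = counts.groupby("from")["value"].sum()
    in_flux = counts.groupby("to")["value"].sum()
    # Align both series on the union of labels; a node that never appears
    # on one side gets 0 for that side.
    both = pd.concat([out_flux, in_flux], axis=1, keys=["out", "in"]).fillna(0)
    return both.max(axis=1).to_dict()


# Demo using the first five rows printed in the question
demo = pd.DataFrame({
    "from":  ["C1_1", "C1_1", "C1_1", "C2_1", "C2_1"],
    "to":    ["C1_2", "C2_2", "C0_2", "C1_2", "C2_2"],
    "value": [867, 405, 2, 46, 458],
})
flux = node_fluxes(demo)
print(flux)
```

Inside find_node_coordinates, the dict would be built once before the loop and looked up with flux.get(label, 0) in place of each _calc_node_io_flux call.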

answered Oct 09 '22 22:10 by Parsa Rahimi