Creating a Sankey Diagram using NetworkD3 package in R

I am trying to create an interactive Sankey diagram with the networkD3 package, following the instructions by Christopher Gandrud (https://christophergandrud.github.io/networkD3/).
What I don't understand is the table format, since his example uses just two columns to visualise several transitions. To be more specific, I have a dataset with four columns that represent four years. Each column contains hotel names, and each row represents one customer who is "tracked" over these four years.

    URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
    Energy <- jsonlite::fromJSON(URL)

    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
         Target = "target", Value = "value", NodeID = "name",
         units = "TWh", fontSize = 12, nodeWidth = 30)

To give you an overview of my data here is a screenshot:

[Image: SampleDataScreenshot]

I would give you more "coded" information, but since I am very new to R, I hope you can follow my train of thought on this problem. If not, please do not hesitate to ask.

Thank you :)

Phipsy asked May 23 '17 10:05


1 Answer

You need two data frames: one listing all nodes (containing their names) and one listing the links. The links data frame has three columns: the source node, the target node, and a value that sets the strength (width) of the link. In the links data frame you refer to each node by its zero-based row position in the nodes data frame.
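As a minimal sketch of that layout (the hotel names and counts here are made up purely for illustration), three nodes and two links could look like this:

```r
library(networkD3)

# Nodes: one row per node; the row order defines the zero-based index
# (row 1 is node 0, row 2 is node 1, and so on).
nodes <- data.frame(name = c("Hotel A", "Hotel B", "Hotel C"),
                    stringsAsFactors = FALSE)

# Links: source/target are zero-based row positions in `nodes`,
# value is the width of the link (made-up customer counts).
links <- data.frame(source = c(0, 0),
                    target = c(1, 2),
                    value  = c(30, 10))

sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name")
```

Here 30 customers move from "Hotel A" to "Hotel B" and 10 move from "Hotel A" to "Hotel C".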

Assuming your data looks like this:

df <- data.frame(Year1=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year2=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year3=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year4=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)

For the diagram you need to distinguish not only between the hotels but between the hotel/year combinations, since each combination should be its own node:

df$Year1 <- paste0("Year1_", df$Year1)
df$Year2 <- paste0("Year2_", df$Year2)
df$Year3 <- paste0("Year3_", df$Year3)
df$Year4 <- paste0("Year4_", df$Year4)

The links are the "transitions" between hotels from one year to the next, counted per source/target pair:

library(dplyr)
trans1_2 <- df %>% group_by(Year1, Year2) %>% summarise(sum=n())
trans2_3 <- df %>% group_by(Year2, Year3) %>% summarise(sum=n())
trans3_4 <- df %>% group_by(Year3, Year4) %>% summarise(sum=n())

colnames(trans1_2)[1:2] <- colnames(trans2_3)[1:2] <- colnames(trans3_4)[1:2] <- c("source","target")

links <- rbind(as.data.frame(trans1_2), 
               as.data.frame(trans2_3), 
               as.data.frame(trans3_4))

Finally, the two data frames need to be linked: the node names in the links are replaced by their zero-based positions in the nodes data frame:

nodes <- data.frame(name=unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
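To see what the `match(...) - 1` step does in isolation, here is a small sketch with made-up node names (the vectors `node_names` and `wanted` are hypothetical, not part of the data above):

```r
# Hypothetical node names, in the order they would appear in the nodes data frame
node_names <- c("Year1_Hotel1", "Year1_Hotel2", "Year2_Hotel1")

# Names we want to translate into zero-based positions
wanted <- c("Year1_Hotel2", "Year2_Hotel1")

# match() returns one-based positions (2 and 3 here);
# subtracting 1 gives the zero-based indices the links data frame needs
idx <- match(wanted, node_names) - 1
idx
# 1 2
```

A name that does not occur in `node_names` would give `NA`, so checking `anyNA()` on the resulting columns is a cheap sanity test before drawing the diagram.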

Then the diagram can be drawn:

library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "sum", NodeID = "name",
              fontSize = 12, nodeWidth = 30)

There might be more elegant solutions, but this could be a starting point for your problem. If you don't like the "Year..." prefix in the node names, you can remove it after setting up the data frames.
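One possible variant, sketched here with base R instead of dplyr, loops over adjacent columns so it works for any number of years, and strips the "Year..." prefix once the indices are fixed. The example data, the `labelled` helper, and the regular expression are assumptions for illustration; the prefix pattern relies on the columns being named `Year1`, `Year2`, and so on.

```r
library(networkD3)

# Made-up example data in the same shape as `df` above
set.seed(1)
df <- data.frame(Year1 = sample(paste0("Hotel", 1:4), 100, replace = TRUE),
                 Year2 = sample(paste0("Hotel", 1:4), 100, replace = TRUE),
                 Year3 = sample(paste0("Hotel", 1:4), 100, replace = TRUE),
                 stringsAsFactors = FALSE)

# Prefix every value with its column name so each hotel/year pair is one node
labelled <- df
for (col in names(labelled)) {
  labelled[[col]] <- paste(col, labelled[[col]], sep = "_")
}

# Count the transitions between each pair of adjacent columns
links <- do.call(rbind, lapply(seq_len(ncol(labelled) - 1), function(i) {
  counts <- as.data.frame(table(source = labelled[[i]],
                                target = labelled[[i + 1]]),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]          # drop transitions that never occur
}))
names(links)[names(links) == "Freq"] <- "value"

# Replace node names in the links by zero-based positions in `nodes`
nodes <- data.frame(name = unique(c(links$source, links$target)),
                    stringsAsFactors = FALSE)
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

# Strip the "YearN_" prefix for display only; the links already use indices,
# so repeated labels do not affect which nodes are connected
nodes$name <- sub("^Year[0-9]+_", "", nodes$name)

sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name",
              fontSize = 12, nodeWidth = 30)
```

The prefix has to be removed after the `match()` step, because before that the names are the only thing that tells the same hotel in different years apart.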

scheddy answered Oct 09 '22 13:10