After never getting a reply from both authors who created the package I'll mention below, I thought that someone here could shed some light into this issue.
I'm working with a large dataset which includes Origin Destination pairs, and the respective passengers going from A to B. Both the Origin and Destination variables are coded using the IATA airport name (3 letters). The original csv files can be found here https://github.com/FilipeamTeixeira/network. Please note that all the 3 csv files are the same except for that one has the ORIGIN/DEST variable as characters, the other as numbers, and the 3rd as larger numbers. But for network purposes they are exactly the same as they provide the same amount of connections.
ORIGIN DEST weight
ABE ATL 1530
ABE AVP 6
ABE BDL 2
ABE BOS 1
ABE BWI 3
ABE CLT 1053
After importing the files, I create a new graph with a <- graph_from_data_frame(netchr, directed = TRUE)
.
Then, as I usually work with a large dataset, I use the disparity filter https://github.com/alessandrobessi/disparityfilter/blob/master/R/disparity_filter.R, to find the backbone structure of my network, and reducing the number of edges/nodes.
For that I run backbone(a)
.
Now the problem is that whenever the original data frame has the Origin and Destination variables as characters or as numbers with more than 4 digits, it returns 0. However when the original data frame has those 2 variables with 3 digits instead, it works perfectly fine and it returns some results.
Running the code below, provides a clear overview of the issue.
# Import network
# Imports csv
netchr <- read.csv("netchr.csv", header = TRUE,sep = “,”, stringsAsFactors = FALSE)
netnumber <- read.csv("netnum.csv", header = TRUE, sep = “,”, stringsAsFactors = FALSE)
netnumber2 <- read.csv("netnum2.csv", header = TRUE, sep = “,”, stringsAsFactors = FALSE)
# Load igraph and dispfilter
library(igraph)
library(disparityfilter)
a <- graph_from_data_frame(netchr, directed = TRUE)
b <- graph_from_data_frame(netnumber, directed = TRUE)
c <- graph_from_data_frame(netnumber2, directed = TRUE)
# Create backbone network
backbone(a) # finds 0
backbone(b) # has results
backbone(c) # finds 0
I'm really struggling to understand what might be happening as even when iGraph creates a graph, it converts the nodes to characters, so logically everything should be the same in the end.
The problem stems from a bug in the disparityfilter package. The internals of the disparity_filter
function, which is called by backbone()
, involve comparing the names of nodes to the indices of nodes (a bug) and in consequence the function only works when node names happen to equal node indices. To be clear, this means that in the generic case (case b in your example) the results are probably all wrong anyway -- although something is getting returned.
Comparing indices to names is the reason the function doesn't return anything with characters, but also the reason it doesn't return anything with big numbers: if the size of the numbers exceed the number of nodes in the network matches never occur.
I'll demonstrate, and then point to where the problem is in the code. I'll then quickly show how different the results are from the version of the code that currently exists and a quick 'fix' (ugly hack) that results in the 'correct' output (I'm scare-quoting correct as I don't really know what the output should be or how to test it).
OK, the link to your networks files is broken so I'll use some data from the igraphdata
package:
# Load the requisite libraries
library(igraph)
library(disparityfilter)
library(igraphdata)
# We'll use the enron email network (b/c cool)
data(enron)
# convert it to a df
df <- igraph::as_data_frame(enron, what = 'edges')
summary(df) # we see nodes numbered from 1:184
#> from to Time Reciptype
#> Min. : 1.0 Min. : 1 Length:125409 Length:125409
#> 1st Qu.: 64.0 1st Qu.: 64 Class :character Class :character
#> Median :108.0 Median :113 Mode :character Mode :character
#> Mean :105.4 Mean :108
#> 3rd Qu.:156.0 3rd Qu.:156
#> Max. :184.0 Max. :184
#> Topic LDC_topic
#> Min. :0.000 Min. :-1.000
#> 1st Qu.:1.000 1st Qu.: 0.000
#> Median :1.000 Median : 0.000
#> Mean :1.711 Mean : 2.572
#> 3rd Qu.:3.000 3rd Qu.: 1.000
#> Max. :3.000 Max. :32.000
# create a weights variable
df$weight <- df$Topic
Now lets create character and a large-number versions of the vertex names
# Create a char version of the nodes by appending 'char' to the number
dfchar <- df
dfchar$from <- paste0("char", dfchar$from)
dfchar$to <- paste0("char", dfchar$to)
# create a big num version
dfbnum <- df
dfbnum$from <- 1000 * dfbnum$from
dfbnum$to <- 1000 * dfbnum$to
Now we convert the data.frames back to graphs
# Now convert the DFs back to graphs
smallnum <- graph_from_data_frame(df, directed = TRUE)
chars <- graph_from_data_frame(dfchar, directed = TRUE)
bignum <- graph_from_data_frame(dfbnum, directed = TRUE)
Then we can run backbone()
across these three graphs to replicate what you found:
## Now we document what you found: namely the anomolous behavior of backbone
newbbs <- backbone(smallnum)
dim(newbbs)
#> [1] 231 4
newbbc <- backbone(chars)
dim(newbbc)
#> [1] 0 4
newbbb <- backbone(bignum)
dim(newbbb)
#> [1] 0 4
So just as you noted, even on other data the backbone()
function finds no matches unless nodes are labeled generically 1:N
.
OK so far I am replicating what you documented. How do we know its an indexing issue inside backbone()
? First let me show you what happens if we make the names of the nodes a little bigger than their indices:
# now to demonstrate the indexing issue quickly, lets increment
# the node names just a bit, and see what gets returned.
# create a medium num version
dfmnum <- df
dfmnum$from <- dfmnum$from + 90 #add about half the number of nodes to the name
dfmnum$to <- dfmnum$to + 90
# convert back to graph
midnum <- graph_from_data_frame(dfmnum)
bbmid <- backbone(midnum)
dim(bbmid)
#> [1] 28 4
As you can see this dramatically alters the performance of the function -- instead of finding 231 results we've found 28! The reason is that half the node names now have no matches to indices, and roughly half do -- so we are getting random (and totally incorrect) results.
The disparityfilter
package is on github and consists of the file disparity_filter.R which can be seen here. On line 58 the disparity_filter
function converts the graph supplied in backbone()
back to a dataframe. Let's do that with our 'character' version of the graph:
e <- igraph::as_data_frame(chars)[,1:2]
head(e)
from to
1 char25 char154
2 char25 char154
3 char30 char30
4 char30 char30
5 char30 char30
6 char30 char30
As we can see the from and to columns here have the names we assigned them. Then beginning on line 63, the disparity_filter()
function loops through the cases in which the degree (d
) is greater than 1 with the line for (u in which(d > 1))
. Then in the switch
statement on line 65 a series of comparisons are made between the names of the nodes and the index u:
w = switch(substr(mode, 1, 1),
a = which(e[, 1] == u | e[, 2] == u),
i = which(e[, 2] == u),
o = which(e[, 1] == u)
)
This is clearly incorrect and will only work if the name of a node happens to match its index. To be explicit, using our chars
version of the graph the first value of u
will be 1 corresponding to the node char25
. The value char25
exists in the vector e[,1]
but of course will never match to an index -- although this is what the authors presumably intend. The same problem is repeated below in the switch()
statement beginning on line 76. Since there are no matches, nothing gets returned when nodes have non-numeric names, or numeric names exceeding the number of nodes.
OK so what about the case where the nodes have names counting up from one? Lets examine what the values of u
and and e[,1]
would be. We'll use the version of the graph with vertex names that 'work':
d <- degree(smallnum)
which(d>1)
25 30 39 52 61 64 66 67 93 100 115 125 138 141 146 156 164 168 170
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
As we can see the numerical names of the vertices don't correspond to the indices anyway! So even in the case where something gets returned, we are simply getting back noise. A quick edit to see the difference; we'll rename the vertices so they correspond to their indices:
renamed <- set.vertex.attribute(smallnum, "name", value=1:length(V(smallnum)))
bbs_problem_revealed <- backbone(renamed)
dim(bbs_problem_revealed)
[1] 9 4
OK so now that the indices match the vertices we only get 9 observations back! Obviously something is off with the function. Is this new answer correct? Honestly I'm not certain, as I am not sure what the output is supposed to be or how I could validate it. Moreover, I would want to really redo the comparisons to match names to names if I were going to rely on the code.
Anyway, my advice would be to not use this function until the package authors get a chance to fix it. I will open a bug report on github.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With