Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

iGraph and disparityfilter package issues with characters and large numbers

Tags:

After never getting a reply from both authors who created the package I'll mention below, I thought that someone here could shed some light into this issue.

I'm working with a large dataset which includes Origin Destination pairs, and the respective passengers going from A to B. Both the Origin and Destination variables are coded using the IATA airport name (3 letters). The original csv files can be found here https://github.com/FilipeamTeixeira/network. Please note that all the 3 csv files are the same except for that one has the ORIGIN/DEST variable as characters, the other as numbers, and the 3rd as larger numbers. But for network purposes they are exactly the same as they provide the same amount of connections.

    ORIGIN  DEST    weight
     ABE    ATL     1530
     ABE    AVP     6
     ABE    BDL     2
     ABE    BOS     1
     ABE    BWI     3
     ABE    CLT     1053

After importing the files, I create a new graph with a <- graph_from_data_frame(netchr, directed = TRUE).

Then, as I usually work with a large dataset, I use the disparity filter https://github.com/alessandrobessi/disparityfilter/blob/master/R/disparity_filter.R, to find the backbone structure of my network, and reducing the number of edges/nodes.

For that I run backbone(a).

Now the problem is that whenever the original data frame has the Origin and Destination variables as characters or as numbers with more than 4 digits, it returns 0. However when the original data frame has those 2 variables with 3 digits instead, it works perfectly fine and it returns some results.

Running the code below, provides a clear overview of the issue.

# Import network
# Imports csv

netchr <- read.csv("netchr.csv", header = TRUE,sep = “,”, stringsAsFactors = FALSE)

netnumber <- read.csv("netnum.csv", header = TRUE, sep = “,”, stringsAsFactors = FALSE)

netnumber2 <- read.csv("netnum2.csv", header = TRUE, sep = “,”, stringsAsFactors = FALSE)

# Load igraph and dispfilter

library(igraph)
library(disparityfilter)

a <- graph_from_data_frame(netchr, directed = TRUE)

b <- graph_from_data_frame(netnumber, directed = TRUE)

c <- graph_from_data_frame(netnumber2, directed = TRUE)

# Create backbone network

backbone(a) # finds 0

backbone(b) # has results

backbone(c) # finds 0

I'm really struggling to understand what might be happening as even when iGraph creates a graph, it converts the nodes to characters, so logically everything should be the same in the end.

like image 531
FilipeTeixeira Avatar asked Apr 13 '17 08:04

FilipeTeixeira


1 Answers

The problem stems from a bug in the disparityfilter package. The internals of the disparity_filter function, which is called by backbone(), involve comparing the names of nodes to the indices of nodes (a bug) and in consequence the function only works when node names happen to equal node indices. To be clear, this means that in the generic case (case b in your example) the results are probably all wrong anyway -- although something is getting returned.

Comparing indices to names is the reason the function doesn't return anything with characters, but also the reason it doesn't return anything with big numbers: if the size of the numbers exceed the number of nodes in the network matches never occur.

I'll demonstrate, and then point to where the problem is in the code. I'll then quickly show how different the results are from the version of the code that currently exists and a quick 'fix' (ugly hack) that results in the 'correct' output (I'm scare-quoting correct as I don't really know what the output should be or how to test it).

Replicating your finding

OK, the link to your networks files is broken so I'll use some data from the igraphdata package:

# Load the requisite libraries
library(igraph)
library(disparityfilter)
library(igraphdata)

# We'll use the enron email network (b/c cool)
data(enron)

# convert it to a df
df <- igraph::as_data_frame(enron, what = 'edges')
summary(df) # we see nodes numbered from 1:184
#>       from             to          Time            Reciptype        
#>  Min.   :  1.0   Min.   :  1   Length:125409      Length:125409     
#>  1st Qu.: 64.0   1st Qu.: 64   Class :character   Class :character  
#>  Median :108.0   Median :113   Mode  :character   Mode  :character  
#>  Mean   :105.4   Mean   :108                                        
#>  3rd Qu.:156.0   3rd Qu.:156                                        
#>  Max.   :184.0   Max.   :184                                        
#>      Topic         LDC_topic     
#>  Min.   :0.000   Min.   :-1.000  
#>  1st Qu.:1.000   1st Qu.: 0.000  
#>  Median :1.000   Median : 0.000  
#>  Mean   :1.711   Mean   : 2.572  
#>  3rd Qu.:3.000   3rd Qu.: 1.000  
#>  Max.   :3.000   Max.   :32.000

# create a weights variable
df$weight <- df$Topic

Now lets create character and a large-number versions of the vertex names

# Create a char version of the nodes by appending 'char' to the number
dfchar <- df
dfchar$from <- paste0("char", dfchar$from)
dfchar$to <- paste0("char", dfchar$to)

# create a big num version
dfbnum <- df
dfbnum$from <- 1000 * dfbnum$from
dfbnum$to <- 1000 * dfbnum$to

Now we convert the data.frames back to graphs

# Now convert the DFs back to graphs
smallnum <- graph_from_data_frame(df, directed = TRUE)

chars <- graph_from_data_frame(dfchar, directed = TRUE)

bignum <- graph_from_data_frame(dfbnum, directed = TRUE)

Then we can run backbone() across these three graphs to replicate what you found:

## Now we document what you found: namely the anomolous behavior of backbone
newbbs <- backbone(smallnum)
dim(newbbs)
#> [1] 231   4

newbbc <- backbone(chars) 
dim(newbbc)
#> [1] 0 4

newbbb <- backbone(bignum)
dim(newbbb)
#> [1] 0 4

So just as you noted, even on other data the backbone() function finds no matches unless nodes are labeled generically 1:N.

Finding the problem

OK so far I am replicating what you documented. How do we know its an indexing issue inside backbone()? First let me show you what happens if we make the names of the nodes a little bigger than their indices:

# now to demonstrate the indexing issue quickly, lets increment
# the node names just a bit, and see what gets returned.
# create a medium num version
dfmnum <- df
dfmnum$from <- dfmnum$from + 90 #add about half the number of nodes to the name
dfmnum$to <- dfmnum$to + 90

# convert back to graph
midnum <- graph_from_data_frame(dfmnum)
bbmid <- backbone(midnum)
dim(bbmid)
#> [1] 28  4

As you can see this dramatically alters the performance of the function -- instead of finding 231 results we've found 28! The reason is that half the node names now have no matches to indices, and roughly half do -- so we are getting random (and totally incorrect) results.

Where is the problem?

The disparityfilter package is on github and consists of the file disparity_filter.R which can be seen here. On line 58 the disparity_filter function converts the graph supplied in backbone() back to a dataframe. Let's do that with our 'character' version of the graph:

e <- igraph::as_data_frame(chars)[,1:2]
head(e)
    from      to
1 char25 char154
2 char25 char154
3 char30  char30
4 char30  char30
5 char30  char30
6 char30  char30

As we can see the from and to columns here have the names we assigned them. Then beginning on line 63, the disparity_filter() function loops through the cases in which the degree (d) is greater than 1 with the line for (u in which(d > 1)). Then in the switch statement on line 65 a series of comparisons are made between the names of the nodes and the index u:

w = switch(substr(mode, 1, 1),
      a = which(e[, 1] == u | e[, 2] == u),
      i = which(e[, 2] == u),
      o = which(e[, 1] == u)
)

This is clearly incorrect and will only work if the name of a node happens to match its index. To be explicit, using our chars version of the graph the first value of u will be 1 corresponding to the node char25. The value char25 exists in the vector e[,1] but of course will never match to an index -- although this is what the authors presumably intend. The same problem is repeated below in the switch() statement beginning on line 76. Since there are no matches, nothing gets returned when nodes have non-numeric names, or numeric names exceeding the number of nodes.

How serious is the problem?

OK so what about the case where the nodes have names counting up from one? Lets examine what the values of u and and e[,1] would be. We'll use the version of the graph with vertex names that 'work':

d <- degree(smallnum)
which(d>1)
 25  30  39  52  61  64  66  67  93 100 115 125 138 141 146 156 164 168 170 
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 

As we can see the numerical names of the vertices don't correspond to the indices anyway! So even in the case where something gets returned, we are simply getting back noise. A quick edit to see the difference; we'll rename the vertices so they correspond to their indices:

renamed <- set.vertex.attribute(smallnum, "name", value=1:length(V(smallnum)))
bbs_problem_revealed <- backbone(renamed)
dim(bbs_problem_revealed)
[1] 9 4

OK so now that the indices match the vertices we only get 9 observations back! Obviously something is off with the function. Is this new answer correct? Honestly I'm not certain, as I am not sure what the output is supposed to be or how I could validate it. Moreover, I would want to really redo the comparisons to match names to names if I were going to rely on the code.

Anyway, my advice would be to not use this function until the package authors get a chance to fix it. I will open a bug report on github.

like image 133
gfgm Avatar answered Sep 25 '22 10:09

gfgm