
Social graph analysis. 60GB and 100 million nodes

Good evening,

I am trying to analyse the aforementioned data (edge list or Pajek format). My first thought was R with the igraph package, but my machine's 6GB of memory won't do the trick. Will a 128GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?

Thanks in advance.

P.S.: I have found several programs, but I would like to hear some pro (yeah, that's you) opinions on the matter.

Giannis H. asked Mar 10 '12 13:03


1 Answer

If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package so that

  1. your R objects are file backed so that you aren't limited by RAM
  2. you can parallelize the degree computation using foreach
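The second point works because degree counts are additive: each chunk of the edge list can be counted on its own and the partial counts summed afterwards. Here is a minimal base R sketch of that idea on made-up toy data. It runs sequentially with lapply(), which only stands in for a foreach loop, and all variable names are illustrative rather than part of any package.

```r
# Toy data (hypothetical): 1000 edges whose senders are among 100 nodes.
set.seed(1)
n_nodes <- 100
n_edges <- 1000
senders <- sample(seq_len(n_nodes), n_edges, replace = TRUE)

# Split the sender column into 4 equally sized chunks.
chunk_id <- rep(1:4, each = n_edges / 4)
chunks <- split(senders, chunk_id)

# Count outdegree within each chunk. tabulate() returns one slot per
# node ID from 1 to n_nodes, so the per-chunk vectors line up.
partial_counts <- lapply(chunks, tabulate, nbins = n_nodes)

# Summing the per-chunk counts gives the overall outdegree per node.
outdeg <- Reduce(`+`, partial_counts)

# The chunked result matches a single pass over all edges.
identical(outdeg, tabulate(senders, nbins = n_nodes))
```

Since the chunks never interact until the final sum, the lapply() step is the part that could be handed to parallel workers.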

Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.

set.seed(1)
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N,M,replace=TRUE),
                  sample(1:N,M,replace=TRUE))
colnames(edgelist) <- c("sender","receiver")
write.table(edgelist,file="edgelist-small.csv",sep=",",
            row.names=FALSE,col.names=FALSE)

I next concatenate this file 10 times to make the example a bit bigger.

system("
for i in $(seq 1 10) 
do 
  cat edgelist-small.csv >> edgelist.csv 
done")
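The shell loop above assumes a Unix-like shell. If that is not available, the same concatenation can probably be done from R itself with file.append(). The sketch below uses small temporary files as hypothetical stand-ins for edgelist-small.csv and edgelist.csv so that it can be run anywhere.

```r
# Hypothetical stand-ins for edgelist-small.csv and edgelist.csv.
small <- tempfile(fileext = ".csv")
big   <- tempfile(fileext = ".csv")

# A tiny three-edge file in the same "sender,receiver" layout.
writeLines(c("1,2", "2,3", "3,1"), small)

# Start from an empty target, then append the small file 10 times.
file.create(big)
for (i in 1:10) {
  file.append(big, small)
}

n_lines <- length(readLines(big))
n_lines  # 3 lines per copy, 10 copies
```

Creating the target file fresh before the loop also means that rerunning the code does not keep growing the file.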

Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() (from the bigmemory package, which bigtabulate loads) creates a file-backed object in R.

library(bigtabulate)
x <- read.big.matrix("edgelist.csv", header = FALSE, 
                     type = "integer",sep = ",", 
                     backingfile = "edgelist.bin", 
                     descriptor = "edgelist.desc")
nrow(x)  # 1e7 as expected

We can compute the outdegrees by using bigtable() on the first column.

outdegree <- bigtable(x,1)
head(outdegree)

Quick sanity check to make sure table is working as expected:

# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1]))  # get name of first node
all.equal(as.numeric(outdegree[1]),   # outdegree's answer
          sum(x[,1]==j))              # manual outdegree count

To get indegree, just do bigtable(x,2).
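Once you have per-node degree counts, the degree distribution is just a table of those counts. The sketch below shows this on a small in-memory edge list, with base R's table() standing in for bigtable(); the toy data and variable names are assumptions for illustration only.

```r
# Toy edge list (hypothetical): 200 edges among 50 nodes.
set.seed(1)
edgelist <- cbind(sender   = sample(1:50, 200, replace = TRUE),
                  receiver = sample(1:50, 200, replace = TRUE))

# Per-node outdegree; stands in for bigtable(x, 1) on the big matrix.
outdegree <- table(edgelist[, "sender"])

# Degree distribution: how many nodes have each outdegree value.
degree_dist <- table(as.vector(outdegree))
degree_dist
```

Note that nodes that never appear as a sender are missing from the counts, so the distribution has no entry for degree zero; if you need it, subtract the number of tabulated nodes from the total node count.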

Christopher DuBois answered Oct 17 '22 20:10