Good evening,
I am trying to analyse the forementioned data(edgelist or pajek format). First thought was R-project with igraph package. But memory limitations(6GB) wont do the trick. Will a 128GB PC be able to handle the data? Are there any alternatives that don't require whole graph in RAM?
Thanks in advance.
P.S: I have found several programs but I would like to hear some pro(yeah, that's you) opinions on the matter.
If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtablulate package so that
foreach
Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.
set.seed(1)
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N,M,replace=TRUE),
sample(1:N,M,replace=TRUE))
colnames(edgelist) <- c("sender","receiver")
write.table(edgelist,file="edgelist-small.csv",sep=",",
row.names=FALSE,col.names=FALSE)
I next concatenate this file 10 times to make the example a bit bigger.
system("
for i in $(seq 1 10)
do
cat edgelist-small.csv >> edgelist.csv
done")
Next we load the bigtabulate
package and read in the text file with our edgelist. The command read.big.matrix()
creates a file-backed object in R.
library(bigtabulate)
x <- read.big.matrix("edgelist.csv", header = FALSE,
type = "integer",sep = ",",
backingfile = "edgelist.bin",
descriptor = "edgelist.desc")
nrow(x) # 1e7 as expected
We can compute the outdegrees by using bigtable()
on the first column.
outdegree <- bigtable(x,1)
head(outdegree)
Quick sanity check to make sure table is working as expected:
# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1])) # get name of first node
all.equal(as.numeric(outdegree[1]), # outdegree's answer
sum(x[,1]==j)) # manual outdegree count
To get indegree, just do bigtable(x,2)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With