Converting a 1.2GB list of edges into a sparse matrix

I have a 1.2GB list of edges from a graph in a text file. My Ubuntu PC has 8GB of RAM. Each line in the input looks like:

287111206 357850135

I would like to convert it into a sparse adjacency matrix and output that to a file.

Some statistics for my data:

Number of edges: around 62500000
Number of vertices: around 31250000

I asked much the same question before at https://stackoverflow.com/a/38667644/2179021 and got a great answer. The problem is that I can't get it to work.

I first tried np.loadtxt to load the file, but it was very slow and used a huge amount of memory. So instead I moved to pandas.read_csv, which is very fast but brought its own problems. This is my current code:

import pandas
import numpy as np
from scipy import sparse

data = pandas.read_csv("edges.txt", sep=" ", header= None, dtype=np.uint32)
A = data.as_matrix()
print type(A)
k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
rows,cols=k3.reshape(A.shape).T
M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
print type(M)

The problem is that the pandas dataframe data is huge and I am effectively making a copy of it in A, which is inefficient. However, things are even worse, as the code crashes with:

<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError

So my questions are:

  1. Can I avoid having both the 1.2GB pandas dataframe and the 1.2GB numpy array copy in memory?
  2. Is there some way to get the code to complete in 8GB of RAM?

You can reproduce a test input of the size I am trying to process with:

import random
#Number of edges, vertices
m = 62500000
n = m/2
for i in xrange(m):
    fromnode = str(random.randint(0, n-1)).zfill(9)
    tonode = str(random.randint(0, n-1)).zfill(9)
    print fromnode, tonode

Update

I have now tried a number of different approaches, all of which have failed. Here is a summary.

  1. Using igraph with g = Graph.Read_Ncol('edges.txt'). This uses a huge amount of RAM which crashes my computer.
  2. Using networkit with G= networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). This uses a huge amount of RAM which crashes my computer.
  3. The code above in this question but using np.loadtxt("edges.txt") instead of pandas. This uses a huge amount of RAM which crashes my computer.

I then wrote separate code which remapped all the vertex names to numbers from 1..|V|, where |V| is the total number of vertices. This should save the code that imports the edge list from having to build up a table that maps the vertex names. Using this I tried:

  1. Using this new remapped edge list file I used igraph again with g = Graph.Read_Edgelist("edges-contig.txt"). This now works although it takes 4GB of RAM (which is way more than the theoretical amount it should). However, there is no igraph function to write out a sparse adjacency matrix from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately this uses a huge amount of RAM which crashes my computer.
  2. Using the remapped edge list file I used networkit with G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works using less than the 4GB that igraph needs. networkit also comes with a function to write Matlab files (which is a form of sparse adjacency matrix that scipy can read). However networkit.graphio.writeMat(G,"test.mat") uses a huge amount of RAM which crashes my computer.

Finally sascha's answer below does complete but takes about 40 minutes.

asked Jul 31 '16 by graffe


1 Answer

In my answer I consider the case where the node ids are given as 9-character strings, with each character from [0-9A-Za-z]. These n node ids should be mapped onto the values [0, n-1] (which might not be necessary for your application, but is still of general interest).

The following considerations, which I'm sure you are aware of, are listed here for the sake of completeness:

  1. Memory is the bottleneck.
  2. There are around 10^8 strings in the file.
  3. A 9-character string + int32 value pair costs around 120 bytes in a dictionary, resulting in about 12GB of memory usage for the file.
  4. A string id from the file can be mapped onto an int64: there are 62 different characters -> they can be encoded with 6 bits, and there are 9 characters per string -> 6*9 = 54 < 64 bits. See also the toInt64() method further below.
  5. There are int64 + int32 = 12 bytes of "real" information per pair => ca. 1.2 GB could be enough; however, the cost of such a pair in a dictionary is around 60 bytes (around 6 GB of RAM needed).
  6. Creating many small objects (on the heap) results in a lot of memory overhead, so bundling these objects in arrays is advantageous. Interesting information about the memory used by Python objects can be found in this tutorial-style article. Interesting experiences with reducing memory usage are described in this blog entry.
  7. A Python list is out of the question as a data structure, and so is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).

Step 1: reading the file and mapping the strings to int64. It is a pain to let a np.array grow dynamically, so we assume we know the number of edges in the file (it would be nice to have it in a header, but it can also be deduced from the file size; see the sketch right after read_nodes below):

import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        nodes[cnt:cnt+2] = map(toInt64, line.split())  # use map(int, line.split()) for cases without letters
        cnt += 2
    return nodes
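
If the edge count is not given anywhere, here is a minimal sketch for deducing it with one cheap pass over the file (the helper name count_edges is mine, not part of the original code; for fixed-width 9-character ids one could instead simply divide the file size by the constant line length):

def count_edges(filename):
    """ count the number of lines (= edges) in the file in one pass """
    cnt = 0
    for line in open(filename, "r"):
        cnt += 1
    return cnt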

Step 2: converting the int64 values into values [0, n-1]:

Possibility A, needs 3*0.8GB:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes """
    nodes = read_nodes(filename, EDGE_CNT)
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)
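
As a tiny, self-contained illustration of what np.unique(..., return_inverse=True) returns (the values here are made up):

import numpy as np

a = np.array([287111206, 357850135, 287111206, 123456789], dtype=np.int64)
unique_ids, mapped = np.unique(a, return_inverse=True)
# unique_ids is array([123456789, 287111206, 357850135])
# mapped is array([1, 2, 1, 0]): every original value replaced by its index in unique_ids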

Possibility B, needs 2*0.8GB, but is somewhat slower:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes=read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id=np.searchsorted(unique_map, nodes[i]) # faster than bisect.bisect
        nodes[i]=node_id  
    return (len(unique_map), nodes)  
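
The Python-level loop above is slow. Here is a sketch of a chunked variation of my own (keeping the 2*0.8GB memory profile) that lets np.searchsorted process whole slices at once; the chunk size is an arbitrary choice:

def maps_to_ids_chunked(filename, EDGE_CNT, chunk=1000000):
    """ return number of different node ids, and the mapped nodes (chunked variant) """
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for start in xrange(0, len(nodes), chunk):
        # overwrite each slice in place; only chunk-sized temporaries are created
        nodes[start:start+chunk] = np.searchsorted(unique_map, nodes[start:start+chunk])
    return (len(unique_map), nodes)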

Step 3: putting it all into a coo_matrix:

from scipy import sparse

def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # only a view, not a copy
    cols = nodes[1::2]  # only a view, not a copy

    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))

Calling data_as_coo_matrix("data.txt", 62500000), the memory usage peaks at 2.5GB (with int32 instead of int64 only 1.5GB is needed). It took around 5 minutes on my machine, but my machine is pretty slow...
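
Since the question also asks for the matrix to be written to a file, here is a minimal usage sketch (the file names are placeholders; scipy.io.mmwrite writes the plain-text Matrix Market format, and the boolean data is cast to int8 first because Matrix Market has no boolean field):

from scipy import io

M = data_as_coo_matrix("data.txt", 62500000)
io.mmwrite("adjacency.mtx", M.astype(np.int8))  # readable again with io.mmread("adjacency.mtx")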

So what is different from your solution?

  1. I get only the unique values from np.unique (and not all the indices and the inverse), so some memory is saved; I can replace the old ids with the new ones in place.
  2. I have no experience with pandas so maybe there is some copying involved between pandas<->numpy data structures?

What is the difference from sascha's solution?

  1. There is no need to keep the list sorted the whole time - it is enough to sort once after all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted the whole time - you have to pay for this with at least a constant factor, even if the running time stays O(n log(n)). I assumed an add operation would be O(n), but as pointed out it is O(log(n)).

What is the difference to GrantJ's solution?

  1. The size of the resulting sparse matrix is NxN, where N is the number of different nodes, and not 2^54 x 2^54 (with very many empty rows and columns).

PS:
Here is my idea of how the 9-character string id can be mapped onto an int64 value, but I guess this function could become a bottleneck the way it is written and should be optimized.

def toInt64(string):
    # interpret the 9-character id as a base-62 number with digits 0-9, A-Z, a-z
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch) - ord('0')
        elif ch <= 'Z':
            res += ord(ch) - ord('A') + 10
        else:
            res += ord(ch) - ord('a') + 36
    return res
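
A quick sanity check of the encoding (the example values are mine):

assert toInt64("000000000") == 0
assert toInt64("000000001") == 1
assert toInt64("00000000A") == 10
assert toInt64("00000000a") == 36
assert toInt64("000000010") == 62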
answered by ead