
Working with big data in Python and NumPy, not enough RAM, how to save partial results on disk?

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx, and other useful libraries. I want to perform operations such as computing the pairwise distances between all of the points and clustering all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do: creating the pairwise distance matrix for 200k+ points takes a lot of memory.

Here comes the catch: I would really like to do this on crappy computers with low amounts of RAM.

Is there a feasible way for me to make this work despite the low RAM? That it will take much longer is really not a problem, as long as the time requirements don't go to infinity!

I would like to be able to put my algorithms to work and then come back an hour or five later and not find them stuck because they ran out of RAM! I would like to implement this in Python, and be able to use the numpy, scipy, sklearn, and networkx libraries. I would like to be able to calculate the pairwise distances between all my points, etc.

Is this feasible? And how would I go about it, what can I start to read up on?

Ekgren asked Apr 22 '13 14:04





1 Answer

Using numpy.memmap you create arrays directly mapped into a file:

    import numpy

    a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000, 1000))
    # here you will see a 762MB file created in your working directory

You can treat it as a conventional array: a += 1000.
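Because the memmap behaves like an ordinary array, a large result can be filled in block by block so that only one block sits in RAM at a time. The following is only a minimal sketch for the pairwise-distance case from the question: the function name `pairwise_to_memmap`, the file name `dist.mymemmap`, the block size, and the small random stand-in data are all assumptions made for illustration. Note that a full 200000 x 200000 float32 matrix would need 200000 * 200000 * 4 bytes = 160 GB of disk space, so check the free space first.

```python
import os
import tempfile

import numpy as np


def pairwise_to_memmap(X, path, block=256):
    """Fill an (n, n) float32 memmap with Euclidean distances, one row block at a time."""
    n = X.shape[0]
    D = np.memmap(path, dtype="float32", mode="w+", shape=(n, n))
    sq = (X * X).sum(axis=1)  # squared norms, reused for every block
    for start in range(0, n, block):
        stop = min(start + block, n)
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y for rows start..stop-1 against all rows
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * (X[start:stop] @ X.T)
        np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
        D[start:stop] = np.sqrt(d2)  # only this block is held in RAM
    D.flush()  # push pending writes to the file
    return D


# Small stand-in data (hypothetical); the real data would be (200000, 1000).
rng = np.random.default_rng(0)
X = rng.random((500, 20))
path = os.path.join(tempfile.mkdtemp(), "dist.mymemmap")
D = pairwise_to_memmap(X, path, block=128)
print(D.shape)
```

The input `X` could itself be a memmap, in which case each block reads the data from disk again; that is slower but keeps the memory use bounded by the block size.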

It is even possible to map more than one array onto the same file and control it from multiple sources if needed. But I've experienced some tricky things here. To open the full array again you have to "close" the previous one first, using del:

    del a
    b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000, 1000))

But opening only part of the array makes simultaneous control possible (here a is assumed to be still open, i.e. it has not been deleted):

    b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2, 1000))
    b[1, 5] = 123456.
    print(a[1, 5])
    # 123456.0

Great! a was changed together with b, and the change goes to the file on disk (call b.flush() if you want to force the write right away).
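To check that a change made through one view really ends up in the file, the mappings can be flushed, dropped, and the file reopened. This is just a small sketch under assumptions: it uses a much smaller array than above and a temporary directory, and the variable names `seen_through_a` and `persisted` are only illustrative.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "test.mymemmap")

a = np.memmap(path, dtype="float32", mode="w+", shape=(1000, 10))
b = np.memmap(path, dtype="float32", mode="r+", shape=(2, 10))  # first two rows only

b[1, 5] = 123456.0
seen_through_a = float(a[1, 5])  # both views map the same file

b.flush()  # ask the OS to write pending changes to the file
del a, b   # drop both mappings

# Reopen read-only to confirm the value was persisted.
c = np.memmap(path, dtype="float32", mode="r", shape=(1000, 10))
persisted = float(c[1, 5])
print(seen_through_a, persisted)
```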

The other important thing worth commenting on is the offset. Suppose you want to take not the first 2 rows in b, but rows 150000 and 150001.

    b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2, 1000),
                     offset=150000*1000*32//8)
    b[1, 2] = 999999.
    print(a[150001, 2])
    # 999999.0

Now you can access and update any part of the array in simultaneous operations. Note the byte size going into the offset calculation: the offset is given in bytes and must be an integer (hence the integer division), so for a 'float64' array this example would use 150000*1000*64//8.
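The byte size does not have to be hard-coded; it can be read from the dtype. The sketch below assumes a row-major 2-D file, and `open_rows` is a hypothetical helper name introduced only for this illustration (it is not part of NumPy). It uses a smaller array than the example above so that it runs quickly.

```python
import os
import tempfile

import numpy as np


def open_rows(path, dtype, ncols, first_row, nrows, mode="r+"):
    """Map only rows first_row .. first_row + nrows - 1 of a row-major 2-D file."""
    itemsize = np.dtype(dtype).itemsize    # 4 for float32, 8 for float64
    offset = first_row * ncols * itemsize  # byte offset, always an int
    return np.memmap(path, dtype=dtype, mode=mode,
                     shape=(nrows, ncols), offset=offset)


path = os.path.join(tempfile.mkdtemp(), "test.mymemmap")
a = np.memmap(path, dtype="float64", mode="w+", shape=(2000, 10))

# Window over rows 1500 and 1501 of the same file.
b = open_rows(path, "float64", ncols=10, first_row=1500, nrows=2)
b[1, 2] = 999999.0
b.flush()
print(float(a[1501, 2]))
```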

Other references:

  • Is it possible to map a discontiuous data on disk to an array with python?

  • numpy.memmap documentation.

Saullo G. P. Castro answered Sep 18 '22 23:09
