Background I am working on a fairly computationally intensive project for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well. Requirements The key aspect of this particular program I must write is that it must: <ol> <li>Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)</li> <li>Process the data on each line.</li> <li>From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.</li> <li>These vectors must all be saved to disk in some format or other.</li> </ol> Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and by connection 4) Parenthesis: Technical Details In case the actual procedure for building vectors affects the solution: For each line in the corpus, one or more vectors must have its basis weights updated. If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index). Vectors do not depend on each other, nor does it matter which order the corpus lines are read in. Attempted Solutions There are three extrema when it comes to how to do this: <ol> <li>I could build all the vectors in memory. Then write them to disk.</li> <li>I could build all the vectors directly on the disk, using shelf of pickle or some such library.</li> <li>I could build the vectors in memory one at a time and writing it to disk, passing through the corpus once per vector.</li> </ol> All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons. Goals A good solution would involve: <ol> <li>Building as much as possible in memory.</li> <li>Once memory is full, dump everything to disk.</li> <li>If bits are needed from disk again, recover them back into memory to add stuff to those vectors. </li> <li>Go back to 1 until all vectors are built.</li> </ol> The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing. Question Does anyone know how to go about solving this sort of problem? I python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it? Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way. Additional Details The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus. Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?

A couple libraries come to mind which you might want to evaluate: <ul> <li>joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.</li> <li>mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.</li> </ul>

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background

I am working on a fairly computationally intensive project for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.

Requirements

The key aspect of this particular program I must write is that it must:

Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.

Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and by connection 4)

Parenthesis: Technical Details

In case the actual procedure for building vectors affects the solution:

For each line in the corpus, one or more vectors must have its basis weights updated.

If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).

Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.

Attempted Solutions

There are three extrema when it comes to how to do this:

I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on the disk, using shelf of pickle or some such library.
I could build the vectors in memory one at a time and writing it to disk, passing through the corpus once per vector.

All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.

Goals

A good solution would involve:

Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.

The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.

Question

Does anyone know how to go about solving this sort of problem? I python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?

Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.

Additional Details

The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.

Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?

Why is my python script using so much memory?

Python does not free memory back to the system immediately after it destroys some object instance. It has some object pools, called arenas, and it takes a while until those are released. In some cases, you may be suffering from memory fragmentation which also causes process' memory usage to grow.

How do I allocate more memory to a python script?

Python doesn't limit memory usage on your program. It will allocate as much memory as your program needs until your computer is out of memory. The most you can do is reduce the limit to a fixed upper cap. That can be done with the resource module, but it isn't what you're looking for.

Can python write to memory?

Use of memory-based workspaces in a Python is only valid for geoprocessing tools. Memory is not a general-purpose virtual directory where you can write files or other data.

take a look at pytables. One of the advantages is you can work with very large amounts of data, stored on disk, as if it were in memory.

edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seeking times. The size of your project is perfect for todays affordable SSD 'drives'.

A couple libraries come to mind which you might want to evaluate:

joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.

Minimising reading from and writing to disk in Python for a memory-heavy operation

Tags:

python

io

memory

Edward Grefenstette

People also ask

2 Answers

Remi

zeekay

Recent Activity

Donate For Us

Minimising reading from and writing to disk in Python for a memory-heavy operation

Tags:

python

io

memory

Edward Grefenstette

People also ask

2 Answers

Remi

zeekay

Related questions

Recent Activity

Donate For Us