 

Efficient ways to write a large NumPy array to a file

I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration's results appended to the bottom of the array of previous results.

Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory, and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data a function can return outright, to keep down transmission costs. Which is fine, except that would mean launching thousands of jobs, each running one iteration, with considerable overhead.

It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.

Is my best bet just to dump it into a CSV file? Should I add to the CSV file each iteration, or hold everything in an array until the end and write once? Is there something terribly clever I'm missing?

Fomite, asked Jan 08 '12


People also ask

How do I write a NumPy array to a file?

You can save your NumPy arrays to CSV files using the savetxt() function. This function takes a filename and an array as arguments and saves the array in text form. To get CSV, also specify the delimiter: the character used to separate each value in the file, which for CSV is a comma (savetxt's default is a space).
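For instance, a minimal sketch (the filename and array here are placeholders):

    import numpy as np

    arr = np.random.rand(30, 1500)                # stand-in for one iteration's output
    np.savetxt('output.csv', arr, delimiter=',')  # delimiter=',' gives true CSV
    back = np.loadtxt('output.csv', delimiter=',')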

How do you deal with a large NumPy array?

Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory. A common solution is to use memory mapping and implement out-of-core computations. The array is stored in a file on the hard drive, and we create a memory-mapped object to this file that can be used as a regular NumPy array.
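A rough sketch of the idea (the shape, dtype, and filename are assumptions; memmap needs them known up front):

    import numpy as np

    # Create a disk-backed array; it behaves like a regular ndarray
    mm = np.memmap('big.dat', dtype='float64', mode='w+', shape=(30000, 1500))
    mm[:30] = np.random.rand(30, 1500)  # write one iteration's block
    mm.flush()                          # push pending changes to disk

    # Reopen later, read-only, without loading the whole file into RAM
    ro = np.memmap('big.dat', dtype='float64', mode='r', shape=(30000, 1500))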

How do I save a large NumPy array in Python?

Saving a NumPy array to a .npy file. If you want to efficiently store the data in a NumPy array and use it only from other Python programs, you can store it in .npy files. Saving NumPy data to .npy files makes saving and loading more efficient, as the data is stored in a native binary format.


3 Answers

Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.

The most efficient is probably tofile (doc), which is intended for quick dumps of an array to disk when you know all of the attributes of the data (dtype, shape, byte order) ahead of time.
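A small sketch of the round trip (filenames are arbitrary; note that tofile writes raw bytes, so you must remember the dtype and shape yourself):

    import numpy as np

    arr = np.random.rand(30, 1500)
    arr.tofile('results.bin')  # raw binary: no shape/dtype/endianness metadata

    # Reading back requires supplying dtype and shape manually
    back = np.fromfile('results.bin', dtype='float64').reshape(30, 1500)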

For platform-independent, but numpy-specific, saves, you can use save (doc).
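For comparison, save records shape and dtype in the .npy header, so the round trip needs no bookkeeping:

    import numpy as np

    arr = np.random.rand(30, 1500)
    np.save('results.npy', arr)    # .npy header stores shape, dtype, byte order
    back = np.load('results.npy')  # restores the array exactly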

If you need portability to other tools, the scientific Python ecosystem also supports formats such as HDF5, via third-party packages like h5py or PyTables.
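A minimal HDF5 sketch, assuming the third-party h5py package is installed (the file and dataset names are placeholders):

    import numpy as np
    import h5py  # pip install h5py

    arr = np.random.rand(30, 1500)
    with h5py.File('results.h5', 'w') as f:
        f.create_dataset('iterations', data=arr, compression='gzip')

    with h5py.File('results.h5', 'r') as f:
        back = f['iterations'][:]  # slicing reads the data into memory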

Andrew Jaffe, answered Nov 08 '22


I would recommend looking at the pickle module. The pickle module allows you to serialize Python objects as streams of bytes. This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.
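A minimal sketch of that round trip (the filename is arbitrary):

    import pickle
    import numpy as np

    arr = np.random.rand(30, 1500)
    with open('results.pkl', 'wb') as f:
        pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol

    with open('results.pkl', 'rb') as f:
        back = pickle.load(f)  # reinstantiates the array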

HardlyKnowEm, answered Nov 08 '22


Try Joblib - Fast compressed persistence

One of the key components of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with NumPy arrays. The trick to achieving great speed is to save the NumPy arrays in separate files and load them back via memory mapping.
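A minimal sketch, assuming joblib is installed (memory-mapped loading requires the dump to be uncompressed):

    import numpy as np
    import joblib  # pip install joblib

    arr = np.random.rand(30, 1500)
    joblib.dump(arr, 'results.joblib')  # uncompressed so it can be memmapped

    # mmap_mode='r' memory-maps the arrays instead of reading them eagerly
    back = joblib.load('results.joblib', mmap_mode='r')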

Edit: Newer (2016) blog entry on data persistence in Joblib

reclosedev, answered Nov 08 '22