Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Saving many arrays of different lengths

Tags:

python

numpy

I have ~8000 arrays of two-dimensional points, stored in memory as a Python list of numpy arrays. Each array has shape (x,2), where x is a number between ~600 and ~4000. Essentially, I have a jagged 3-d array.

I want to store this data in a convenient/fast format for reading/writing from disk. I'd rather not create ~8000 separate files, but I'd also rather not pad out a full (8000,4000,2) matrix with zeros if I can avoid it.

How should I store my data on disk, such that both filesize and parsing/serialization are minimized?

like image 398
perimosocordiae Avatar asked Mar 25 '14 17:03

perimosocordiae


People also ask

Can you multiply arrays of different sizes?

You can only multiply two matrices if their dimensions are compatible , which means the number of columns in the first matrix is the same as the number of rows in the second matrix.

Can we concatenate the arrays with different dimensions?

Join a sequence of arrays along an existing axis. The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default). The axis along which the arrays will be joined.

Do arrays use less memory than lists?

1. Since the array in Python is more compact and consumes less memory than a list, it is preferred to use an array when a large amount of data needs to be stored. 2. It is unnecessary to use a list to store the data when all elements are of the same data type and hence an array will be more efficient here.

How to create 2 dimensional array in Python using NumPy?

Creating a Two-dimensional Array If you only use the arange function, it will output a one-dimensional array. To make it a two-dimensional array, chain its output with the reshape function. First, 20 integers will be created and then it will convert the array into a two-dimensional array with 4 rows and 5 columns.


1 Answers

There's a standard called HDF for storing large number data sets. You can find some information in the following link but in general terms, HDF defines a binary file format that can be used for large information storing.

You can find a example here that stores large Numpy arrays on disk. In that post, the writer makes a comparison between Python Pickle and HDF5.

I also recommend you this introduction to HDF5. Here's th h5py package, that is a Pythonic interface to the HDF5 binary data format.

like image 90
ederollora Avatar answered Oct 09 '22 14:10

ederollora