
Data compression in python/numpy

I'm looking at using the Amazon cloud for all my simulation needs. The resulting sim files are quite large, and I would like to move them over to my local drive for ease of analysis, etc. You pay for the amount of data you transfer, so I want to compress all my sim solutions to be as small as possible. They are simply numpy arrays saved in the form of .mat files, using:

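import scipy.io as sio
# savemat needs a dict of variable names to arrays; `sim_array` is a placeholder name
sio.savemat(filepath, {'sim': sim_array}, do_compression=True)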

So my question is: what is the best way to compress numpy arrays? They are currently stored in .mat files, but I could store them using any Python method. Should I rely on compression when saving from Python, on Linux compression tools, or on both?

I am in a Linux environment, and I am open to any kind of file compression.

asked Aug 18 '11 23:08 by tylerthemiler



1 Answer

Unless you know something special about the arrays (e.g. sparseness, or some pattern), you aren't going to do much better than the default compression, with maybe gzip on top of that. In fact, you may not even need to gzip the files if you're using HTTP for downloads and your server is configured to compress responses. On the same data, the output sizes of good lossless compression algorithms rarely differ by more than 10%.
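One way to check this on your own data is to compress the raw array bytes with a few standard-library codecs and compare the sizes. The following is only a rough sketch: the helper `compressed_sizes` and the two test arrays are made up for illustration, and real simulation output may behave differently.

```python
import bz2
import lzma
import zlib

import numpy as np


def compressed_sizes(arr):
    """Return {codec name: size in bytes} for the array's raw buffer."""
    raw = np.ascontiguousarray(arr).tobytes()
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }


rng = np.random.default_rng(0)
noisy = rng.standard_normal(100_000)  # dense float noise: little structure to exploit
sparse = np.zeros(100_000)
sparse[::1000] = 1.0                  # mostly zeros: a pattern the codecs can exploit

for name, arr in [("noisy", noisy), ("sparse", sparse)]:
    print(name, compressed_sizes(arr))
```

If the codecs land close to each other on a representative file, the choice of algorithm matters less than whether the data has structure to exploit.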

If savemat works as advertised, you should be able to get gzip compression entirely in Python with:

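import gzip
import scipy.io as sio

# savemat needs a dict of variable names to arrays; `sim_array` is a placeholder name
with gzip.open(filepath_dot_gz, 'wb') as f_out:
    sio.savemat(f_out, {'sim': sim_array}, do_compression=True)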
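
Since the question allows any Python storage method, a possible alternative to .mat files (not part of the original answer) is NumPy's own compressed archive format via `np.savez_compressed`. This is a minimal sketch; the array contents, the key `solution`, and the file name are placeholders chosen for illustration.

```python
import os
import tempfile

import numpy as np

# Placeholder data standing in for a simulation result.
solution = np.zeros((500, 100))
solution[::10] = 1.0

path = os.path.join(tempfile.mkdtemp(), "sim_solution.npz")

# Write a zip archive in which each array is deflate-compressed.
np.savez_compressed(path, solution=solution)

# Read it back; arrays are decompressed when accessed by key.
with np.load(path) as data:
    restored = data["solution"]

print(os.path.getsize(path), "bytes on disk vs", solution.nbytes, "bytes in memory")
```

The archive uses the same deflate family as gzip, so it should land in roughly the same size range while avoiding a separate compression step.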
answered Sep 25 '22 07:09 by mjhm