How to save dictionaries and arrays in the same archive (with numpy.savez)

Tags:

python

dictionary

numpy

First question here. I'll try to be concise.

I am generating multiple arrays containing feature information for a machine learning application. As the arrays do not have equal dimensions, I store them in a dictionary rather than in an array. There are two different kinds of features, so I am using two different dictionaries.

I also generate labels to go with the features. These labels are stored in arrays. Additionally, there are strings containing the exact parameters used for running the script and a timestamp.

All in all it looks like this:

import numpy as np    

feature1 = {}
feature2 = {}
label1 = np.array([])
label2 = np.array([])
docString = 'Commands passed to the script were...'

# features look like this:
feature1 = {'case 1': np.array([1, 2, 3, ...]),
            'case 2': np.array([2, 1, 3, ...]),
            'case 3': np.array([2, 3, 1, ...]),
            # and so on...
            }

Now my goal would be to do this:

np.savez(outputFile, 
         saveFeature1 = feature1, 
         saveFeature2 = feature2, 
         saveLabel1 = label1, 
         saveLabel2 = label2,
         saveString = docString)

This seemingly works (i.e. such a file is saved with no error thrown and can be loaded again). However, when I try to load for example the feature from the file again:

loadedArchive = np.load(outFile)
loadedFeature1 = loadedArchive['saveFeature1']
loadedString = loadedArchive['saveString']

Then instead of getting a dictionary back, I get a 0-dimensional numpy object array (shape ()) whose contents I don't know how to access:

In []: loadedFeature1
Out[]: 
       array({'case 1': array([1, 2, 3, ...]), 
              'case 2': array([2, 3, 1, ...]), 
              ..., }, dtype=object)

Also strings become arrays and get a strange datatype:

In []: loadedString.dtype
Out[]: dtype('|S20')

So in short, I assume this is not the correct way to do it. However, I would prefer not to put all variables into one big dictionary, because I will retrieve them in another process and would like to just loop over dictionary.keys() without worrying about string comparison.

Any ideas are greatly appreciated. Thanks

asked Apr 09 '12 15:04

surchs

3 Answers

As @fraxel has already suggested, using pickle is a much better option in this case. Just save a dict with your items in it.

However, be sure to use pickle with a binary protocol. By default, Python 2's pickle uses a less efficient text format (protocol 0), which will result in excessive memory usage and huge files if your arrays are large.

import pickle

# Collect everything in one dict; the output path goes to open(), not into the dict.
saved_data = dict(saveFeature1 = feature1,
                  saveFeature2 = feature2,
                  saveLabel1 = label1,
                  saveLabel2 = label2,
                  saveString = docString)

with open('test.dat', 'wb') as outfile:
    pickle.dump(saved_data, outfile, protocol=pickle.HIGHEST_PROTOCOL)
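
Reading the file back in the other process is the mirror image of the dump. The following is only a minimal sketch: the small dict and arrays are made-up stand-ins for the question's data, and the file is written to a temporary directory rather than to the current one.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-ins for the features, labels and string in the question.
saved_data = dict(saveFeature1={'case 1': np.arange(3), 'case 2': np.arange(5)},
                  saveLabel1=np.arange(4),
                  saveString='Commands passed to the script were...')

# Written to a temporary directory so the sketch leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), 'test.dat')
with open(path, 'wb') as outfile:
    pickle.dump(saved_data, outfile, protocol=pickle.HIGHEST_PROTOCOL)

# Loading must also use binary mode.
with open(path, 'rb') as infile:
    loaded_data = pickle.load(infile)

# Every item comes back with its original type, so looping over the keys is enough.
for key in sorted(loaded_data):
    print(key, type(loaded_data[key]).__name__)
```

The dicts come back as dicts and the string as a plain string, so no unwrapping step is needed on the reading side.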

That having been said, let's take a look at what's happening in more detail for illustrative purposes.

numpy.savez expects each item to be an array. In fact, it calls np.asarray on everything you pass in.

If you turn a dict into an array, you'll get an object array. E.g.

import numpy as np

test = {'a':np.arange(10), 'b':np.arange(20)}
testarr = np.asarray(test)

Similarly, if you make an array out of a string, you'll get a string array:

In [1]: np.asarray('abc')
Out[1]: 
array('abc', 
      dtype='|S3')

However, because of a quirk in the way object arrays are handled, if you pass in a single object (in your case, your dict) that isn't a tuple, list, or array, you'll get a 0-dimensional object array.

This means that you can't index it directly. In fact, doing testarr[0] will raise an IndexError. The data is still there, but you need to add a dimension first, so you have to do yourdictionary = testarr.reshape(-1)[0].
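
The behaviour described above can be checked in a few lines. This is just a sketch reusing the small test dict from earlier; .item() is shown as an alternative to the reshape trick for pulling the single object out of a 0-dimensional array.

```python
import numpy as np

test = {'a': np.arange(10), 'b': np.arange(20)}
testarr = np.asarray(test)

print(testarr.shape)   # an empty shape tuple: the array is 0-dimensional
print(testarr.dtype)   # object

# Indexing a 0-dimensional array fails.
try:
    testarr[0]
except IndexError as err:
    print('IndexError:', err)

# Two ways to get the dict back out of the array.
via_reshape = testarr.reshape(-1)[0]
via_item = testarr.item()
print(via_reshape is test, via_item is test)
```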

If all of this seems clunky, it's because it is. Object arrays are essentially always the wrong answer. (Although asarray should arguably pass in ndmin=1 to array, which would solve this particular problem, but potentially break other things.)

savez is intended to store arrays, rather than arbitrary objects. Because of the way it works, it can store completely arbitrary objects, but it shouldn't be used that way.

If you did want to use it, though, a quick workaround would be to do:

np.savez(outputFile, 
         saveFeature1 = [feature1], 
         saveFeature2 = [feature2], 
         saveLabel1 = [label1], 
         saveLabel2 = [label2],
         saveString = docString)

And you'd then access things with

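# Recent NumPy versions refuse to load object arrays unless allow_pickle=True is passed.
loadedArchive = np.load(outFile, allow_pickle=True)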
loadedFeature1 = loadedArchive['saveFeature1'][0]
loadedString = str(loadedArchive['saveString'])

However, this is clearly much more clunky than just using pickle. Use numpy.savez when you're just saving arrays. In this case, you're saving nested data structures, not arrays.
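
If the archive should stay a plain .npz, one way to follow that advice is to give every array its own key, so that savez only ever sees arrays. The sketch below is one possible layout and not something from the question: the double-underscore prefix convention and the helper names flatten and unflatten are made up for illustration, and the data are small stand-ins.

```python
import os
import tempfile

import numpy as np

def flatten(prefix, arrays):
    """Map {'case 1': arr} to {'prefix__case 1': arr} (hypothetical naming scheme)."""
    return {prefix + '__' + name: arr for name, arr in arrays.items()}

def unflatten(prefix, archive):
    """Collect every 'prefix__...' entry of a loaded archive back into a dict."""
    start = prefix + '__'
    return {key[len(start):]: archive[key]
            for key in archive.files if key.startswith(start)}

# Made-up stand-ins for the question's data.
feature1 = {'case 1': np.arange(3), 'case 2': np.arange(5)}
label1 = np.arange(4)

path = os.path.join(tempfile.mkdtemp(), 'features.npz')
np.savez(path, label1=label1, **flatten('feature1', feature1))

# Only plain numeric arrays were stored, so no pickling is involved on load.
with np.load(path) as archive:
    restored = unflatten('feature1', archive)
    restored_label1 = archive['label1']

print(sorted(restored))
```

The cost is that the grouping lives in the key names, so this trades the object-array workaround for some string handling on the reading side.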

answered Sep 16 '22 11:09

Joe Kington

If you need to save your data in a structured way, you should consider using the HDF5 file format (http://www.hdfgroup.org/HDF5/). It is very flexible, easy to use, efficient, and other software might already support it (HDFView, Mathematica, Matlab, Origin, ...). There is a simple Python binding called h5py.

You can store datasets in a filesystem-like structure and define attributes for each dataset, like a dictionary. For example:

import numpy as np
import h5py

# some data
table1 = np.array([(1,1), (2,2), (3,3)], dtype=[('x', float), ('y', float)])
table2 = np.ones(shape=(3,3))

# save data to file
h5file = h5py.File("test.h5", "w")
h5file.create_dataset("Table1", data=table1)
h5file.create_dataset("Table2", data=table2, compression=True)
# add attributes
h5file["Table2"].attrs["attribute1"] = "some info"
h5file["Table2"].attrs["attribute2"] = 42
h5file.close()

Reading the data is also simple; you can even load just a few elements out of a large file if you want:

h5file = h5py.File("test.h5", "r")
# read from file (numpy-like behavior)
print(h5file["Table1"]["x"][:2])
# read everything into memory (real numpy array)
print(np.array(h5file["Table2"]))
# read attributes
print(h5file["Table2"].attrs["attribute1"])
h5file.close()

More features and possibilities are found in the documentation and on the websites (the Quick Start Guide might be of interest).

answered Sep 19 '22 11:09

pwuertz

2022 Update

There is a much simpler solution to this question using Numpy's np.load(..., allow_pickle=True).

I first save an npz file as described in the question.

import numpy as np    

feature1 = {'case 1': np.arange(2), 'case 2': np.arange(3)}
feature2 = {'case 3': np.arange(4), 'case 4': np.arange(5)}
label1 = np.arange(6)
label2 = np.arange(7)
docstring = 'Commands passed to the script were...'

np.savez('test', feature1=feature1, feature2=feature2, 
        label1=label1, label2=label2, docstring=docstring)

Now one can read the file as follows

data = np.load('test.npz', allow_pickle=True)

# This is a 0-dimensional object array: extract the dict with .item()
feature1 = data["feature1"].item()

print("feature1 =", feature1)

# This is a normal array already
label1 = data["label1"]

print("label1 =", label1)

It produces the following

feature1 = {'case 1': array([0, 1]), 'case 2': array([0, 1, 2])}
label1 = [0 1 2 3 4 5]
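
To loop over everything in the archive, as the question asks, the 0-dimensional entries (the dicts and the string) can be unwrapped with .item() while ordinary arrays are left alone. A minimal sketch, assuming a file saved the same way as above but written to a temporary directory; the helper name unwrap is made up for illustration.

```python
import os
import tempfile

import numpy as np

feature1 = {'case 1': np.arange(2), 'case 2': np.arange(3)}
label1 = np.arange(6)
docstring = 'Commands passed to the script were...'

path = os.path.join(tempfile.mkdtemp(), 'test.npz')
np.savez(path, feature1=feature1, label1=label1, docstring=docstring)

def unwrap(value):
    """Return the Python object inside a 0-dimensional array, else the array itself."""
    return value.item() if value.ndim == 0 else value

# allow_pickle=True is needed because the dict is stored as a pickled object array.
with np.load(path, allow_pickle=True) as data:
    loaded = {key: unwrap(data[key]) for key in data.files}

for key, value in loaded.items():
    print(key, type(value).__name__)
```

After this, loaded is an ordinary dict whose values are a dict, an array and a string.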

answered Sep 17 '22 11:09

divenex
