I am trying to read the mat file given in the following website, ufldl.stanford.edu/housenumbers, in the file train.tar.gz, there is a mat file named digitStruct.mat.
when i used scipy.io to read the mat file, it alerts me with the message ' please use hdf reader for matlab v7.3 files '.
the original matlab file is provided as below
load digitStruct.mat
for i = 1:length(digitStruct)
im = imread([digitStruct(i).name]);
for j = 1:length(digitStruct(i).bbox)
[height, width] = size(im);
aa = max(digitStruct(i).bbox(j).top+1,1);
bb = min(digitStruct(i).bbox(j).top+digitStruct(i).bbox(j).height, height);
cc = max(digitStruct(i).bbox(j).left+1,1);
dd = min(digitStruct(i).bbox(j).left+digitStruct(i).bbox(j).width, width);
imshow(im(aa:bb, cc:dd, :));
fprintf('%d\n',digitStruct(i).bbox(j).label );
pause;
end
end
as shown above, the mat file has the key 'digitStruct', and within 'digitStruct', key 'name' and 'bbox' can be found, I used h5py API to read the file.
import h5py
f = h5py.File('train.mat')
print len( f['digitStruct']['name'] ), len(f['digitStruct']['bbox'] )
I can read the array, however when I loop though the array, how can I read each item?
for i in f['digitStruct']['name']:
print i # only print out the HDF5 ref
By default, Python is not capable of reading . mat files.
How to Open an MAT File. MAT files that are Microsoft Access Shortcut files can be created by dragging a table out of Access and to the desktop or into another folder. Microsoft Access needs to be installed in order to use them. MATLAB from MathWorks can open MAT files that are used by that program.
Writing in Matlab:
test = {'Hello', 'world!'; 'Good', 'morning'; 'See', 'you!'};
save('data.mat', 'test', '-v7.3') % v7.3 so that it is readable by h5py
Reading in Python (works for any number or rows or columns, but assumes that each cell is a string):
import h5py
import numpy as np
data = []
with h5py.File("data.mat") as f:
for column in f['test']:
row_data = []
for row_number in range(len(column)):
row_data.append(''.join(map(unichr, f[column[row_number]][:])))
data.append(row_data)
print data
print np.transpose(data)
Output:
[[u'Hello', u'Good', u'See'], [u'world!', u'morning', u'you!']]
[[u'Hello' u'world!']
[u'Good' u'morning']
[u'See' u'you!']]
import numpy as np
import cPickle as pickle
import h5py
f = h5py.File('train/digitStruct.mat')
metadata= {}
metadata['height'] = []
metadata['label'] = []
metadata['left'] = []
metadata['top'] = []
metadata['width'] = []
def print_attrs(name, obj):
vals = []
if obj.shape[0] == 1:
vals.append(int(obj[0][0]))
else:
for k in range(obj.shape[0]):
vals.append(int(f[obj[k][0]][0][0]))
metadata[name].append(vals)
for item in f['/digitStruct/bbox']:
f[item[0]].visititems(print_attrs)
with open('train_metadata.pickle','wb') as pf:
pickle.dump(metadata, pf, pickle.HIGHEST_PROTOCOL)
I modified it from https://discussions.udacity.com/t/how-to-deal-with-mat-files/160657/3. Honestly, I can't figure out exactly what's going on with visititmes()
. The HDF5 file is just too hierarchy and too abstract.
This metadata is a dictionary. The content of each key is an embedded array. The array has 33402 items, which correspond to the png file with the name in order. Each item is an array with length from 1~6. I count the numbers of different digits, which is 5137, 18130, 8691, 1434, 9,1.
What surprises me is that the pickle file is only 9 MB, more than 20 times smaller than the mat file. I guess the HDS file sacrifice storing space for hierarchy structure.
Note: I have converted the values to integers, for the sake of slicing images. Now the train_metadata.pickle file has a size of only 2 MB, 100 times than the mat file.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With