I'm using PyTables 2.2.1 w/ Python 2.6, and I would like to create a table which contains nested arrays of variable length.
I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to data 'info2/info3/x' and 'info2/info3/y'?
For perhaps an easier to understand table structure, here's my homegrown example:
"""Desired Pytable output:
DIEM TEMPUS Temperature Data
5 0 100 Category1 <--||--> Category2
x <--| |--> y z <--|
0 0 0
2 1 1
4 1.33 2.67
6 1.5 4.5
8 1.6 6.4
5 1 99
2 2 0
4 2 2
6 2 4
8 2 6
5 2 96
4 4 0
6 3 3
8 2.67 5.33
Note that nested arrays have variable length.
"""
import tables as ts
tableDef = {'DIEM': ts.Int32Col(pos=0),
'TEMPUS': ts.Int32Col(pos=1),
'Temperature' : ts.Float32Col(pos=2),
'Data':
{'Category1':
{
'x': ts.Float32Col(),
'y': ts.Float32Col()
},
'Category2':
{
'z': ts.Float32Col(),
}
}
}
# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/'+tableName
table = fh.getNode(tablePath)
# get row iterator
row = table.row
for i in xrange(3):
print '\ni=', i
# calc some fake data
row['DIEM'] = 5
row['TEMPUS'] = i
row['Temperature'] = 100-i**2
for j in xrange(5-i):
# Note that nested array has variable number of rows
print 'j=', j,
# calc some fake nested data
val1 = 2.0*(i+j)
val2 = val1/(j+1.0)
val3 = val1 - val2
''' Magic happens here...
How do I write 'j' rows of data to the elements of
Category1 and/or Category2?
In bastardized pseudo-code, I want to do:
row['Data/Category1/x'][j] = val1
row['Data/Category1/y'][j] = val2
row['Data/Category2/z'][j] = val3
'''
row.append()
table.flush()
fh.close()
I have not found any indication in the PyTables docs that such a structure is not possible... but in case such a structure is in fact not possible, what are my alternatives to variable length nested columns?
Any assistance is greatly appreciated!
EDIT w/ additional info: It appears that the PyTables gurus have already addressed the "is such a structure possible" question:
PyTables Mail Forum - Hierachical Datasets
So has anyone figured out a way to create an analogous PyTable data structure?
Thanks again!
I have a similar task: to dump fixed size data with arrays of a variable length.
I first tried using fixed size StringCol(64*1024) fields to store my variable length data (they are always < 64K). But it was rather slow and wasted a lot of disk space, despite blosc compression.
After days of investigation I ended with the following solution:
(spoiler: we store array fields in separate EArray instances, one EArray per one array field)
I added 2 additional fields to these tables: arrFieldName_Offset and arrFieldName_Length:
class Particle(IsDescription):
idnumber = Int64Col()
ADCcount = UInt16Col()
TDCcount = UInt8Col()
grid_i = Int32Col()
grid_j = Int32Col()
pressure = Float32Col()
energy = FloatCol()
buffer_Offset = UInt32() # note this field!
buffer_Length = UInt32() # and this one too!
I also create one EArray instance per each array field:
datatype = StringAtom(1)
buffer = h5file.createEArray('/detector', 'arr', datatype, (0,), "")
Then I add rows corresponding to a fixed size data:
row['idnumber'] = ...
...
row['energy'] = ...
row['buffer_Offset'] = buffer.nrows
# my_buf is a string (I get it from a stream)
row['buffer_Length'] = len(my_buf)
table.append(row)
Ta-dah! Add the buffer into the array.
buffer.append(np.ndarray((len(my_buf),), buffer=my_buf, dtype=datatype))
That's the trick. In my experiments this approach is 2-10x times faster than storing ragged fixed sized arrays (like StringAtom(HUGE_NUMBER)) and the resulting DB is few times smaller (2-5x)
Getting the buffer data is easy. Suppose that row is a single row you read from your DB:
# Open array for reading
buffer = h5file.createEArray('/detector', 'Particle.buffer', datatype, (0,), "")
...
row = ...
...
bufferDataYouNeed = buffer[ row['buffer_Offset'] : row['buffer_Offset'] + row['buffer_Length']]
This is a common thing that folks starting out with PyTables want to do. Certainly, it was the first thing I tried to do. As of 2009, I don't think this functionality was supported. You can look here for one solution "I always recommend":
http://www.mail-archive.com/[email protected]/msg01207.html
In short, just put each VLArray in a separate place. If you do that, maybe you don't end up needing VLArrays. If you store separate VLArrays for each trial (or whatever), you can keep metadata on those VLArrays (guaranteed to stay in sync with the array across renames, moves, etc.) or put it in a table (easier to search).
But you may also do well to pick whatever a single time-point would be for your column atom, then simply add another column for a time stamp. This would allow for a "ragged" array that still has a regular, repeated (tabular) structure in memory. For example:
Trial Data
1 0.4, 0.5, 0.45
2 0.3, 0.4, 0.45, 0.56
becomes
Trial Timepoint Data
1 1 0.4
1 2 0.5
...
2 4 0.56
Data above is a single number, but it could be, e.g. a 4x5x3 atom.
If nested VLArrays are supported in PyTables now, I'd certainly love to know!
Alternatively, I think h5py does support the full HDF5 feature-set, so if you're really committed to the nested data layout, you may have more luck there. You'll be losing out on a lot of nice features though! And in my experience, naive neuroscientists end up with quite poor performance since they don't get pytables intelligent choices for data layout, chunking, etc. Please report back if you go that route!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With