In PyTables, how do I create a nested array of variable length?

I'm using PyTables 2.2.1 with Python 2.6, and I would like to create a table that contains nested arrays of variable length.

I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to the 'info2/info3/x' and 'info2/info3/y' fields?

For perhaps an easier to understand table structure, here's my homegrown example:

"""Desired Pytable output:

DIEM    TEMPUS  Temperature             Data
5       0       100         Category1 <--||-->  Category2
                         x <--| |--> y          z <--|
                        0           0           0
                        2           1           1
                        4           1.33        2.67
                        6           1.5         4.5
                        8           1.6         6.4
5       1       99
                        2           2           0   
                        4           2           2
                        6           2           4
                        8           2           6
5       2       96
                        4           4           0
                        6           3           3
                        8           2.67        5.33


Note that nested arrays have variable length.
"""

import tables as ts

tableDef =      {'DIEM': ts.Int32Col(pos=0),
                'TEMPUS': ts.Int32Col(pos=1), 
                'Temperature' : ts.Float32Col(pos=2),
                'Data': 
                    {'Category1': 
                        {
                        'x': ts.Float32Col(), 
                        'y': ts.Float32Col()
                        }, 
                    'Category2': 
                        {
                        'z': ts.Float32Col(), 
                        }
                    }
                }

# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/'+tableName
table = fh.getNode(tablePath)

# get row iterator
row = table.row
for i in xrange(3):
    print '\ni=', i
    # calc some fake data
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100-i**2

    for j in xrange(5-i):
        # Note that nested array has variable number of rows
        print 'j=', j,
        # calc some fake nested data
        val1 = 2.0*(i+j)
        val2 = val1/(j+1.0)
        val3 = val1 - val2

        ''' Magic happens here...
        How do I write 'j' rows of data to the elements of 
        Category1 and/or Category2?

        In bastardized pseudo-code, I want to do:

        row['Data/Category1/x'][j] = val1
        row['Data/Category1/y'][j] = val2
        row['Data/Category2/z'][j] = val3
        '''

    row.append()
table.flush()

fh.close()

I have not found any indication in the PyTables docs that such a structure is not possible... but in case such a structure is in fact not possible, what are my alternatives to variable-length nested columns?

  • EArray? VLArray? If so, how would I integrate these data types into the structure described above?
  • some other idea?

Any assistance is greatly appreciated!

EDIT w/ additional info: It appears that the PyTables gurus have already addressed the "is such a structure possible" question:

PyTables Mail Forum - Hierarchical Datasets

So has anyone figured out a way to create an analogous PyTable data structure?

Thanks again!


2 Answers

I had a similar task: dumping fixed-size data together with arrays of variable length.

I first tried using fixed-size StringCol(64*1024) fields to store my variable-length data (they are always < 64K). But it was rather slow and wasted a lot of disk space, despite blosc compression.

After days of investigation, I ended up with the following solution:

(spoiler: we store the array fields in separate EArray instances, one EArray per array field)

  1. I store the fixed-size data in a regular PyTables table.
  2. I add two additional fields to the table for each array field: arrFieldName_Offset and arrFieldName_Length:

    from tables import (IsDescription, Int64Col, UInt16Col, UInt8Col,
                        Int32Col, Float32Col, FloatCol, UInt32Col)

    class Particle(IsDescription):
       idnumber  = Int64Col()
       ADCcount  = UInt16Col()
       TDCcount  = UInt8Col()
       grid_i    = Int32Col()
       grid_j    = Int32Col()
       pressure  = Float32Col()
       energy    = FloatCol()
       buffer_Offset = UInt32Col() # note this field!
       buffer_Length = UInt32Col() # and this one too!
    
  3. I also create one EArray instance per array field:

    from tables import StringAtom
    datatype = StringAtom(1)  # one byte per EArray element
    buffer = h5file.createEArray('/detector', 'arr', datatype, (0,), "")
    
  4. Then I append rows holding the fixed-size data:

    row = table.row
    row['idnumber'] = ...
    ...
    row['energy'] = ...
    row['buffer_Offset'] = buffer.nrows  # current end of the shared EArray
    # my_buf is a string (I get it from a stream)
    row['buffer_Length'] = len(my_buf)
    row.append()
    
  5. Ta-dah! Append the buffer itself to the EArray.

    import numpy as np
    buffer.append(np.ndarray((len(my_buf),), buffer=my_buf, dtype='S1'))  # 'S1' = one char per element
    
  6. That's the trick. In my experiments this approach is 2-10x faster than storing ragged data in huge fixed-size fields (like StringCol(HUGE_NUMBER)), and the resulting DB is a few times smaller (2-5x).

  7. Getting the buffer data back is easy. Suppose row is a single row you read from your DB:

    # Open the existing array for reading (don't create it again)
    buffer = h5file.getNode('/detector', 'arr')
    ...
    row = ...
    ...
    bufferDataYouNeed = buffer[row['buffer_Offset'] : row['buffer_Offset'] + row['buffer_Length']]
    
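Putting it all together, here's a minimal end-to-end sketch of this pattern. The file name, the 'readout' table name, and the fake my_buf payloads are just illustrative, and I'm using the PyTables 2.x camelCase API to match the question, so adjust for your version:

    import numpy as np
    import tables as ts

    class Particle(ts.IsDescription):
        idnumber      = ts.Int64Col()
        energy        = ts.FloatCol()
        buffer_Offset = ts.UInt32Col()  # where this row's bytes start in the EArray
        buffer_Length = ts.UInt32Col()  # how many bytes belong to this row

    h5file = ts.openFile('particles.h5', 'w')
    group  = h5file.createGroup('/', 'detector')
    table  = h5file.createTable(group, 'readout', Particle)
    barr   = h5file.createEArray(group, 'arr', ts.StringAtom(1), (0,), 'raw buffers')

    row = table.row
    for i in xrange(3):
        my_buf = 'data' * (i + 1)           # variable-length stand-in payload
        row['idnumber'] = i
        row['energy'] = 1.5 * i
        row['buffer_Offset'] = barr.nrows   # current end of the shared EArray
        row['buffer_Length'] = len(my_buf)
        row.append()
        barr.append(np.fromstring(my_buf, dtype='S1'))  # one byte per element
    table.flush()

    # read each row's buffer back out
    for r in table.iterrows():
        start = r['buffer_Offset']
        print r['idnumber'], barr[start : start + r['buffer_Length']].tostring()

    h5file.close()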

This is a common thing that folks starting out with PyTables want to do; certainly, it was the first thing I tried. As of 2009, I don't think this functionality was supported. Here is the solution I always recommend:

http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg01207.html

In short, just put each VLArray in a separate place. If you do that, maybe you don't end up needing VLArrays. If you store separate VLArrays for each trial (or whatever), you can keep metadata on those VLArrays (guaranteed to stay in sync with the array across renames, moves, etc.) or put it in a table (easier to search).
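
To make that concrete, here's a rough sketch of the layout, with one VLArray per ragged field. The node names and the sample numbers (taken from the question's desired output) are just illustrative, and I'm using the PyTables 2.x API:

    import tables as ts

    h5file = ts.openFile('trials.h5', 'w')

    # one VLArray per ragged field; the Nth entry belongs to trial N
    x = h5file.createVLArray('/', 'x', ts.Float32Atom(), "Category1/x")
    y = h5file.createVLArray('/', 'y', ts.Float32Atom(), "Category1/y")

    x.append([0.0, 2.0, 4.0, 6.0, 8.0])  # trial 0: five samples
    y.append([0.0, 1.0, 1.33, 1.5, 1.6])
    x.append([2.0, 4.0, 6.0, 8.0])       # trial 1: four samples
    y.append([2.0, 2.0, 2.0, 2.0])

    # metadata rides along with the array itself, surviving renames/moves
    x.attrs.DIEM = 5

    print x[1]  # the four float32 values for trial 1
    h5file.close()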

But you may also do well to pick whatever a single time-point would be as your column atom, and then simply add another column for a time stamp. This allows for a "ragged" array that still has a regular, repeated (tabular) structure in memory. For example:

Trial Data
1     0.4, 0.5, 0.45
2     0.3, 0.4, 0.45, 0.56

becomes

Trial Timepoint Data
1     1         0.4
1     2         0.5
...
2     4         0.56

The Data value above is a single number, but it could be, e.g., a 4x5x3 atom.
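
A quick sketch of that flattening (the names are made up, and the scalar data column could be given shape=(4,5,3) for a bigger atom):

    import tables as ts

    class Sample(ts.IsDescription):
        trial     = ts.Int32Col(pos=0)
        timepoint = ts.Int32Col(pos=1)
        data      = ts.Float32Col(pos=2)  # use shape=(4,5,3) here for a 4x5x3 atom

    h5file = ts.openFile('ragged.h5', 'w')
    table = h5file.createTable('/', 'samples', Sample)

    row = table.row
    for trial, values in [(1, [0.4, 0.5, 0.45]),
                          (2, [0.3, 0.4, 0.45, 0.56])]:
        for t, v in enumerate(values):
            row['trial'] = trial
            row['timepoint'] = t + 1
            row['data'] = v
            row.append()
    table.flush()

    # pull one trial's ragged data back out
    print table.readWhere('trial == 2')['data']
    h5file.close()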

If nested VLArrays are supported in PyTables now, I'd certainly love to know!

Alternatively, I think h5py does support the full HDF5 feature set, so if you're really committed to the nested data layout, you may have more luck there. You'll be losing out on a lot of nice features, though! And in my experience, naive neuroscientists end up with quite poor performance, since they don't get PyTables' intelligent choices for data layout, chunking, etc. Please report back if you go that route!
