I am searching for a persistent data storage solution that can handle heterogenous data stored on disk. PyTables seems like an obvious choice, but the only information I can find on how to append new columns is a tutorial example. The tutorial has the user create a new table with added column, copy the old table into the new table, and finally delete the old table. This seems like a huge pain. Is this how it has to be done?
If so, what are better alternatives for storing mixed data on disk that can accommodate new columns with relative ease? I have looked at sqlite3 as well and the column options seem rather limited there, too.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
Yes, you must create a new table and copy the original data. This is because Tables are a dense format. This gives it a huge performance benefits but one of the costs is that adding new columns is somewhat expensive.
thanks for Anthony Scopatz's answer.
I search website and in github, I found someone has shown how to add columns in PyTables. Example showing how to add a column in PyTables
orginal version ,Example showing how to add a column in PyTables, but have some difficulty to migrate.
revised version, Isolated the copying logic, while some terms is deprecated, and it has some minor error in adding new columns.
based on their's contribution, I updated the code for adding new column in PyTables. (Python 3.6, windows)
# -*- coding: utf-8 -*-
"""
PyTables, append a column
"""
import tables as tb
pth='d:/download/'
# Describe a water class
class Water(tb.IsDescription):
waterbody_name = tb.StringCol(16, pos=1) # 16-character String
lati = tb.Int32Col(pos=2) # integer
longi = tb.Int32Col(pos=3) # integer
airpressure = tb.Float32Col(pos=4) # float (single-precision)
temperature = tb.Float64Col(pos=5) # double (double-precision)
# Open a file in "w"rite mode
# if don't include pth, then it will be in the same path as the code.
fileh = tb.open_file(pth+"myadd-column.h5", mode = "w")
# Create a table in the root directory and append data...
tableroot = fileh.create_table(fileh.root, 'root_table', Water,
"A table at root", tb.Filters(1))
tableroot.append([("Mediterranean", 10, 0, 10*10, 10**2),
("Mediterranean", 11, -1, 11*11, 11**2),
("Adriatic", 12, -2, 12*12, 12**2)])
print ("\nContents of the table in root:\n",
fileh.root.root_table[:])
# Create a new table in newgroup group and append several rows
group = fileh.create_group(fileh.root, "newgroup")
table = fileh.create_table(group, 'orginal_table', Water, "A table", tb.Filters(1))
table.append([("Atlantic", 10, 0, 10*10, 10**2),
("Pacific", 11, -1, 11*11, 11**2),
("Atlantic", 12, -2, 12*12, 12**2)])
print ("\nContents of the original table in newgroup:\n",
fileh.root.newgroup.orginal_table[:])
# close the file
fileh.close()
#%% Open it again in append mode
fileh = tb.open_file(pth+"myadd-column.h5", "a")
group = fileh.root.newgroup
table = group.orginal_table
# Isolated the copying logic
def append_column(table, group, name, column):
"""Returns a copy of `table` with an empty `column` appended named `name`."""
description = table.description._v_colObjects.copy()
description[name] = column
copy = tb.Table(group, table.name+"_copy", description)
# Copy the user attributes
table.attrs._f_copy(copy)
# Fill the rows of new table with default values
for i in range(table.nrows):
copy.row.append()
# Flush the rows to disk
copy.flush()
# Copy the columns of source table to destination
for col in descr:
getattr(copy.cols, col)[:] = getattr(table.cols, col)[:]
# choose wether remove the original table
# table.remove()
return copy
# Get a description of table in dictionary format
descr = table.description._v_colObjects
descr2 = descr.copy()
# Add a column to description
descr2["hot"] = tb.BoolCol(dflt=False)
# append orginal and added data to table2
table2 = append_column(table, group, "hot", tb.BoolCol(dflt=False))
# Fill the new column
table2.cols.hot[:] = [row["temperature"] > 11**2 for row in table ]
# Move table2 to table, you can use the same name as original one.
table2.move('/newgroup','new_table')
# Print the new table
print ("\nContents of the table with column added:\n",
fileh.root.newgroup.new_table[:])
# Finally, close the file
fileh.close()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With