I have a Python program that uses PyTables and queries a table in this simple manner:
def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)
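(As an aside: Table.where resolves a name like somevar from the calling frame. The variable can also be passed explicitly through the documented condvars parameter; the line below is an equivalent sketch of the same query.)

    rows = table.where("colname == somevar", condvars={'somevar': somevar})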
To reduce the query time, I decided to try to sort the table with table.copy(sortby='colname'). This indeed improved the query time (spent in where), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?
This slowdown occurs only when there is another column in the table, and the slowdown increases with the element size of that other column.
To help me understand the problem and make sure it was not related to something else in my program, I made this minimal working example that reproduces it:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tables
import time
import sys

def create_set(sort, withdata):
    #Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)

    #Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()

    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()

    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')

    fp.flush()
    return fp

def get_element(table, i):
    #By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None

sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'

fp = create_set(sort, withdata)
start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()
And here is some console output showing the figures:
$ ./timedset.py nosort nodata
Queried the set in 0.718s
$ ./timedset.py sort nodata
Queried the set in 0.003s
$ ./timedset.py nosort withdata
Queried the set in 0.597s
$ ./timedset.py sort withdata
Queried the set in 5.846s
Some notes:

- If I don't use the next() function but instead loop with for row in rows and trust that there is only one result, the slowdown still occurs.

Accessing an element from a table by some sort of id (sorted or not) sounds like a basic feature; I must be missing the typical way of doing it with PyTables. What is it? And why such a terrible slowdown? Is it a bug that I should report?
I finally understood what's going on.
The root cause is a bug and it was on my side: I was not flushing the data before making the copy in the sort case. As a result, the copy was based on incomplete data, and so was the new sorted table. This is what caused the slowdown, and flushing at the appropriate points led to a much less surprising result:
...
#Fill the table with sorted ids
row = table.row
for i in xrange(500):
    row['id'] = i
    row.append()
fp.flush() # <--

#Force a sort if asked for
if sort:
    newtable = table.copy(newname='sortedset', sortby='id')
    table.remove()
    newtable.rename('myset')

fp.flush() # <--
return fp
...
I realized my mistake when I decided to inspect and compare the structure and data of the "not sorted" vs. "sorted" tables. I noticed that in the sorted case, the table had fewer rows. The number varied seemingly randomly from 0 to about 450, depending on the size of the data column. Moreover, in the sorted table, the id of all the rows was set to 0. I guess that when creating a table, PyTables initializes the columns and may or may not pre-create some of the rows with some initial value. Whether it does probably depends on the size of the row and the computed chunk size.
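An inspection along these lines makes the truncation visible (a minimal sketch, reusing the tmp.h5 file and the myset table from the example above):

    import tables

    fp = tables.open_file('tmp.h5', mode='r')
    table = fp.root.myset
    print(table.nrows)       # fewer than 500 rows in the broken sorted case
    print(table.chunkshape)  # the chunk size PyTables computed for this row size
    print(table[:5]['id'])   # all ids come out as 0 in the broken sorted table
    fp.close()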
As a result, when querying the sorted table, all queries but the one with id == 0 had no result. I initially thought that raising and catching the StopIteration error was what caused the slowdown, but that would not explain why the slowdown depends on the size of the data column.
After reading some of the code from PyTables (notably table.py and tableextension.pyx), I think what happens is the following: when a column is indexed, PyTables will first try to use this index to speed up the search. If some matching rows are found, only those rows are read. But if the index indicates that no row matches the query, for some reason PyTables falls back to an "in kernel" search, which iterates over and reads all the rows. This requires reading the full rows from disk in multiple I/Os, which is why the size of the data column mattered. Also, below a certain size of that column, PyTables did not "pre-create" any rows on disk, resulting in a sorted table with no rows at all. This is why on the graph the search is very fast when the column size is under 525: iterating over 0 rows doesn't take much time.
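One way to check which search path a query will take is Table.will_query_use_indexing, which returns the set of indexed columns the query can use (a short sketch, again assuming the table from the example above):

    i = 250
    # An empty result means the index will not be used and PyTables
    # will fall back to an "in kernel" scan of all the rows.
    print(table.will_query_use_indexing('id == i'))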
I am not clear on why the iterator falls back to an "in kernel" search. If the searched id is clearly out of the index bounds, I don't see any reason to search for it anyway... Edit: after a closer look at the code, it turns out this is because of a bug. It is present in the version I am using (3.1.1), but has been fixed in 3.2.0.
What really makes me cry is that I forgot to flush before copying only in the example of the question; in my actual program, this bug is not present! What I also did not know, but found out while investigating the question, is that by default PyTables does not propagate indexes when copying a table. This has to be requested explicitly with propindexes=True. This is why the search was slower after sorting in my application...
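So in the create_set function above, the copy should presumably be made like this to keep the CSI index on the new table (a sketch based on the documented propindexes option of Table.copy):

    ...
    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id',
                              propindexes=True)  # <-- propagate the CSI index
        table.remove()
        newtable.rename('myset')
    ...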
So, moral of the story: flush when the data needs to be complete, and pass propindexes=True when copying a table whose indexes you want to keep.