I have a very large database - I'm working with a subset that is 350m rows, but ultimately it will be around 3b rows. My entire goal here is to optimize a particular type of query on this database, at the expense of pretty much everything but memory. The db file I'm working with right now is compressed with blosc at level 1, on PyTables version 2.3.1 (I could update, if that would help). Each row has thirteen entries - a typical entry looks like this:
['179', '0', '1', '51865852', '51908076', '42224', '22', '2', '20', '22', '2', '0.0516910530103', '0.0511359922511']
They're all numerical, but not necessarily the same type. I'm currently storing them in a PyTables Table, with this definition:
import tables

# Row description for the haplotype table
class HapRow(tables.IsDescription):
    ind = tables.UInt16Col(pos=0)
    hap = tables.UInt8Col(pos=1)
    chrom = tables.UInt8Col(pos=2)
    hap_start = tables.Int32Col(pos=3)
    hap_end = tables.Int32Col(pos=4)
    hap_len = tables.Int16Col(pos=5)
    mh_sites = tables.Int16Col(pos=6)
    mh_alt = tables.Int16Col(pos=7)
    mh_n_ref = tables.Int16Col(pos=8)
    all_sites = tables.Int16Col(pos=9)
    all_alt = tables.Int16Col(pos=10)
    freq = tables.Float32Col(pos=11)
    std_dev = tables.Float32Col(pos=12)
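For completeness, the file and table themselves are created with blosc at level 1. A minimal sketch of that setup (the file name, table name, and expectedrows value are just placeholders; on PyTables 2.x the camelCase spellings openFile/createTable apply):

filters = tables.Filters(complevel=1, complib='blosc')   # blosc, compression level 1
h5 = tables.open_file('haplotypes.h5', mode='w')
hap_table = h5.create_table('/', 'haplotypes', HapRow, filters=filters,
                            expectedrows=350000000)      # helps PyTables pick a sensible chunkshape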
I don't really care how long it takes to set up this database - I'll ultimately be creating it once and then just accessing it. My queries are of the form:
a = [ x[:] for x in hap_table.where('''(mh_sites == 15) & (hap_len > 25000) & (hap_len < 30000) & (freq > .38) & (freq < .4) & (std_dev > .3) & (std_dev < .4)''')]
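Since only the bounds change between queries, the same condition can also be written with condition variables and fetched in one call via read_where, which returns a NumPy structured array (just a sketch; the variable names are arbitrary, and on PyTables 2.x the method is spelled readWhere):

# Same query as above, with the bounds passed in as condition variables
bounds = dict(sites=15, len_lo=25000, len_hi=30000,
              f_lo=0.38, f_hi=0.40, sd_lo=0.3, sd_hi=0.4)
a = hap_table.read_where('(mh_sites == sites) & (hap_len > len_lo) & (hap_len < len_hi) & '
                         '(freq > f_lo) & (freq < f_hi) & (std_dev > sd_lo) & (std_dev < sd_hi)',
                         condvars=bounds)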
Basically I'm searching for entries that match a particular row around a given tolerance. On my smaller database (350m rows), that query takes 38 seconds if I've indexed all four columns that I'm searching on:
byteorder := 'little'
chunkshape := (32768,)
autoIndex := True
colindexes := {
"hap_len": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"freq": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"std_dev": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"mh_sites": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
and 10 seconds if I do not index beforehand. I'm not sure I understand why the query is slower on the indexed database; maybe the indexing creates overhead that isn't necessary?
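For reference, the per-column indexes shown above come from calls along these lines (a sketch; on PyTables 2.x these are spelled createIndex/createCSIndex, on 3.x create_index/create_csindex, and the defaults produce the optlevel-6 "medium" indexes in the listing):

# Default (medium, non-CSI) indexes on the four columns the query filters on
hap_table.cols.mh_sites.create_index()
hap_table.cols.hap_len.create_index()
hap_table.cols.freq.create_index()
hap_table.cols.std_dev.create_index()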
Like I said, my goal is to optimize this type of query as much as possible, at the expense of basically everything except memory usage (I'd like to stay around 2 GB and really don't want to go above about 5 GB). I've tried indexing and it doesn't seem to help. All of my queries are on a single value of mh_sites, and there are only about 100 possible values, so I've thought about splitting the data into multiple tables so that I'm only searching a subset of it at any one time (although I'm not entirely sure how to do that, other than mydata.root.table_1, mydata.root.table_2, etc. - a rough sketch of what I mean is below). I've also thought about storing everything as a single array instead - maybe a float array, converting the other columns back to ints when I need to use them. If it makes a difference, my queries usually return between 20k and 500k results.
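Something like this is what I have in mind for the split (just a sketch; the group and table names are made up, and HapRow/filters are the description and Filters objects from above):

# One table per mh_sites value, e.g. /by_mh_sites/mh_15, /by_mh_sites/mh_16, ...
mh_values = range(1, 101)   # placeholder for the ~100 observed mh_sites values
group = h5.create_group('/', 'by_mh_sites')
for value in mh_values:
    h5.create_table(group, 'mh_%d' % value, HapRow, filters=filters)

# A query for mh_sites == 15 would then only touch that one table:
sub = h5.root.by_mh_sites.mh_15
a = sub.read_where('(hap_len > 25000) & (hap_len < 30000) & (freq > .38) & (freq < .4) & '
                   '(std_dev > .3) & (std_dev < .4)')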
Any suggestions on optimizing this query?
I figured out how to make this a lot faster - and my solution might help other people so I'm posting it here.
I was confused about how indexing in PyTables works. I assumed that a CSI (completely sorted index) would actually sort the data on disk, but that's not the case - once you append a set of rows, they stay in that order. For me it was well worth the overhead to sort the data before inserting it: my query times decreased by 1-2 orders of magnitude.
This also explains why indexing the table actually increased the query time: because the rows were essentially randomly distributed, every query had to read the entire database. It didn't matter that PyTables could use the index to figure out which chunks it actually needed, because it had to read all of them anyway, so with unsorted data the index just added overhead. With a sorted table, the index definitely helps.
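Concretely, the load step now sorts on the queried columns before appending, and only then builds the indexes. A rough sketch of that workflow (assuming the rows fit in memory as a NumPy structured array - for the full 3b-row dataset the sort would have to be done externally or in chunks; the file names are placeholders and HapRow is the description class from the question):

import numpy as np
import tables

rows = np.load('haplotypes_unsorted.npy')         # placeholder: the unsorted data as a structured array
rows.sort(order=['mh_sites', 'hap_len', 'freq'])  # cluster the rows the queries care about

h5 = tables.open_file('haplotypes_sorted.h5', mode='w')
hap_table = h5.create_table('/', 'haplotypes', HapRow,
                            filters=tables.Filters(complevel=1, complib='blosc'),
                            expectedrows=len(rows))
hap_table.append(rows)    # rows are now stored in sorted order
hap_table.flush()

# With sorted data, matching rows sit in a few chunks, so the indexes pay off
for name in ('mh_sites', 'hap_len', 'freq', 'std_dev'):
    hap_table.cols._f_col(name).create_index()
h5.close()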