I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.
I saw this question, but can't find documentation about using source: :query
, which seems to be part of the solution.
My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612
How can I optimize the generated query to run faster during indexing?
The 'standard' sphinx solution to this would be to use ranged queries.
http://sphinxsearch.com/docs/current.html#ex-ranged-queries
... splitting up the query into lots of small parts, so the database server has a better chance of being able to run the query (rather than one huge query)
But I have no idea how to actully enable that in Thinking Sphinx. Can't see anything in the documentation. Could help you edit the sphinx.conf, but also not sure how TS will cope with you manually editing the config file.
This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main query sql_query
and define it separately as a sql_joined_field
in the sphinx.conf
file.
It's important to add all relevant sql conditions to each sql_joined_field
(such as sharding indexes by modulo on the ID). Here's the new definition:
ThinkingSphinx::Index.define(
:incident,
with: :active_record,
delta?: false,
delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
...
has
...
end
ThinkingSphinx::Index.define(
:incident,
with: :active_record,
delta?: true,
delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
...
has
...
end
The magic that defines the field site
as a separate query is the option source: :query
at the end of the line.
Notice the core index definition has the parameter delta?: false
, while the delta index definition has the parameter delta?: true
. That's so I could use the condition WHERE incidents.delta = 1
in the delta index and filter out irrelevant records.
I found sharding didn't perform any better, so I reverted to one unified index.
See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.
Important to remember!
The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change. This must be updated.
So, in my example, if I added an index for a different model (not :incident
), I would have to run rake ts:configure
to find out my new offset and change incidents.id * 51 + 7
accordingly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With