Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Thinking Sphinx indexing performance

I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.

I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.

My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612

How can I optimize the generated query to run faster during indexing?

like image 479
Jonathan Avatar asked Oct 20 '22 09:10

Jonathan


2 Answers

The 'standard' sphinx solution to this would be to use ranged queries.

http://sphinxsearch.com/docs/current.html#ex-ranged-queries

... splitting up the query into lots of small parts, so the database server has a better chance of being able to run the query (rather than one huge query)

But I have no idea how to actully enable that in Thinking Sphinx. Can't see anything in the documentation. Could help you edit the sphinx.conf, but also not sure how TS will cope with you manually editing the config file.

like image 166
barryhunter Avatar answered Oct 22 '22 21:10

barryhunter


This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main query sql_query and define it separately as a sql_joined_field in the sphinx.conf file.

It's important to add all relevant sql conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:

ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: false,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
    indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
    ...
    has
    ...
end

ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: true,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
    indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
    ...
    has
    ...
end

The magic that defines the field site as a separate query is the option source: :query at the end of the line.

Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.

I found sharding didn't perform any better, so I reverted to one unified index.

See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.

Important to remember!

The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change. This must be updated.

So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.

like image 43
Jonathan Avatar answered Oct 22 '22 23:10

Jonathan