 

Handling large number of ids in Solr

Tags: solr, solr4

I need to perform an "online" search in Solr, i.e. a user needs to find the list of users who are currently online and match particular criteria.

How I am handling this now: we store the ids of online users in a table, and I send all online user ids in the Solr request, like this:

&fq=-id:(id1 id2 id3 ............id5000)

The problem with this approach is that when the list of ids becomes large, Solr takes too long to resolve the query, and we have to transfer a large request over the network.

One solution could be to use a join in Solr, but the online data changes regularly and I can't reindex that often (say, every 5-10 minutes; the interval should be at least an hour).

Another solution I can think of is firing this query internally from Solr, based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.

chicharito asked May 01 '13 09:05


2 Answers

With Solr4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record, and just have &fq=online:true on your query. That reduces the overhead involved in sending 5000 ids over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
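As a rough illustration of the update side, the sketch below builds the request for flipping one user's flag. It assumes a boolean field named online, a core called users at localhost:8983, and Solr 4 atomic updates (the "set" modifier, which needs stored fields and the update log enabled); none of these names come from the question. It only builds the URL and JSON body, which would then be POSTed with Content-Type: application/json. Depending on the Solr version, the JSON handler may live at /update/json instead.

```python
import json
from urllib.parse import urlencode


def build_online_update(solr_url, user_id, online, commit_within_ms=1000):
    """Return (url, body) for an atomic update of one user's online flag.

    solr_url, the "online" field name and the 1000 ms default are
    illustrative assumptions, not values from the original post.
    """
    # commitWithin asks Solr to make the change visible within N milliseconds
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{solr_url}/update?{query}"
    # "set" replaces just this field instead of resending the whole document
    body = json.dumps([{"id": user_id, "online": {"set": online}}])
    return url, body


# Hypothetical core URL and user id
url, body = build_online_update("http://localhost:8983/solr/users", "42", True)
```

The search request then only needs &fq=online:true alongside the other criteria, instead of the long id list.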

samkass answered Sep 22 '22 10:09


We worked around this issue by sharding the data.

Basically, without going heavily into code detail:

  • Write your own indexing code
    • use consistent hashing to decide which ID goes to which Solr server
    • index each user's data to the relevant shard (a shard can span several machines)
    • make sure you have redundancy
  • Query the Solr shards
    • do sharded queries in Solr using the shards parameter
    • start an EmbeddedSolr and use it to do the sharded query
    • Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard
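The routing step above can be sketched roughly as follows. This is a generic consistent-hash ring, not the code we actually ran; the shard addresses, the MD5 hash and the number of virtual nodes are all assumptions picked for illustration. The last helper builds the comma-separated value that Solr's shards parameter expects.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shards, replicas=100):
        # Each shard is placed on the ring `replicas` times to even out load
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, doc_id):
        # First ring point clockwise from the id's hash, wrapping at the end
        idx = bisect.bisect_right(self._keys, self._hash(str(doc_id)))
        return self._ring[idx % len(self._ring)][1]


def shards_param(shards):
    """Value for the Solr `shards` request parameter."""
    return ",".join(shards)


# Hypothetical shard addresses
SHARDS = ["solr1:8983/solr", "solr2:8983/solr", "solr3:8983/solr"]
ring = HashRing(SHARDS)
target = ring.shard_for("user-1")  # index this user's document on `target`
```

The point of the ring over a plain hash(id) % n is that removing or adding a server only moves the ids that were on the affected part of the ring, so most documents stay where they are.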

Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited for searches on indexes that are constantly changing, and if you mainly search by IDs, then a search engine is not needed.

For our project we basically implemented all the index building, load balancing and query engine ourselves, and use Solr mostly as storage. But we started using Solr when sharding was flaky and not performant; I am not sure what the state of it is today.

Last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache (say memcached or redis) to store all the users that are currently online, and at request time I would simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally. Also, iterating over 5000 records is not necessarily very time consuming if the matching logic is very simple.
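To give an idea of how little code the cache approach needs, here is a minimal sketch. A plain dict stands in for memcached or redis, and the profile fields (age, city) and the criteria are made up for the example.

```python
# Stand-in for the cache: user id -> profile of every user currently online.
# In a real setup this would live in memcached or redis; the fields are
# hypothetical.
online_users = {
    "u1": {"age": 25, "city": "Pune"},
    "u2": {"age": 41, "city": "Delhi"},
    "u3": {"age": 30, "city": "Pune"},
}


def mark_online(cache, user_id, profile):
    """Call on login."""
    cache[user_id] = profile


def mark_offline(cache, user_id):
    """Call on logout or session timeout."""
    cache.pop(user_id, None)


def find_online(cache, predicate):
    """Linear scan over the online users, keeping those that match."""
    return sorted(uid for uid, profile in cache.items() if predicate(profile))


matches = find_online(
    online_users, lambda p: p["city"] == "Pune" and p["age"] < 35
)
```

With a few thousand online users and a simple predicate like this one, the scan is a single pass over a small in-memory collection.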

Asaf answered Sep 22 '22 10:09