Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to filter fields stored in a remote database in solr/lucene?

I have an index of about 100k documents that represent a movie entity.

Users can put films on various lists (like favorites etc.)

These lists are stored in a mysql database and are not indexed in solr.

I could store the user ids in multivalued fields that represent a list, but that is quite bad because the fields would get very, very long and the indexing would be problematic too.

So currently i do the following (pseudocode):

$favorites = SELECT document_id FROM favorites WHERE user_id = $user_id
$documents = 'http://solr.com:8393/select/?q=XYZ&fq=document_id:('.join(' OR ',$favorites);

this works great and fast but the number of items in filter queries is limited to 1024 (i tried that). also filter queries add up. so if i have one filter query with 500 values to filter i can have another values to 524 filters on another field.

It's okay for now because I limited the entries per list to 1024, and that's quite a lot but I think this approach is very clumsy and produces a lot of overhead.

Isn't there a better solution? Like writing a solr module that directly connects to the database or something? I'd like to do it in php.

If there is no other way, can i somehow raise the 1024 limit? because it works very fast now! I think with good hardware more wouldn't be a problem.

Edit: as asked in the comments i here post my original schema and a working example query.

<field name="film_id" type="int" indexed="true" stored="true" required="true"/> 
<field name="imdb_id" type="int" indexed="true" stored="true" /> 
<field name="parent_id" type="int" indexed="true" stored="true"/> 
<field name="malus" type="int" indexed="true" stored="true"/> 
<field name="type" type="int" indexed="true" stored="true"/> 
<field name="year" type="int" indexed="true" stored="true" termVectors="true"/> 
<field name="locale_title" type="string" indexed="false" stored="true"/> 
<field name="aka_title" type="filmtitle" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" /> 
<field name="sort_title" type="string" indexed="true" stored="true"/> 
<field name="director" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="director_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="actor" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="actor_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="country" type="string" indexed="true" stored="true" multiValued="true"/> 
<field name="description" type="text" indexed="true" stored="true" /> 
<field name="genre" type="genre" indexed="true" stored="true" multiValued="true" termVectors="true"/> 
<field name="url" type="string" indexed="true" stored="true" multiValued="false"/> 
<field name="image_url" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="rating" type="int" indexed="true" stored="true" required="false" default="50"/>
<field name="affiliate" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="product_type" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="product_*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="blockbuster" type="boolean" indexed="true" stored="true" /> 
<copyField source="film_id" dest="id"/>
<field name="director_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="actor_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>

theese are my additions to the default schema.xml

a sample search result can be viewed here.

a sample query would be:

http://my-server.com:8983/solr/select/?
q=description:nazis
&fq=product_bluray:amazon
&fq=film_id:(1185616 1054606 88763 361748 78748)

here the user would search for movies that are:

  • available on amazon as a bluray
  • that have the term "nazis" in the description
  • AND that are on his favorite list

the list includes the movies (documents) with the ids 1185616 1054606 88763 361748 78748 and are stored in the mysql database.

ps: I don't know whether I formulated the question well, I hope its understandable. If not, please feel free to edit!

like image 516
The Surrican Avatar asked Apr 13 '11 14:04

The Surrican


1 Answers

Step one is to make sure you really want to use Solr. Looking at your schema, there's an awful lot in there that is susceptible to a normal RDBMS with basic text indexing. Take half an hour and look at postgresql unless you've already determined that a regular good old fashioned RDBMS with some extra bells an whistles just won't do for you.

There's a lot of interest in this problem in the Solr community, but there isn't a real solution.

The obvious approach is to reindex a "favorited" document every time someone favorites it with their username in a multivalued field. This is brain-dead, of course, but that doesn't mean it won't work, depending on how often one of your users mess with his/her favorites list. If your documents are on the small size (I assume they are only a few K) and you have can get enough hardware to keep the whole index in memory (likely since you've only got 100K documents) this might be the approach to consider. You can test it by building an index of a size that you can actually fit into the memory available and implement the strategy. See if it's fast enough.

You may also be able to 'batch' these operations if people don't add a gazillion favorites in one go, like this:

  • Day 1: I add ten items to my favorites. You stick their ID's in a database and use that list of ID's to filter my queries.
  • Night 1: You update all the documents that have been favorited by anyone during the day, adding my username to the "favoritedBy" multiValued field. Remove my favorited list from the DB, since it's now represented in the Solr index itself.
  • Day 2: I add three more items to my favorites. You filter on both favorited:myusername and id:(newID1 OR newID2 or newID3).

This may work for you if people add a reasonable number of favorites per day and you don't have a lot of traffic at night.

like image 111
Bill Dueber Avatar answered Nov 04 '22 07:11

Bill Dueber