Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What makes a good autowarming query in Solr and how do they work?

This question is a follow up to this question about infrequent, isolated read timeouts in a solr installation.

As a possible problem missing / bad autowarming queries for new searchers were found.

Now I am confused about how good autowarming queries should "look like".

I read up but couldnt find any good documentation on this.

Should they hit a lot of documents in the index? Or should they have matches in all distinct fields that exist in the index?

Wouldnt just *:* be the best autowarming query or why not?

The example solr config has theese sample queries in it:

<lst><str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str></lst>
<lst><str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str></lst>

I changed them to:

<lst><str name="q">george</str> <str name="start">0</str> <str name="rows">10</str></lst>

Why? Because the index holds film entities with fields for titles and actors. Those are the most searched ones. And george appears in titles and actors.

I don't really know whether this makes sense. So my question is:

  • What would be good autowarming queries for my index and why?
  • What makes a good autowarming query?

This is an example document from the index. The index has about 70,000 documents and they all look like this (only different values of course): example document:

 <doc> 
  <arr name="actor"><str>Tommy Lee Jones</str><str>Will Smith</str><str>Rip Torn</str> 
    <str>Lara Flynn Boyle</str><str>Johnny Knoxville</str><str>Rosario Dawson</str><str>Tony Shalhoub</str> 
    <str>Patrick Warburton</str><str>Jack Kehler</str><str>David Cross</str><str>Colombe Jacobsen-Derstine</str> 
    <str>Peter Spellos</str><str>Michael Rivkin</str><str>Michael Bailey Smith</str><str>Lenny Venito</str> 
    <str>Howard Spiegel</str><str>Alpheus Merchant</str><str>Jay Johnston</str><str>Joel McKinnon Miller</str> 
    <str>Derek Cecil</str></arr> 
  <arr name="affiliate"><str>amazon</str></arr> 
  <arr name="aka_title"><str>Men in Black II</str><str>MIB 2</str><str>MIIB</str> 
    <str>Men in Black 2</str><str>Men in black II (Hombres de negro II)</str><str>Hombres de negro II</str><str>Hommes en noir II</str></arr> 
  <bool name="blockbuster">false</bool> 
  <arr name="country"><str>US</str></arr> 
  <str name="description">Agent J (Will Smith) muss die Erde wieder vor einigem Abschaum bewahren, denn in Gestalt des verführerischen Dessous-Models Serleena (Lara Flynn Boyle) will ein Alien den Planeten unterjochen. Dabei benötigt J die Hilfe seines alten Partners Agent K (Tommy Lee Jones). Der wurde aber bei seiner "Entlassung" geblitzdingst, und so muß J seine Erinnerung erst mal etwas auffrischen bevor es auf die Jagd gehen kann.</str> 
  <arr name="director"><str>Barry Sonnenfeld</str></arr> 
  <int name="film_id">120912</int> 
  <arr name="genre"><str>Action</str><str>Komödie</str><str>Science Fiction</str></arr> 
  <str name="id">120912</str> 
  <str name="image_url">/media/search/filmcovers/105x/kf/false/F6Q1XW.jpg</str> 
  <int name="imdb_id">120912</int> 
  <date name="last_modified">2011-03-01T18:51:35.903Z</date> 
  <str name="locale_title">Men in Black II</str> 
  <int name="malus">3238</int> 
  <int name="parent_id">0</int> 
  <arr name="product_dvd"><str>amazon</str></arr> 
  <arr name="product_type"><str>dvd</str></arr> 
  <int name="rating">49</int> 
  <str name="sort_title">meninblack</str> 
  <int name="type">1</int> 
  <str name="url">/film/Men-in-Black-II-Barry-Sonnenfeld-Tommy-Lee-Jones-F6Q1XW/</str> 
  <int name="year">2002</int> 
 </doc> 

Most queries are exact match queries on actor fields with some filters in place.

Example:

INFO: [] webapp=/solr path=/select/ params={facet=true&sort=score+asc,+malus+asc,+year+desc&hl.simple.pre=starthl&hl=true&version=2.2&fl=*,score&facet.query=year:[1900+TO+1950]&facet.query=year:[1951+TO+1980]&facet.query=year:[1981+TO+1990]&facet.query=year:[1991+TO+2000]&facet.query=year:[2001+TO+2011]&bf=div(sub(10000,malus),100)^10&hl.simple.post=endhl&facet.field=genre&facet.field=country&facet.field=blockbuster&facet.field=affiliate&facet.field=product_type&qs=5&qt=dismax&hl.fragsize=200&mm=2&facet.mincount=1&qf=actor^0.1&f.blockbuster.facet.mincount=0&f.genre.facet.limit=20&hl.fl=actor&wt=json&f.affiliate.facet.mincount=1&f.country.facet.limit=20&rows=10&pf=actor^5&start=0&q="Josi+Kleinpeter"&ps=3} hits=1 status=0 QTime=4

like image 621
The Surrican Avatar asked Mar 02 '11 16:03

The Surrican


People also ask

How does Solr cache work?

Caching in Solr differs from caching in many other applications in that cached Solr objects do not expire after a time interval; instead, they remain valid for the lifetime of the Index Searcher. When a new searcher is opened, the current searcher continues servicing requests while the new one auto-warms its cache.

How do I clear Solr cache?

Go to Core Admin and click Reload. When refreshed, you should see a green check mark on the "current" field. Show activity on this post. Disable all the caches from solrconfig.

Why Solr is fast?

For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster.


1 Answers

There are 2 types of warming. Query cache warming and document cache warming (There's also filters, but those are similar to queries). Query cache warming can be done through a setting which will just re-run X number of recent queries before the index was reloaded. Document cache warming is different.

The goal of document cache warming is to get a large quantity of your most frequently accessed documents into the document caches so they don't have to be read from disk. So, your queries should focus on this. You need to try and figure out what your most frequently searched documents are and load those. Preferably with a minimal number of queries. This has nothing to do with the actual content of the fields. EDIT: To clarify. When warming document caches your primary interest is the documents that turn up in search RESULTS most often, regardless of how they are queried.

Personally, I'd run searches for things like:

  • Loading by country, if most of your searches are for US films.
  • Loading by year, if most of your searches are for more recent films.
  • Loading by genre, if you have a short list of heavily searched genres.

A last possibility is to load them all. Your documents look small. 70,000 of them is nothing in terms of server memory nowadays. If your document cache is large enough, and you have enough memory available, go for it. As a side note, some of your biggest benefit will be from your document cache. A query cache is only beneficial for repeated queries, which can be disappointingly low. You almost always benefit from a large document cache.

like image 170
rfeak Avatar answered Sep 21 '22 07:09

rfeak