Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Under what circumstances would likelihood() be useful?

Reading through the sqlite documentation I found the following function:

http://www.sqlite.org/lang_corefunc.html#likelihood

The likelihood(X,Y) function returns argument X unchanged. The value Y in likelihood(X,Y) must be a floating point constant between 0.0 and 1.0, inclusive. The likelihood(X) function is a no-op that the code generator optimizes away so that it consumes no CPU cycles during run-time (that is, during calls to sqlite3_step()). The purpose of the likelihood(X,Y) function is to provide a hint to the query planner that the argument X is a boolean that is true with a probability of approximately Y. The unlikely(X) function is short-hand for likelihood(X,0.0625).

Assuming that i know that 1 will return 75% of the time, how would:

select likelihood(x,.75) 

help the query optimizer?

like image 339
metrix Avatar asked Jan 07 '14 20:01

metrix


1 Answers

The original example was this:

Consider the following schema and query:

CREATE TABLE composer(
  cid INTEGER PRIMARY KEY,
  cname TEXT
);
CREATE TABLE album(
  aid INTEGER PRIMARY KEY,
  aname TEXT
);
CREATE TABLE track(
  tid INTEGER PRIMARY KEY,
  cid INTEGER REFERENCES composer,
  aid INTEGER REFERENCES album,
  title TEXT
);
CREATE INDEX track_i1 ON track(cid);
CREATE INDEX track_i2 ON track(aid);

SELECT DISTINCT aname
  FROM album, composer, track
 WHERE cname LIKE '%bach%'
   AND composer.cid=track.cid
   AND album.aid=track.aid;

The schema is for a (simplified) music catalog application, though similar kinds of schemas come up in other situations. There is a large number of albums. Each album contains one or more tracks. Each track has a composer. Each composer might be associated with multiple tracks.

The query asks for the name of every album that contains a track with a composer whose name matches '%bach%'.

The query planner needs to choose among several alternative algorithms for this query. The best choices hinges on how well the expression "cname LIKE '%bach%'" filters the results. Let's give this expression a "filter value" which is a number between 1.0 and 0.0. A value of 1.0 means that cname LIKE '%bach%' is true for every row in the composer table. A value of 0.0 means the expression is never true.

The current query planner (in version 3.8.0) assumes a filter value of 1.0. In other words, it assumes that the expression is always true. The planner is assuming the worst case so that it will pick a plan that minimizes worst case run-time. That's a safe approach, but it is not optimal. The plan chosen for a filter of 1.0 is track-album-composer. That means that the "track" table is in the outer loop. For each row of track, an indexed lookup occurs on album. And then an indexed lookup occurs on composer, then the LIKE expression is run to see if the album name should be output.

A better plan would be track-composer-album. This second plan avoids the album lookup if the LIKE expression is false. The current planner would choose this second algorithm if the filter value was just slightly less than 1.0. Say 0.99. In other words, if the planner thought that the LIKE expression would be false for 1 out of every 100 rows, then it would choose the second plan. That is the correct (fastest) choice for when the filter value is large.

But in the common case of a music library, the filter value is probably much closer to 0.0 than it is to 1.0. In other words, the string "bach" is unlikely to be found in most composer names. And for values near 0.0, the best plan is composer-track-album. The composer-track-album plan is to scan the composer table once looking for entries that match '%bach%" and for each matching entry use indices to look up the track and then the album. The current 3.8.0 query planner chooses this third plan when the filter value is less than about 0.1.

The likelihood functions gives the database a (hopefully) better estimate of the selectivity of a filter. With the example query, it would look like this:

SELECT DISTINCT aname
  FROM album, composer, track
 WHERE likelihood(cname LIKE '%bach%', 0.05)
   AND composer.cid=track.cid
   AND album.aid=track.aid;
like image 134
CL. Avatar answered Sep 22 '22 23:09

CL.