Data structure or algorithm for second degree lookups in sub-linear time?

Is there any way to select a subset from a large set based on a property or predicate in less than O(n) time?

For a simple example, say I have a large set of authors. Each author has a one-to-many relationship with a set of books, and a one-to-one relationship with a city of birth.

Is there a way to efficiently do a query like "get all books by authors who were born in Chicago"? The only way I can think of is to first select all authors from the city (fast with a good index), then iterate through them and accumulate all their books (O(n) where n is the number of authors from Chicago).

I know databases do something like this in certain joins, and Endeca claims to be able to do this "fast" using what they call "Record Relationship Navigation", but I haven't been able to find anything about the actual algorithms used or even their computational complexity.

I'm not particularly concerned with the exact data structure... I'd be jazzed to learn how to do this in an RDBMS, or a key/value store, or just about anything.

Also, what about third- or fourth-degree requests of this nature? (Get me all the books written by authors living in cities with immigrant populations greater than 10,000...) Is there a generalized n-degree algorithm, and what are its performance characteristics?

Edit:

I am probably just really dense, but I don't see how the inverted index suggestion helps. For example, say I had the following data:

DATA
1.  Milton        England
2.  Shakespeare   England
3.  Twain         USA

4.  Milton        Paradise Lost
5.  Shakespeare   Hamlet
6.  Shakespeare   Othello
7.  Twain         Tom Sawyer
8.  Twain         Huck Finn

INDEX
"Milton"         (1, 4)
"Shakespeare"    (2, 5, 6)
"Twain"          (3, 7, 8)
"Paridise Lost"  (4)
"Hamlet"         (5)
"Othello"        (6)
"Tom Sawyer"     (7)
"Huck Finn"      (8)
"England"        (1, 2)
"USA"            (3)

Say I did my query on "books by authors from England". Very quickly, in O(1) time via a hashtable, I could get my list of authors from England: (1, 2). But then, in order to retrieve the books, I'd have to do ANOTHER O(1) lookup for EACH element of the set {1, 2}: 1 -> {4}, 2 -> {5, 6}, and then take the union of the results, {4, 5, 6}.

Or am I missing something? Perhaps you meant I should explicitly store an index entry linking Book to Country. That works for very small data sets. But for a large data set, the number of indexes required to match any possible combination of queries would make the index grow exponentially.
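To make the concern above concrete, here is a minimal Python sketch of the two-step inverted-index lookup being described. All the record ids match the example data; the dictionary names (`index`, `author_of`, `books_of`) are illustrative, not part of the original post.

```python
# Inverted index from the example: each term maps to a set of record ids.
index = {
    "Milton":      {1, 4},
    "Shakespeare": {2, 5, 6},
    "Twain":       {3, 7, 8},
    "England":     {1, 2},
    "USA":         {3},
}
author_of = {1: "Milton", 2: "Shakespeare", 3: "Twain"}        # author records
books_of = {"Milton": {4}, "Shakespeare": {5, 6}, "Twain": {7, 8}}

def books_by_country(country):
    result = set()
    # Step 1: a single O(1) hash lookup gets the matching author records.
    for author_id in index[country]:
        # Step 2: one more O(1) lookup PER author, then a union --
        # so the total work is linear in the number of matching authors.
        result |= books_of[author_of[author_id]]
    return result

print(sorted(books_by_country("England")))  # [4, 5, 6]
```

This is exactly the per-author iteration the question is asking how to avoid: the second step costs O(k) where k is the number of authors matching the first lookup.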

asked Jan 09 '09 by levand
2 Answers

For joins like this on large data sets, a modern RDBMS will often use an algorithm called a list merge. Using your example:

  1. Prepare a list, A, of all authors who live in Chicago and sort them by author in O(Nlog(N)) time.*
  2. Prepare a list, B, of all (author, book name) pairs and sort them by author in O(Mlog(M)) time.*
  3. Place these two lists "side by side", and compare the authors from the "top" (lexicographically minimum) element in each pile.
    • Are they the same? If so:
      • Output the (author, book name) pair from top(B)
      • Remove the top element of the B pile
      • Goto 3.
    • Otherwise, is top(A).author < top(B).author? If so:
      • Remove the top element of the A pile
      • Goto 3.
    • Otherwise, it must be that top(A).author > top(B).author:
      • Remove the top element of the B pile
      • Goto 3.

* (Or at no sorting cost at all, if the table is already sorted by author or has an index that is.)

The loop continues removing one item at a time until both piles are empty, thus taking O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two "piles" are sorted by author, this algorithm will discover every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).
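The steps above can be sketched in a few lines of Python. The function name and sample data are illustrative; the logic follows the merge loop described: on a match, emit and advance B (so one author can match many books), otherwise advance whichever pile has the smaller author.

```python
def merge_join(authors, pairs):
    """authors: list of author names; pairs: list of (author, book) tuples."""
    A = sorted(authors)                  # O(N log N) sort of pile A
    B = sorted(pairs)                    # O(M log M); tuples sort by author first
    out = []
    i = j = 0
    while i < len(A) and j < len(B):     # O(N + M) merge phase
        if A[i] == B[j][0]:
            out.append(B[j])             # authors match: output the pair,
            j += 1                       # remove top of B, keep top of A
        elif A[i] < B[j][0]:
            i += 1                       # top(A) is smaller: remove top of A
        else:
            j += 1                       # top(B) is smaller: remove top of B
    return out

chicago_authors = ["Bellow", "Hansberry"]
books = [("Bellow", "Herzog"),
         ("Hansberry", "A Raisin in the Sun"),
         ("Twain", "Tom Sawyer")]
print(merge_join(chicago_authors, books))
# [('Bellow', 'Herzog'), ('Hansberry', 'A Raisin in the Sun')]
```

Note the loop only ever advances one of the two cursors, which is where the O(N + M) bound comes from.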

Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS's query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.

answered Sep 30 '22 by j_random_hacker


SELECT a.*, b.*
   FROM Authors AS a, Books AS b
   WHERE a.author_id = b.author_id
     AND a.birth_city = 'Chicago'
     AND a.birth_state = 'IL';

A good optimizer will process that in less than the time it would take to read the whole list of authors and the whole list of books, which is therefore sub-linear time. (If you have a different definition of sub-linear in mind, say so.)

Note that the optimizer should be able to choose the order in which to process the tables that is most advantageous. And this applies to N-level sets of queries.
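One strategy an optimizer might choose for this query, besides the sort-merge join in the other answer, is a hash join: build a hash set from the (smaller, filtered) author side, then stream the book side once. A minimal sketch, with illustrative data and function names:

```python
def hash_join(authors, books):
    """authors: list of (author_id, birth_city); books: list of (author_id, title)."""
    # Build phase: hash the filtered author side -- O(N).
    chicago = {aid for aid, city in authors if city == "Chicago"}
    # Probe phase: scan the book side once, probing the hash set -- O(M).
    return [(aid, title) for aid, title in books if aid in chicago]

authors = [(1, "Chicago"), (2, "Boston"), (3, "Chicago")]
books = [(1, "Book A"), (2, "Book B"), (3, "Book C"), (3, "Book D")]
print(hash_join(authors, books))
# [(1, 'Book A'), (3, 'Book C'), (3, 'Book D')]
```

Total cost is O(N + M) with no sorting required, which is why an optimizer may prefer it when neither side is already sorted or indexed on the join key.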

answered Sep 30 '22 by Jonathan Leffler