Search for a term on amazon.com, for example "stack overflow", and the search results come back very quickly.
On the left-hand side of the window there is a faceted search panel that shows, for certain categories, the count of products that match that term.
You can then drill into those categories. For example, there are 1094 books that match the term, broken down into Computers & Internet (1003), Science, etc.
Given that the search for books covers the contents of some of those books, it strikes me that this is a very impressive feat.
How does Amazon do this? Massive parallelization, e.g. with each node knowing about only a few products?
Incidentally, I saw that "stack overflow" appears in the text of "Soul of a New Machine", a book I remember from 1981.
The short answer is, a lot of indexing. The longer answer is, a lot of indexing, a lot of redundancy, a lot of caching, and smart partitioning.
The real answer is: read this book, Introduction to Information Retrieval: http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
(It's free, and it's very good).
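To make "indexing" and "partitioning" a bit more concrete, here is a minimal sketch in Python. It is not how Amazon actually does it (that isn't public); it only illustrates the general idea of an inverted index split across shards, with each shard counting facets for its own products and the counts merged at the end. The `Shard` class, the helper functions, and the sample products are all made up for illustration, and the query is treated as a simple AND of words rather than a real phrase match.

```python
from collections import Counter, defaultdict


class Shard:
    """One partition: an inverted index over a subset of the products."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of product ids
        self.category = {}                # product id -> category (the facet)

    def add(self, product_id, text, category):
        self.category[product_id] = category
        for term in text.lower().split():
            self.postings[term].add(product_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return set(), Counter()
        # AND semantics: a product must contain every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        # Facet counts are computed locally, only over this shard's hits.
        facets = Counter(self.category[p] for p in hits)
        return hits, facets


def build_shards(products, n_shards):
    """Partition products across shards by id (a stand-in for real partitioning)."""
    shards = [Shard() for _ in range(n_shards)]
    for product_id, text, category in products:
        shards[product_id % n_shards].add(product_id, text, category)
    return shards


def search_all(shards, query):
    """Ask every shard, then merge the hits and the per-category counts."""
    all_hits, all_facets = set(), Counter()
    for shard in shards:  # in a real system these calls would run in parallel
        hits, facets = shard.search(query)
        all_hits |= hits
        all_facets += facets
    return all_hits, all_facets


# Made-up sample data: (product id, searchable text, category).
PRODUCTS = [
    (1, "the soul of a new machine stack overflow", "Computers & Internet"),
    (2, "debugging stack overflow errors in c", "Computers & Internet"),
    (3, "fluid overflow in a sediment stack", "Science"),
    (4, "cooking with cast iron", "Cooking"),
]

if __name__ == "__main__":
    shards = build_shards(PRODUCTS, n_shards=2)
    hits, facets = search_all(shards, "stack overflow")
    print(sorted(hits))
    print(dict(facets))
```

The point of the sketch is that each shard only does a small amount of work (a few set lookups and a count over its own hits), and merging facet counts is just addition, so the work spreads across many machines easily. Caching and redundancy, the other two ingredients mentioned above, would sit on top of this and are not shown.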