How does Amazon's Statistically Improbable Phrases work?

How does something like Statistically Improbable Phrases work?

According to Amazon:

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

For instance, for Joel's first book, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules

One interesting complication is that these are phrases of either 2 or 3 words, so they can overlap with or contain each other.

asked Jan 05 '10 22:01

ʞɔıu

1 Answer

It's a lot like the way Lucene ranks documents for a given search query. Lucene uses a metric called TF-IDF, where TF is term frequency and IDF is inverse document frequency. The former ranks a document higher the more often the query terms appear in that document, and the latter ranks a document higher if it contains query terms that appear infrequently across all documents. IDF is commonly calculated as log(number of documents / number of documents with the term), i.e. the log of the inverse of the fraction of documents in which the term appears.
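As a rough illustration of the formula above, here is a minimal Python sketch. It assumes raw counts for TF and the plain log ratio for IDF; the function name `tf_idf` and the toy corpus are made up for this example, and this is neither Lucene's exact scoring formula nor Amazon's published method.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Score `term` for `doc` (a list of tokens) against `corpus` (a list of token lists)."""
    tf = Counter(doc)[term]                    # how often the term occurs in this document
    df = sum(1 for d in corpus if term in d)   # how many documents contain the term
    if tf == 0 or df == 0:
        return 0.0                             # term is absent, so it gets no score
    idf = math.log(len(corpus) / df)           # rarer across the corpus -> larger IDF
    return tf * idf

# Hypothetical toy corpus: each "document" is a list of words.
corpus = [
    ["daily", "builds", "catch", "bugs", "daily"],
    ["tax", "forms", "are", "due"],
    ["bugs", "in", "tax", "software"],
]

print(tf_idf("daily", corpus[0], corpus))  # appears twice here and in no other document
print(tf_idf("bugs", corpus[0], corpus))   # appears once here and in one other document
```

A term that is frequent in one document but rare elsewhere ("daily") scores higher than one that is spread across documents ("bugs").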

So in your example, those phrases are SIPs relative to Joel's book because they are rare phrases (appearing in few books) and they appear multiple times in his book.

Edit: in response to the question about 2-grams and 3-grams, overlap doesn't matter. Consider the sentence "my two dogs are brown". Here, the list of 2-grams is ["my two", "two dogs", "dogs are", "are brown"], and the list of 3-grams is ["my two dogs", "two dogs are", "dogs are brown"]. As I mentioned in my comment, with overlap you get N-1 2-grams and N-2 3-grams for a stream of N words. Because 2-grams can only equal other 2-grams and likewise for 3-grams, you can handle each of these cases separately. When processing 2-grams, every "word" will be a 2-gram, etc.
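The idea of treating each n-gram as a "word" can be sketched as follows. This is only an illustration under the same assumptions as the formula above (raw counts times log of the inverse document fraction); the helper names `ngrams` and `top_phrases` and the sample books are hypothetical, and the book being scored is assumed to be part of the corpus.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Return the overlapping n-grams of a word list, e.g. N-1 2-grams for N words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_phrases(book, corpus, n, k=3):
    """Rank the n-grams of `book` by (count in book) * log(books / books containing it)."""
    book_counts = Counter(ngrams(book, n))
    corpus_sets = [set(ngrams(b, n)) for b in corpus]
    scores = {}
    for phrase, tf in book_counts.items():
        # Guard with max(..., 1) in case `book` is not actually in `corpus`.
        df = max(sum(1 for s in corpus_sets if phrase in s), 1)
        scores[phrase] = tf * math.log(len(corpus) / df)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

print(ngrams("my two dogs are brown".split(), 2))
print(ngrams("my two dogs are brown".split(), 3))

# Hypothetical toy "books".
books = [
    "daily builds and daily builds fix the bug count".split(),
    "the bug count rose again".split(),
    "tax forms and tax rules".split(),
]
print(top_phrases(books[0], books, n=2, k=1))  # 2-grams are scored only against 2-grams
print(top_phrases(books[0], books, n=3, k=1))  # 3-grams are handled in a separate pass
```

Running the 2-gram and 3-gram passes separately keeps the counts comparable, since a 2-gram is only ever compared with other 2-grams.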

answered Sep 21 '22 14:09

danben

Tags: algorithm, platform-agnostic, nlp