 

What are useful ranking algorithms for documents without links?

I've looked at Algorithms of the Intelligent Web, which describes (page 55) an interesting algorithm, called DocRank, for creating a PageRank-like score for business documents (i.e. documents without links, such as PDF and MS Word documents). In short, it analyzes the term-frequency intersection between the documents in a collection.

Can anyone identify interesting algorithms described elsewhere, or share something novel here, that could be applied to these types of documents to improve search results?

Please forgo answers involving things like click tracking or other actions NOT about analyzing the actual documents.

asked Dec 06 '10 by orangepips



2 Answers

First Technique: step-wise similarity

I can offer one example that I've actually tested and validated against real data. If you were to gather a number of techniques and rank them along two axes--ease of implementation (low inherent complexity) and performance (resolution, or predictive accuracy)--this technique would score high on the first axis and somewhere near the middle on the second: a simple and effective technique, but one that might underperform state-of-the-art methods.

We found that the combination of low-frequency keyword intersection and similarity among readers/viewers of a document is a fairly strong predictor of similarity in the documents' content. Put another way: if two documents share a similar set of very low-frequency terms (e.g., domain-specific terms like 'decision manifold') and they have similar inbound traffic profiles, that combination is strongly probative of the documents' similarity.

The relevant details:

First filter: low-frequency terms. We parsed a large set of documents to get the term frequency for each. We used this word-frequency spectrum as a 'fingerprint', which is common, but we applied an inverse weighting so that the common terms ('a', 'of', 'the') counted very little in the similarity measure, whereas the rare terms counted a lot (this is quite common, as you probably know).

Trying to determine whether two documents were similar based on this alone was problematic; for instance, two documents might share a list of rare terms relating to MMOs and still not be similar, because one is directed at playing MMOs and the other at designing them.

Second filter: readers. Obviously we didn't know who had read these documents, so we inferred readership from traffic sources. You can see how that helped in the example above: the inbound traffic for the MMO-player site/document reflected its content, and likewise for the document directed at MMO design.
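To make the combination concrete, here is a rough Python sketch of both filters (the weighted-Jaccard formulation and the cosine over traffic profiles are simplifications for illustration, not the exact measures described above):

```python
from collections import Counter
import math

def inverse_frequency_weights(token_lists):
    """Weight each term by the inverse of its collection frequency, so rare
    (domain-specific) terms dominate the similarity measure."""
    collection = Counter()
    for tokens in token_lists:
        collection.update(tokens)
    total = sum(collection.values())
    return {term: math.log(total / count) for term, count in collection.items()}

def weighted_term_similarity(tokens_a, tokens_b, weights):
    """First filter: weighted Jaccard overlap -- shared rare terms count a lot,
    shared common terms ('a', 'of', 'the') count very little."""
    a, b = set(tokens_a), set(tokens_b)
    shared = sum(weights[t] for t in a & b)
    union = sum(weights[t] for t in a | b)
    return shared / union if union else 0.0

def traffic_similarity(profile_a, profile_b):
    """Second filter: cosine similarity between inbound-traffic profiles,
    e.g. dicts mapping referrer -> visit count."""
    dot = sum(profile_a[k] * profile_b[k] for k in set(profile_a) & set(profile_b))
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def combined_score(tokens_a, tokens_b, profile_a, profile_b, weights, alpha=0.5):
    """Blend the two signals; both lie in [0, 1], so a weighted average works."""
    return (alpha * weighted_term_similarity(tokens_a, tokens_b, weights)
            + (1 - alpha) * traffic_similarity(profile_a, profile_b))
```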


Second Technique: kernel Principal Component Analysis (kPCA)

kPCA is an unsupervised technique (the class labels are removed from the data before the data is passed in). At the heart of the technique is just an eigenvector-based decomposition of a matrix (here, the kernel matrix, which stands in for the covariance matrix of the feature space). The technique handles non-linearity via the kernel trick, which maps the data to a higher-dimensional feature space and then performs PCA in that space. In Python/NumPy/SciPy it is about 25 lines of code.
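For reference, a bare-bones sketch of that decomposition in plain NumPy (the RBF kernel and the gamma value are illustrative choices, not necessarily the ones used for the results below):

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1e-3):
    """Minimal kernel PCA: RBF kernel matrix, centering in feature space,
    eigendecomposition, projection onto the leading components."""
    # pairwise squared Euclidean distances between rows of X
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    K = np.exp(-gamma * sq_dists)                  # RBF (Gaussian) kernel
    # center the kernel matrix (equivalent to centering in feature space)
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order; take the largest ones
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:n_components]
    alphas, lambdas = eigvecs[:, order], eigvals[order]
    # projections of the input points onto the kernel principal components
    return alphas * np.sqrt(np.clip(lambdas, 0, None))
```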

The data is collected from very simple text parsing of literary works--in particular, most of the published works of these four authors: Shakespeare, Jane Austen, Jack London, and Milton. (I believe, though I'm not certain, that typical college students take courses in which they are assigned to read works by these authors.)

This dataset is widely used in ML and is available from a number of places on the Web.

So these works were divided into 872 pieces (corresponding roughly to chapters in novels); in other words, about 220 different substantial pieces of text for each of the four authors.

Next, a word-frequency scan was performed on the combined corpus, and the 70 most common words were selected for the study; the remainder of the frequency-scan results were discarded.

Those 70 words were:

[ 'a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been',
  'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from', 'had',
  'has', 'have', 'her', 'his', 'if', 'in', 'into', 'is', 'it', 'its', 'may',
  'more', 'must', 'my', 'no', 'not', 'now', 'of', 'on', 'one', 'only', 'or', 
  'our', 'should', 'so', 'some', 'such', 'than', 'that', 'the', 'their', 
  'then', 'there', 'things', 'this', 'to', 'up', 'upon', 'was', 'were', 'what',
  'when', 'which', 'who', 'will', 'with', 'would', 'your', 'BookID', 'Author' ]

These became the field (column) names. Finally, one data row was prepared for each of the 872 texts (from the truncated word-frequency scan). Here's one of those data points:

[ 46, 12, 0, 3, 66, 9, 4, 16, 13, 13, 4, 8, 8, 1, 0, 1, 5, 0, 21, 12, 
  16, 3, 6, 62, 3, 3, 30, 3, 9, 14, 1, 2, 6, 5, 0, 10, 16, 2, 54, 7, 8,
  1, 7, 0, 4, 7, 1, 3, 3, 17, 67, 6, 2, 5, 1, 4, 47, 2, 3, 40, 11, 7, 5,
  6, 8, 4, 9, 1, 0, 1 ]

In sum, the data comprises 70 dimensions (each dimension is the frequency, or total count, of a particular word in a given text by one of these four authors).
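For anyone who wants to reproduce that preparation step, a condensed sketch (the tokenizer and the chapter-splitting are placeholders; only the counting logic matters):

```python
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_rows(chunks, vocab_size=70):
    """chunks: chapter-sized pieces of text. Returns the vocabulary of the most
    common words across the combined corpus and one count row per chunk."""
    counts = [Counter(tokenize(chunk)) for chunk in chunks]
    combined = Counter()
    for c in counts:
        combined.update(c)
    vocab = [word for word, _ in combined.most_common(vocab_size)]
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows
```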

Again, although this data is primarily used for supervised classification (the class labels are there for a reason), the technique I used was unsupervised--put another way, I never showed the class labels to the algorithm. The kPCA algorithm has absolutely no idea what the four clusters shown in the plot below correspond to, nor how each differs from the others--it did not even know how many groups (classes) the data comprised. I just gave it the data, and it partitioned it very neatly into four distinct groups based on an inherent ordering.

The results: [scatter plot omitted: the 872 texts, projected onto the leading kernel principal components, separate into four distinct clusters, one per author]

Again, the algorithm I used here is kPCA. Using Python, NumPy, and Matplotlib, the script that produced these results is about 80 lines of code--for the I/O, data processing, applying kPCA, and plotting the result.

Not much, but too much for an SO post. In any event, anyone who wants this code can get it from my repo. There is also a complete, well-documented kPCA algorithm coded in Python + NumPy in each of these Python packages (all available from mloss.org): shogun ('A Large Scale Machine Learning Toolbox'), sdpy (a set of modules directed to computer vision and machine learning), and mlpy ('Machine Learning in PYthon').
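Roughly, that script boils down to something like this (hypothetical file name; assumes a numeric CSV holding only the 70 word-count columns, and reuses the kernel_pca sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical input: one row per text, word-count columns only
data = np.loadtxt('author_word_counts.csv', delimiter=',')

# kernel_pca is the sketch shown earlier in this answer
projected = kernel_pca(data, n_components=2, gamma=1e-3)

plt.scatter(projected[:, 0], projected[:, 1], s=8)
plt.xlabel('kernel PC 1')
plt.ylabel('kernel PC 2')
plt.title('872 texts projected onto the first two kernel principal components')
plt.show()
```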

answered by doug

Another interesting algorithm, TextRank, sounds very similar to the DocRank algorithm referenced in the original question: graph-based, unsupervised, and iterated until convergence.

Java implementation.
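For the curious, a rough Python sketch of the TextRank idea applied to keyword extraction (the window size and damping factor follow the paper's usual defaults; this is an illustration, not the linked Java implementation):

```python
from collections import defaultdict
import re

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=10):
    """Graph-based and unsupervised: build a word co-occurrence graph, then
    iterate a PageRank-style update until (approximate) convergence."""
    words = re.findall(r"[a-z]+", text.lower())
    # undirected co-occurrence graph over a sliding window
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        new_scores = {}
        for w in graph:
            rank = sum(scores[u] / len(graph[u]) for u in graph[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

sample = ("Ranking algorithms score documents without links by analyzing "
          "term frequency and term co-occurrence within the documents themselves.")
print(textrank_keywords(sample, top_k=5))
```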

answered by orangepips