For coming up with webpage rankings, my understanding is that there is a query-specific score (such as how relevant a document is to the query entered into the search engine) and a query-independent score (such as the PageRank of the webpage).
My question is, how could these two kinds of scores be merged in such a way that no one score dominates too much? My own thinking is that some kind of linear combination might work, but I'm not entirely sure.
If anyone can answer how it is done in practice, it would be great. If not, theoretical answers are also appreciated.
Of course, this is part of Google's big secret, as Geza Kerecsenyi said.
But try to think of it from two points of view (I will explain in a very broad manner, but I hope you will catch the idea):
Say P is the page rank and Q is the query-document rank. Then you can write something like this: TotalRank = a*P + b*Q.
The second question is how to fit those a and b coefficients properly, right?
Well, here we can help ourselves with a "quality measure": a dataset of expected ("etalon") ranks for each query-page pair. We can collect this dataset manually; the more we collect, the more precise our measurement will be. Then we need a metric that tells us how good our TotalRank formula is. For example, it could be MSE (Mean Squared Error): put simply, it measures the difference between two values, your rank and the etalon rank, across the whole dataset. So the closer the MSE is to zero, the better you have fitted a and b, and the better your TotalRank formula satisfies your expectations.
Having such a measure, you can fit a and b manually, making sure that your TotalRank formula satisfies your rank expectations: you just watch the MSE get closer and closer to zero. But this is very routine work, so instead you can use...
Machine learning. I will not explain here how to apply machine learning to your concrete purpose - you can find all of that on the Internet, Coursera, etc. But I will say that, having such a "measuring dataset", it is enough to train an algorithm like Linear Regression (or something more sophisticated, like Decision Trees) to fit those a and b automatically.
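Sticking with the toy data above, a least-squares fit (here with NumPy; scikit-learn's LinearRegression would do the same job) is one way such an algorithm could recover a and b automatically:

```python
import numpy as np

# Same toy data as above: one row per query-page pair, columns are the
# ranking factors (P, Q); the target is the hand-labelled etalon rank.
X = np.array([[0.2, 0.9],
              [0.9, 0.7],
              [0.5, 0.1]])
y = np.array([0.8, 0.6, 0.2])

# Least-squares fit of TotalRank = a*P + b*Q (no intercept, to match the formula).
coeffs, _residuals, _rank, _sv = np.linalg.lstsq(X, y, rcond=None)
a, b = coeffs
print(f"a = {a:.3f}, b = {b:.3f}")

# The same call scales to many more factors: add one column per factor
# ("query words in the title", "words in bold", ...) and you get one weight each.
```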
And of course, in this way you can "mix" not only 2 but many more ranking factors into a single formula. That's how search engines combine factors like "presence of the query's words in the page title", "words marked in bold", etc.
Also, I would recommend having a glance at Stanford's Introduction to Information Retrieval book. It explains lots of questions like this.
P.S.: sorry for my poor English, and good luck! :)
Search engines usually keep this a secret, since it is a large part of how the magic is done (i.e. the proprietary bit), so I can only make educated guesses.
However, I think we need to start with the appreciation that it's likely not two completely independent scores we're combining. We're probably using all of the data in all of the places, instead of hand-picking what and where. Let's look at a potential example:
query: "dog"
returned objects to rank:
1. "dogs are awesome! find out more about owning a dog today!"
Query relevance: 9/10
From: some obscure blog that no-one cares about (2/10 according to PageRank)
2. "doge memes for you. Get the finest memes - doge and more!"
Query relevance: 7/10 (only 1 letter difference! Could be a typo, maybe?)
From: 9gag, first search result for anything trendy-related, so it must be good (9/10 according to PageRank)
However you try to bend and skew and weight the data, 9gag will end up on top despite being obviously wrong (sorry about the ridiculous example). Clearly, it's not as simple as shoving these two numbers together.
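To see why, here is the arithmetic with the made-up scores above under a plain linear blend: for most weightings, the irrelevant but high-PageRank page comes out on top.

```python
# Made-up scores from the example above, rescaled to 0-1.
dog_blog = {"relevance": 0.9, "pagerank": 0.2}   # relevant, obscure
doge_9gag = {"relevance": 0.7, "pagerank": 0.9}  # near-miss, very popular

def blend(page, a, b):
    return a * page["relevance"] + b * page["pagerank"]

# With an even 50/50 blend the wrong result wins comfortably:
print(blend(dog_blog, 0.5, 0.5))   # 0.55
print(blend(doge_9gag, 0.5, 0.5))  # 0.80
# Only when relevance is weighted very heavily (around a = 0.8 here)
# does the relevant page pull ahead: 0.76 vs 0.74.
```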
(notice how this section is longer than the previous one. Take it with a grain of salt.)
Imagine the entire web as a graph (in the graph-theory sense), or a 'map' of sorts, with interconnected stuff. The distances between points are the PageRank distances (some measure of how close-knit PageRank deems two sites, where a higher value means a larger distance and in turn a lower PageRank score; therefore pr_n = 1/sum(length of all edges connecting to n)), while the 'weights' inside the circles are the relevance to your query. Our job is to find the nodes which are relatively close to their peers (i.e. high PageRank score) but also have higher weights. We can then use an algorithm of your choosing to extract the ones that are best. But this way, we're still only getting the results we got before, where dogs and doge are only 1 letter apart. The reason is that we're ignoring the query scores of the other pages. Therefore, what we'll do is as follows:
(Yes, I realise it's not complete and is missing some connections. But I have reason to believe @Joebevo is a human, who would appreciate a visually interpretable graph and maths that doesn't go on for half an hour.)
Blue represents PageRank distance (i.e. how far the pages are from each other, so a lower average PageRank distance to all connected nodes represents a higher PageRank score).
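As a rough sketch of how that distance-to-score conversion might look in code (the graph and edge lengths below are invented, and the node names are placeholders rather than the colours in the figure):

```python
# Toy graph: edges as (node, node, PageRank distance). Names and lengths invented.
edges = [
    ("A", "B", 1.0),
    ("A", "C", 0.5),
    ("A", "D", 4.0),
    ("B", "C", 2.0),
]

def pagerank_score(node, edges):
    """pr_n = 1 / sum(length of all edges connecting to n):
    the more close-knit a node is with its neighbours, the higher its score."""
    total = sum(length for a, b, length in edges if node in (a, b))
    return 1.0 / total if total else 0.0

for node in ("A", "B", "C", "D"):
    print(node, round(pagerank_score(node, edges), 3))
```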
How can we see that what we've done makes sense? Well, look back at the first graph image. The green node was small and far away, and so it ended up getting a low score in this graph. Meanwhile, the purple node was big and (relatively) close to the blue, so it scored the highest. The red node was even closer, but due to its tiny size, it only came in second place.
Mathematically, we've not done anything complex - we've simply worked out an 'average' of sorts of the two scores, weighted by the importance of the middle node. This is the sort of algorithm that confuses 'doge' with 'dog'. The red node knows nothing about the orange; all it cares about is the blue. To fix this, we need to repeat the process.
To decide which node to go to next, we'll use this algorithm (it's based on the idea behind Dijkstra's algorithm, the classic shortest-path algorithm):
Note that with the white-text nodes, we can multiply by the distance instead of dividing by it, since we've already normalised it to a positive scale (jargon for 'made both axes increase as the result gets more accurate, instead of one increasing and the other decreasing').
The only node we've not updated or used in an update since the last update (that still counts as touched, since some connection between it and a sister node was altered) is orange, so we'll go there now (using purple for new nodes, green for new lines).
and then we'll go to red (green nodes, black lines):
and finally (before we halt) to green (red nodes, red lines):
So, to look through the results:
Let's analyse that second bit. It is indeed quite confusing at first, but we need to look at what we've actually just done on an abstract level. Imagine it like a circuit: there's current flowing from each of the cells/ammeters/power packs to the others, but through wires with a certain resistance. We're taking each node's value and spreading it to its neighbours based on the distance. Another analogy is being the ice-man, carrying ice around to houses in the hot summer. You'll gladly take an equal amount of ice to everyone, but a lot melts on the way to each house, so everyone gets a quantity inversely proportional to their distance. (I don't love this analogy, though, since it gives the idea that numbers can just 'leak out' from the nodes.)
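Here is a minimal sketch of that spreading step as I read the analogy -- each node keeps its own query score and receives its neighbours' scores attenuated by PageRank distance; the graph, node names, and values are again invented:

```python
# One pass of the "spreading" idea: each node keeps its own query relevance and
# receives its neighbours' relevance, attenuated by PageRank distance.
# Graph, names, and values are all invented for illustration.
relevance = {"A": 0.9, "B": 0.7, "C": 0.3, "D": 0.2}
edges = [("A", "B", 1.0), ("A", "C", 0.5), ("A", "D", 4.0), ("B", "C", 2.0)]

def neighbours(node):
    for a, b, dist in edges:
        if a == node:
            yield b, dist
        elif b == node:
            yield a, dist

def spread_once(node):
    """Node's own score plus whatever 'survives the trip' from each neighbour."""
    incoming = sum(relevance[other] / dist for other, dist in neighbours(node))
    return relevance[node] + incoming

for node in relevance:
    print(node, round(spread_once(node), 3))
```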
So, now let's go through step-by-step. Since we go straight to the purple-orange axis, via the red-green axis, we're essentially using them as holding points. Therefore, we're not using them for the first two steps for what they are. This is because, as I mentioned at the start of this section, we've not actually got a complete graph. This would fix it:
Now, not everything needs to be taken into account: only the square root of the subset (i.e. 50*sqrt(2)% of the nodes) needs to be connected: those which are separated by 1 or 2 nodes, but no more. Otherwise, things would get too clunky, as the algorithm for deciding the next node would become doubly recursive - it's bad enough as it is! (To be fair, there is a mathematical justification too, but it's beyond the scope of this answer; essentially, the numbers would get less close to the 'optimal' answer.)
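If you read "separated by 1 or 2 nodes" as a two-hop neighbourhood, a breadth-first search capped at two hops is one way to pick which extra connections to add (toy graph, invented names):

```python
from collections import deque

# Toy graph as undirected edges; names are placeholders.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

def within_two_hops(start, edges):
    """Nodes reachable from `start` in at most 2 edges (excluding start itself)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    reachable = set()
    while queue:
        node, hops = queue.popleft()
        if hops == 2:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                reachable.add(nxt)
                queue.append((nxt, hops + 1))
    return reachable

print(within_two_hops("A", edges))  # {'B', 'C'} -- D and E are too far away
```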
In conclusion, your notion of a query-independent score is technically correct, but it is important to note that it is not combined fully independently of the query. It depends on the other results to form a sort of weighted average, to make sure that two results that are on completely opposite ends of the spectrum are not given the same score (e.g. relevance 2 + PR 8 vs relevance 8 + PR 2). An irrelevant result is obviously no more relevant just because it has a high PageRank score, and a high PageRank score is useless if it was only obtained by being linked from query-irrelevant pages (e.g. although 9gag is linked to from a lot of places, if none of those places have anything to do with dogs, why would that high PageRank score mean anything?).
I know this answer is long, but I hope it answers your question quite clearly. Note that this is just one algorithm used, but it's enough to put 99% of devs off ever attempting a search engine.