 

How to estimate the quality of a web page?

I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality autogenerated pages, and if I use them, I can end up with wrong facts. How is it possible to estimate the quality/trustworthiness of a page?

You may think "nah, Google engineers have been working on this problem for ten years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as a bad one, users will be dissatisfied. I don't have such limitations, so if the algorithm accidentally marks some good pages as bad, that wouldn't be a problem.

Here's an example: say the input is buy aspirin in south la. Try searching Google for it. The first three results have already been deleted from their sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make it an active link)

Here's the first paragraph of the text:

The bare of purchasing prescription drugs from Canada is big in the U.S. at this moment. This is because in the U.S. prescription drug prices bang skyrocketed making it arduous for those who bang limited or concentrated incomes to buy their much needed medications. Americans pay more for their drugs than anyone in the class.

The rest of the text is similar, and then a list of related keywords follows. This is what I consider a low-quality page. While this particular text almost makes sense (aside from being horribly written), the other examples I've seen (but can't find now) are pure rubbish, whose only purpose is to pull some users in from Google before getting banned a day after creation.

asked by Fluffy

2 Answers

N-gram Language Models

You could try training one n-gram language model on the autogenerated spam pages and one on a collection of other non-spam webpages.

You could then simply score new pages with both language models to see if the text looks more similar to the spam webpages or regular web content.
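
To make that concrete, here is a minimal sketch in Python, assuming a very simple add-one-smoothed bigram model; the spam_docs and ham_docs lists are made-up placeholders, and a real setup would train on large collections of crawled pages (a toolkit like SRILM, mentioned below, handles higher-order n-grams, better smoothing and unknown words for you):

import math
from collections import Counter

def train_bigram_lm(corpus):
    # Count unigrams and bigrams over a list of example documents.
    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        tokens = ["<s>"] + doc.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(text, unigrams, bigrams):
    # Add-one smoothed log P(text | model); a higher score means the text
    # looks more like the model's training corpus.
    vocab = len(unigrams)
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
        for prev, cur in zip(tokens, tokens[1:]))

# Toy training data -- purely illustrative.
spam_docs = ["the bare of purchasing prescription drugs from canada is big",
             "buy cheap aspirin online no prescription needed"]
ham_docs = ["aspirin is a common over the counter pain reliever",
            "pharmacies in south los angeles that stock aspirin"]

spam_uni, spam_bi = train_bigram_lm(spam_docs)
ham_uni, ham_bi = train_bigram_lm(ham_docs)

page_text = "buy cheap prescription drugs online"
print(log_prob(page_text, spam_uni, spam_bi))  # log P(Text | Spam)
print(log_prob(page_text, ham_uni, ham_bi))    # log P(Text | Non-Spam)

Whichever model assigns the higher (log) probability is the one the page looks more like; the next section shows how to fold in a prior instead of comparing the raw scores directly.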

Better Scoring through Bayes' Law

When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, P(Text|Spam). The notation reads as the probability of Text given Spam (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, P(Text|Non-Spam).

However, the term you probably really want is P(Spam|Text) or, equivalently, P(Non-Spam|Text). That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it.

To get either of these, you'll need to use Bayes' law, which states

           P(B|A)P(A)
P(A|B) =  ------------
              P(B)

Using Bayes' law, we have

P(Spam|Text)=P(Text|Spam)P(Spam)/P(Text)

and

P(Non-Spam|Text)=P(Text|Non-Spam)P(Non-Spam)/P(Text)

P(Spam) is your prior belief that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even treat it as a parameter that you manually tune to trade off precision and recall. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.

The term P(Text) is the overall probability of finding Text on any webpage. If we ignore that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, this can be calculated as P(Text)=P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam). This sums out the binary variable Spam/Non-Spam.

Classification Only

However, if you're not going to use the probabilities for anything else, you don't need to calculate P(Text). Rather, you can just compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam). If the first one is bigger, the page is most likely a spam page, while if the second one is bigger the page is most likely non-spam. This works because the equations above for both P(Spam|Text) and P(Non-Spam|Text) are normalized by the same P(Text) value.
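
As a rough illustration of that comparison (a sketch only, with made-up log-likelihood numbers standing in for the scores a real language model would produce), the whole decision fits in a few lines of Python; it also shows the optional normalisation by P(Text), done in log space:

import math

# Made-up log-likelihoods standing in for the language-model scores above.
log_p_text_given_spam = -210.0   # log P(Text | Spam)
log_p_text_given_ham = -235.0    # log P(Text | Non-Spam)

# Prior belief that a random page is spam; tune it to trade off precision and recall.
p_spam = 0.2
log_num_spam = log_p_text_given_spam + math.log(p_spam)      # log P(Text|Spam)P(Spam)
log_num_ham = log_p_text_given_ham + math.log(1.0 - p_spam)  # log P(Text|Non-Spam)P(Non-Spam)

# Classification only: P(Text) cancels, so compare the numerators directly.
label = "spam" if log_num_spam > log_num_ham else "non-spam"

# If the actual posterior is wanted, normalise by P(Text) using log-sum-exp.
m = max(log_num_spam, log_num_ham)
log_p_text = m + math.log(math.exp(log_num_spam - m) + math.exp(log_num_ham - m))
p_spam_given_text = math.exp(log_num_spam - log_p_text)

print(label, p_spam_given_text)

Working in log space matters here: the raw probability of a whole page of text is astronomically small and would underflow to zero as an ordinary float.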

Tools

In terms of software toolkits you could use for something like this, SRILM would be a good place to start, and it's free for non-commercial use. If you want something you can use commercially without paying for a license, you could use IRST LM, which is distributed under the LGPL.

answered by dmcer


Define the 'quality' of a web page: what is the metric?

If someone were looking to buy fruit, then searching for 'big sweet melons' will give many results whose images have a 'non-textile' slant.

The markup and hosting of those pages may, however, be sound engineering...

But the page of a dirt farmer presenting his high-quality, tasty and healthy produce might be visible only in IE4.5, since the HTML is 'broken'...

answered by lexu