
How to detect keyword stuffing?

We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.

We have noticed that there is keyword-stuffing abuse.

We have identified two main kinds of abuse:

  1. Repeating the same term again and again
  2. Many irrelevant terms added to the document en masse

These two forms of abuse are enabled by either adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.

Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts. The same goes for font size: any cut-off seems potentially arbitrary, and we may accidentally remove valid text if we set the cut-off too high.

My question is - are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?

Any guidance would be appreciated!


1 Answer

There's a surprisingly simple solution to your problem using the notion of compressibility.

If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, using the zlib library, which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation would suggest that they have been "stuffed". The analysis is extremely fast: I have analyzed around 100k texts, and it takes only about a minute using Python.
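As a minimal sketch of this idea using Python's built-in zlib - the 4.0 threshold below is an illustrative guess, not something from my experiments, so calibrate it against your own corpus:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Size of the raw UTF-8 text divided by its zlib-compressed size."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

def looks_stuffed(text: str, threshold: float = 4.0) -> bool:
    # Normal prose compresses at roughly 2:1; heavily repeated keywords
    # push the ratio much higher. The threshold is an assumed value -
    # tune it on a sample of known-clean documents.
    return compression_ratio(text) > threshold
```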

Another option is to look at the statistical properties of the documents/words. In order to do that, you need a sample of "clean" documents, from which you calculate the mean frequency of each distinct word as well as its standard deviation.

Once you have done that, you can take a new document and compare it against those means and deviations. Stuffed documents will be characterized either by a few words with a very high deviation from their mean (documents where one or two words are repeated many times) or by many words with high deviations (documents with whole blocks of text repeated).
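A rough sketch of that comparison - the helper names and the z-score cutoff of 5 are illustrative choices on my part, not a prescribed recipe:

```python
import math
from collections import Counter

def word_frequencies(text: str) -> dict:
    """Relative frequency of each distinct word in a document."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def build_reference(clean_texts):
    """Per-word mean and standard deviation of relative frequency
    across a corpus of known-clean documents."""
    per_doc = [word_frequencies(t) for t in clean_texts]
    vocab = set().union(*per_doc)
    n = len(per_doc)
    stats = {}
    for w in vocab:
        freqs = [d.get(w, 0.0) for d in per_doc]
        mean = sum(freqs) / n
        var = sum((f - mean) ** 2 for f in freqs) / n
        stats[w] = (mean, math.sqrt(var))
    return stats

def suspicious_words(text, stats, z_cutoff=5.0):
    # Flag words whose frequency in this document sits many standard
    # deviations above the corpus mean; the cutoff is illustrative.
    flagged = []
    for w, f in word_frequencies(text).items():
        mean, std = stats.get(w, (0.0, 0.0))
        if std > 0 and (f - mean) / std > z_cutoff:
            flagged.append(w)
    return flagged
```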

Here are some useful links about compressibility:

http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf

http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf

You could also probably use the concept of entropy, for example this Shannon entropy calculation: http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
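For instance, a character-level version in Python - note that the ~4 bits-per-character baseline for English prose is an approximate figure, and repetitive stuffed text will score noticeably lower:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```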

Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were higher or lower for some POS tags, then you could possibly detect "stuffed" documents.
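Here is a sketch using NLTK's tokenizer and Penn Treebank tagger; the 37% baseline comes from the paper above, while the tolerance value is an assumed, illustrative choice:

```python
import nltk

# One-time model downloads:
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

def noun_fraction(text: str) -> float:
    """Share of tokens tagged as nouns (Penn Treebank NN* tags)."""
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    if not tags:
        return 0.0
    return sum(1 for _, tag in tags if tag.startswith("NN")) / len(tags)

def unusual_pos_profile(text: str, baseline=0.37, tolerance=0.15) -> bool:
    # Flag documents whose noun share strays far from the baseline;
    # the tolerance is a guess to be tuned on clean documents.
    return abs(noun_fraction(text) - baseline) > tolerance
```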
