How to detect keyword stuffing?

Question

We are working on a kind of document search engine - primary focused around indexing user-submitted MS word documents.

We have noticed, that there is keyword-stuffing abuse.

We have determined two main kinds of abuse:

Repeating the same term, again and again
Many, irrelevant terms added to the document en-masse

These two forms of abuse are enabled, by either adding text with the same font colour as the background colour of the document, or by setting the font size to be something like 1px.

Whilst determining if the background colour is the same as the text colour, it is tricky, given the intricacies of MS word layouts - the same goes for font size - as any cut-off seems potentially arbitrary - we may accidentally remove valid text if we set a cut-off too large.

My question is - are there any standardized pre-processing or statistical analysis techniques that could be use to reduce the impact of this kind of keyword stuffing?

Any guidance would be appreciated!

FRiverai · Accepted Answer

There's a surprisingly simple solution to your problem using the notion of compressibility.

If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, use zlib library which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any important deviation would mean that they have been "stuffed". The analyzing process is extremely easy, I have analyzed around 100k texts and it just takes around 1 minute using Python.

Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.

After you had done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will be characterized as those with a few words with very high deviation from the mean from that word (documents where one or two words are repeated several times) or many words with high deviations (documents with blocks of text repeated)

Here are some useful links about compressibility:

http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf

http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf

You could also probably use the concept of entropy, for example Shannon Entropy Calculation http://code.activestate.com/recipes/577476-shannon-entropy-calculation/

Another possible solution would be to use Part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% percent according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true) . If the percentage were higher or lower for some POS tags, then you could possibly detect "stuffed" documents.

How to detect keyword stuffing?

Tags:

c#

search

freetext

Dave Bish

1 Answers

FRiverai

Recent Activity

Donate For Us

How to detect keyword stuffing?

Tags:

c#

search

freetext

Dave Bish

1 Answers

FRiverai

Related questions

Recent Activity

Donate For Us