
Is there any way to detect strings like putjbtghguhjjjanika?

People search on my website, and some of the searches look like this:

tapoktrpasawe qweasd qwa as aıe qwo ıak kqw qwe qwe qwe a 

My question: is there any way to detect strings similar to the ones above?

I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)

Edit: I mean "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", or "iwepr wepr ow" in my search engine, and I want to detect these gibberish searches.

It doesn't matter whether a search returns 0 results or anything else; I can't use that logic.

Some new brands or products would be ignored if I only accepted "regular words".

Thank you for your help

ahe asked Jun 09 '11 19:06



2 Answers

You could build a model of character-to-character transitions from a large body of English text. For example, you count how often an 'h' follows a 't' (pretty common). In English, a 'q' is almost always followed by a 'u', so a 'q' followed by anything else is a low-probability transition and should be pretty alarming. Normalize the counts in each row of your table so that they become probabilities. Then, for a query, walk through the matrix and compute the product of the probabilities of the transitions you take, and normalize by the length of the query. When the resulting number is low, you likely have a gibberish query (or something in a different language).
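As a rough sketch of the idea above (not the code from the repositories linked in this answer), the transition table could be built and used like this in Python. The alphabet of lowercase letters plus space, the add-one smoothing, and the names `train` and `score` are assumptions made for illustration. The length normalization is done as a geometric mean of the transition probabilities, which is one way to read "normalize by the length of the query".

```python
import math

# Assumption: only lowercase ASCII letters and the space character are modeled.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text):
    """Lowercase the text and drop every character outside ALPHABET."""
    return [ch for ch in text.lower() if ch in INDEX]


def train(corpus_lines, smoothing=1.0):
    """Build a matrix of log transition probabilities from lines of text.

    `smoothing` is added to every cell (an assumption, not from the answer)
    so that unseen transitions get a small non-zero probability.
    """
    n = len(ALPHABET)
    counts = [[smoothing] * n for _ in range(n)]
    for line in corpus_lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[INDEX[a]][INDEX[b]] += 1
    # Normalize each row so it holds P(next char | current char), stored as logs.
    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


def score(query, log_probs):
    """Return the geometric mean of the transition probabilities in `query`.

    Summing logs avoids underflow from multiplying many small numbers.
    Queries with fewer than two modeled characters score 0.0.
    """
    chars = normalize(query)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    total = sum(log_probs[INDEX[a]][INDEX[b]] for a, b in pairs)
    return math.exp(total / len(pairs))
```

To use it, you would train on a large English corpus, score some known-good and known-gibberish strings, and pick a cutoff somewhere between the two groups. The cutoff has to be calibrated on your own data.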

If you have a bunch of query logs, you might first build a model of general English text and then heavily weight your own queries when training that model.
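One way to do that weighting, sketched under assumptions: count character pairs in the general English text with weight 1, then add the pairs from your query logs with a larger weight. The function names and the default `query_weight` of 10 are hypothetical; the right weight depends on how much log data you have compared to the general corpus.

```python
from collections import Counter


def bigram_counts(lines, weight=1.0):
    """Count adjacent character pairs, adding `weight` per occurrence."""
    counts = Counter()
    for line in lines:
        text = line.lower()
        for a, b in zip(text, text[1:]):
            counts[(a, b)] += weight
    return counts


def combined_counts(english_lines, query_lines, query_weight=10.0):
    """Merge general-English counts with up-weighted query-log counts.

    `query_weight` is an illustrative value, not a recommendation.
    """
    counts = bigram_counts(english_lines)
    # Counter.update adds to existing counts instead of replacing them.
    counts.update(bigram_counts(query_lines, weight=query_weight))
    return counts
```

The merged counts would then be normalized into transition probabilities in the usual way.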

For background, read about Markov Chains.

Edit: I implemented this in Python:

https://github.com/rrenaud/Gibberish-Detector

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
Rob Neuhaus answered Sep 20 '22 17:09



You could do what Stack Overflow does and calculate the entropy of the string.

Of course, this is just one of many heuristics SO uses to detect low-quality answers, and it should not be relied upon as 100% accurate.
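For illustration, here is a minimal sketch of a character-level Shannon entropy calculation in Python. It assumes "entropy of the string" means the entropy of its character frequency distribution; how Stack Overflow actually computes it is not stated here, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # Sum p * log2(p) over the distinct characters, then negate.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A string made of one repeated character scores 0, and a string whose characters are all different scores the maximum for its length. On its own this is likely better at catching very repetitive input than random key mashing, so any threshold would need to be checked against real queries.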

BlueRaja - Danny Pflughoeft answered Sep 22 '22 17:09
