People search in my website and some of these searches are these ones:
tapoktrpasawe qweasd qwa as aıe qwo ıak kqw qwe qwe qwe a
My question is there any way to detect strings that similar to ones above ?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
edit: I mean the "gibberish searches". For example some people search strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect jibberish searches.
It doesn't matter if search result will be 0 or anything else. I can't use this logic.
Some new brands or products will be ignored if I will consider "regular words".
Thank you for your help
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
Edit, I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True is this thing working? True i hope so True t2 chhsdfitoixcv False ytjkacvzw False yutthasxcvqer False seems okay True yay! True
You could do what Stackoverflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With