The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout".
Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh
Is there any software that does this already (preferably free and open source)?
If not, is there an active FOSS project whose goal is to achieve this?
If not, how would you suggest implementing such software?
If a bigram (a pair of adjacent letters) in the analyzed text is made of keys that are close together on a QWERTY keyboard but has near-zero statistical frequency in English (like the pairs "fg" or "cd"), then there is a chance that random keyboard hits are involved. The more such pairs are found, the greater that chance.
If you want to take into account the use of both hands for bashing, test letters that are separated by one other letter for QWERTY closeness, but use the two bigrams (or the trigram) they span for frequency. For example, in the text "flsjf" you would check F and S for QWERTY distance, but the bigrams FL and LS (or the trigram FLS) for frequency.
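A minimal Python sketch of this heuristic follows. Everything named here is hypothetical: the key coordinates (including the row stagger), the `max_dist` threshold of two key widths, and the `rare` cutoff are assumptions to tune, and the reference bigram counts have to come from a corpus you supply.

```python
import re
from collections import Counter

# Letter rows of a US QWERTY keyboard. The horizontal stagger of each
# row (in key widths) is an approximation, not a measured value.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.25, 0.75]

KEY_POS = {
    ch: (col + ROW_OFFSETS[row], float(row))
    for row, keys in enumerate(QWERTY_ROWS)
    for col, ch in enumerate(keys)
}


def key_distance(a, b):
    """Euclidean distance between two letter keys, in key widths."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5


def bigram_counts(text):
    """Count letter bigrams inside words of a reference text."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def mash_score(text, reference, max_dist=2.0, rare=0):
    """Fraction of letter pairs that look like keyboard mashing.

    gap=1 compares neighbouring letters (one-handed bashing);
    gap=2 compares letters separated by one other letter (two-handed).
    A pair is a hit when its keys are within max_dist of each other and
    every bigram it spans occurs at most `rare` times in `reference`.
    """
    hits = total = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        for gap in (1, 2):
            for i in range(len(word) - gap):
                total += 1
                close = key_distance(word[i], word[i + gap]) <= max_dist
                spanned = [word[j:j + 2] for j in range(i, i + gap)]
                if close and all(reference.get(b, 0) <= rare for b in spanned):
                    hits += 1
    return hits / total if total else 0.0
```

You would build `reference` once with `bigram_counts()` over a large body of ordinary English text, then flag edits whose `mash_score()` exceeds a threshold chosen on known vandalism samples. With a small reference corpus many legitimate bigrams will look rare, so the `rare` cutoff only becomes meaningful with a large one.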
Consider the empirical distribution of sequences of two letters, i.e. "the probability of seeing letter a given that it follows letter b". All these probabilities fill a table of size 27x27 (treating the space as a letter).
Now compare this with historical data from a bunch of English/French/whatever texts. Use the Kullback-Leibler divergence for the comparison.
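Here is a rough Python sketch of that comparison, under a few assumptions of mine: everything that is not a-z is folded into a space (so accented letters are lost), the reference table gets additive smoothing so that unseen transitions do not produce a division by zero, and the per-letter divergences are weighted by how often each preceding letter occurs in the sample (a conditional Kullback-Leibler divergence). The function names and the `smoothing` default are hypothetical.

```python
import math
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space = 27 symbols


def normalize(text):
    """Lowercase and replace every run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def transition_counts(text):
    """Count ordered pairs (previous symbol, current symbol)."""
    s = normalize(text)
    return Counter(zip(s, s[1:]))


def conditional_kl(sample, reference, smoothing=0.1):
    """KL divergence (in nats) of the sample's transition table from the
    reference's, averaged over the sample's pair distribution.

    Sample probabilities are raw frequencies; reference probabilities
    use additive smoothing so that they are never zero.
    """
    p = transition_counts(sample)
    q = transition_counts(reference)
    p_total = sum(p.values())
    if p_total == 0:
        return 0.0

    # Row totals: how often each symbol appears as the previous symbol.
    p_row, q_row = Counter(), Counter()
    for (prev, _), n in p.items():
        p_row[prev] += n
    for (prev, _), n in q.items():
        q_row[prev] += n

    kl = 0.0
    for (prev, cur), n in p.items():
        p_cond = n / p_row[prev]
        q_cond = (q[(prev, cur)] + smoothing) / (
            q_row[prev] + smoothing * len(ALPHABET)
        )
        kl += (n / p_total) * math.log(p_cond / q_cond)
    return kl
```

A larger value means the text's letter transitions look less like the reference language. The absolute numbers depend heavily on the size of the reference corpus and on `smoothing`, so a cutoff would have to be calibrated on real edits rather than taken from this sketch.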