Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What string distance algorithm is best for measuring typing accuracy?

I'm trying to write a function that detects how accurate the user typed a particular phrase/sentence/word/words. My objective is to build an app to train the user's typing accuracy of certain phrases.

My initial instinct is to use the basic levenshtein distance algorithm (mostly because that's the only algo I knew off the top of my head).

But after a bit more research, I saw that Jaro-Winkler is a slightly more interesting algorithm because of its consideration for transpositions.

I even found a link that talks about the differences between these algorithms:

Difference between Jaro-Winkler and Levenshtein distance?

Having read all that, in addition to the respective Wikipedia posts, I am still a little clueless as to which algorithm fits my objective the best.

like image 983
adrianmc Avatar asked Jan 11 '17 21:01

adrianmc


3 Answers

Since you are grading the quality of typing, and you want to train the student to make zero mistakes, you should use Levenshtein distance, because it is less forgiving.

Additionally, Levenshtein score is more intuitive to understand, and easier to represent graphically, than the Jaro-Winkler results. You can modify Levenshtein algorithm to report insertions, deletions, and mistypes separately, and show end-users a list of corrections. Jaro-Winkler, on the other hand, gives you a score that is hard to show to end-user, because penalties for misspelling in the middle are lower than penalties at the end.

like image 97
Sergey Kalinichenko Avatar answered Nov 15 '22 07:11

Sergey Kalinichenko


Slightly tongue-in-cheek, but only slightly: build a generative model for typing that gives high (prior) probability to hitting the right letter, and apportion out some probabilities for hitting two neighboring keys at once, two keys from different hands in the wrong order, two keys from the same hand in the wrong order, a key near the correct one, a key far from the correct one, etc. Or perhaps less ad-hoc: give your model a probability for a given sequence of keypresses given the current pair of keys needed to continue the passage. You could do a lot of things with such a model; for example, you could get a "distance"-like metric by giving a likelihood score for the learner's actual performance. But even better would be to give them a report summarizing which kinds of errors they make the most -- after all, why boil their performance down to a single number when many numbers would do? Bonus points if you learn the probabilities for the different kinds of errors from a large corpus of real typists' work.

like image 22
Daniel Wagner Avatar answered Nov 15 '22 08:11

Daniel Wagner


I mostly agree with the answer given by dasblinkenlight, however, would suggest to use the Damerau-Levenshtein distance instead of only Levenshtein, that is, including transpositions. Transpositions are fairly frequent and easy to make while typing, and there is no good reason why they should incur a double distance penalty with respect to the other possible errors (insertions, deletions, and substitutions).

like image 44
fnl Avatar answered Nov 15 '22 07:11

fnl