
Check for commonly mis-recognized characters in a string against a list of known strings

Tags:

php

php-5.5

Background

I have a list of codes in my (MySQL) database, each consisting of six (6) characters. The characters are numbers and letters chosen at random. The codes are considered case-insensitive, but they are stored as uppercase in the database. They may contain the number 0 but never the letter O. I use these codes for one-off authentication of users.

The Problem

The codes have been handwritten on cards, and unfortunately some letters and numbers may look alike to some individuals. This is why I initially didn't use the letter O: it looks too much like a handwritten 0.

What I've done so far

I am able to check a code (case-insensitively) against user input and determine if it is an exact match. If it's not, I silently replace any O's with 0's and try again.

Question

My question is: how can I do this for other letters and numbers, such as those listed below, and still be relatively confident I'm not authenticating a user as someone they are not? In these cases, both characters of a pair can exist in a code. I have looked at the Levenshtein function in PHP (http://php.net/manual/en/function.levenshtein.php) as well as similar_text() (http://php.net/manual/en/function.similar-text.php), but neither is quite what I want, so I'm thinking I might have to roll my own (possibly using them) to achieve this.

Similar characters:

S <=> 5
G <=> 6
I <=> 1
alecho asked Mar 19 '23 18:03


1 Answer

The problem you're describing is essentially one of hash collisions: you have multiple possible input values, and you want them to resolve down into a single unambiguous key. I have a couple of thoughts here.

As @bishop suggested, what you really need to determine is if any given input is unambiguous or not. My approach would be slightly different though:

For any given input, I would generate a list of all possible matching keys, and query the database for the entire list. If only one result is returned, then there is no problem and you can proceed based on that single record. It doesn't matter in this case if the user enters ABCDE5 or ABCDES because there's only one possible match in the database for either one.

In the event that more than one result is returned however, you have no way of determining if the user's input was accurate or if it was mis-keyed.
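The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: the table name `codes`, the column name `code`, and the helper names `buildCandidateQuery()` and `lookUpCode()` are all hypothetical, and the list of candidate keys is assumed to have been generated already.

```php
<?php
// Hypothetical schema: a table `codes` with a column `code`.
// Both names are assumptions made for this sketch.

// Build a parameterised query with one placeholder per candidate key.
function buildCandidateQuery(array $candidates)
{
    $placeholders = implode(', ', array_fill(0, count($candidates), '?'));
    return "SELECT code FROM codes WHERE code IN ($placeholders)";
}

// Query for every candidate at once and classify the outcome as
// 'none', 'unique' (safe to proceed) or 'ambiguous' (ask the user again).
function lookUpCode(PDO $pdo, array $candidates)
{
    if (count($candidates) === 0) {
        return ['status' => 'none', 'code' => null];
    }
    $stmt = $pdo->prepare(buildCandidateQuery($candidates));
    $stmt->execute(array_values($candidates));
    $matches = $stmt->fetchAll(PDO::FETCH_COLUMN);

    if (count($matches) === 1) {
        return ['status' => 'unique', 'code' => $matches[0]];
    }
    $status = count($matches) === 0 ? 'none' : 'ambiguous';
    return ['status' => $status, 'code' => null];
}
```

Only the 'unique' status carries a code to authenticate against; the other two statuses leave the decision to the calling code.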

(In hindsight, it would have been best to design the keys so that none of the ambiguous character pairs were possible. Only allowing "S" and disallowing "5", for example, allows you to guarantee there will only ever be a single match for any given input, whether the user types "S" or "5", because you could always safely convert any 5's you see in input to S's knowing that they were input errors. In fact, depending on the exact values, you may be able to retroactively modify many or all of the keys in the database to follow this rule and make lookups less cumbersome.)
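As a rough sketch of that idea, assuming the canonical alphabet keeps the letters S, G and I plus the digit 0 (the direction of each mapping is my assumption, as are the function names `canonicalizeCode()` and `findCollisions()`), input could be normalised before lookup, and the existing codes could be checked for clashes before migrating them:

```php
<?php
// Assumed canonical alphabet: keep S, G, I and the digit 0; treat
// 5, 6, 1 and the letter O in user input as look-alike errors.
function canonicalizeCode($input)
{
    return strtr(strtoupper(trim($input)), '561O', 'SGI0');
}

// Group existing codes by canonical form. Any group with more than one
// member would collide if the stored codes were rewritten canonically.
function findCollisions(array $codes)
{
    $groups = [];
    foreach ($codes as $code) {
        $groups[canonicalizeCode($code)][] = $code;
    }
    return array_filter($groups, function ($group) {
        return count($group) > 1;
    });
}
```

If `findCollisions()` returns an empty array for the stored codes, every code could in principle be rewritten to its canonical form and looked up with a single exact match.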

Anyway, in that ambiguous case, I don't think you have any choice but to push back to the user and ask them to re-check their input, hopefully explaining the possible gotchas in an on-screen message.

EDIT:

Here's an example for generating the possible values a user meant to enter based on the single input they actually provided:

<?php

$inputs = [
        'ABCDEF', // No ambiguity, DB should return 0 or 1 match.
        'AAAAA1', // One ambiguous char, user could have meant `AAAAAI`
                  // instead so search DB for both.
        '156ISG', // Worst case. If the DB values overlap a lot, there
                  // wouldn't be much hope of "guessing" what the user
                  // actually meant.
];

foreach ($inputs as $input) {
    print_r(generatePossibleMatches($input));
}

//----------------------------------------
// Build every string the user could have meant by expanding each ambiguous
// character into both of its look-alikes, one position at a time. The
// original input is always the first element of the result.
function generatePossibleMatches($input) {
    $input = strtoupper($input);
    $ambiguous = [
        'I' => '1',
        'G' => '6',
        'S' => '5',
    ];
    // Make the lookup work in both directions (letter => number and back).
    $lookalikes = $ambiguous + array_flip($ambiguous);
    $possibles = [''];
    foreach (str_split($input) as $char) {
        $choices = isset($lookalikes[$char]) ? [$char, $lookalikes[$char]] : [$char];
        $next = [];
        // Extend every prefix built so far with each choice for this position,
        // so combinations of several swapped characters are covered as well.
        foreach ($possibles as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
            }
        }
        $possibles = $next;
    }
    return $possibles;
}
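To put a number on the "worst case" comment: each ambiguous character doubles the number of strings a complete search has to cover, so an input with k ambiguous characters has 2^k candidates. A small sketch of that arithmetic (the function name `countCandidates()` is hypothetical):

```php
<?php
// Count how many candidate strings a complete expansion would contain:
// two choices for every ambiguous character, one for every other.
function countCandidates($input)
{
    $ambiguousChars = preg_match_all('/[I1G6S5]/', strtoupper($input));
    return 1 << $ambiguousChars;
}
```

For a six-character code that is at most 64 candidates, which is still small enough to send to the database in a single query.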
beporter answered Apr 07 '23 23:04