I had no luck finding a package for this, ideally in Python. Is there a library that lets one compare two strings graphically, that is, by how similar they look?
It would, for instance, be helpful to fight against spam, when one uses я instead of R, or worse, things like Α (capital alpha, 0x0391) instead of A, to obfuscate their strings.
The interface to such a package could be something like
distance("Foo", "Bar") # large distance
distance("Αяe", "Are") # small distance
Thanks!
I'm not aware of a package that does this. However, you may be able to use tools like the homoglyph attack generator, the Unicode Consortium's confusables, references from Wikipedia's page on the IDN homograph attack, or other such resources to build your own library of look-alikes and compute a score based on that.
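To make the "build a score" idea concrete, here is a minimal sketch, not a finished library. It assumes a tiny hand-built look-alike table (`LOOKALIKES` is hypothetical and far from complete), replaces each look-alike with the character it imitates, and then takes a normalized Levenshtein distance. Case-folding both strings is also my assumption, made so that я (which imitates a capital R) still matches the lowercase r in "Are".

```python
# Hypothetical, hand-built table: look-alike character -> character it imitates.
# A real table would be much larger; these four entries are only illustrative.
LOOKALIKES = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u044f": "R",  # CYRILLIC SMALL LETTER YA (looks like a mirrored R)
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def fold_lookalikes(text):
    """Replace known look-alikes, then case-fold (an assumption of this sketch)."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text).casefold()


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def distance(a, b):
    """Visual distance in [0, 1]: 0 means the strings look the same after folding."""
    fa, fb = fold_lookalikes(a), fold_lookalikes(b)
    longest = max(len(fa), len(fb))
    if longest == 0:
        return 0.0
    return edit_distance(fa, fb) / longest


print(distance("Foo", "Bar"))               # large distance
print(distance("\u0391\u044fe", "Are"))     # small distance
```

A refinement would be to charge a small non-zero cost for a look-alike substitution instead of treating it as an exact match, so that an obfuscated string scores close to, but not identical with, the original.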
EDIT: It looks as though the Unicode folks have compiled a great big database of characters that look alike. It's available here. If I were you, I'd build a script to read this into a Python dictionary and then scan your string for matches. An excerpt is:
FF4A ; 006A ; MA # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ; 006A ; MA # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J #
1D423 ; 006A ; MA # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J #
1D457 ; 006A ; MA # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J #
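A rough sketch of such a script follows. It assumes the line layout shown in the excerpt (source code point, target code points, and a type field, separated by semicolons, with `#` starting a comment). The function names `parse_confusables`, `skeleton`, and `find_lookalikes` are mine, not part of any library, and the sample data is just the excerpt above with its comments shortened.

```python
def parse_confusables(lines):
    """Build {look-alike character: replacement string} from confusables-style lines."""
    table = {}
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a data line
        # Each field holds one or more space-separated hexadecimal code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(text, table):
    """Replace every character found in the table with its look-alike target."""
    return "".join(table.get(ch, ch) for ch in text)


def find_lookalikes(text, table):
    """Return (index, character, target) for each suspicious character in text."""
    return [(i, ch, table[ch]) for i, ch in enumerate(text) if ch in table]


# Sample input: the excerpt above, with the comments shortened.
SAMPLE = """\
FF4A ; 006A ; MA # FULLWIDTH LATIN SMALL LETTER J
2149 ; 006A ; MA # DOUBLE-STRUCK ITALIC SMALL J
1D423 ; 006A ; MA # MATHEMATICAL BOLD SMALL J
1D457 ; 006A ; MA # MATHEMATICAL ITALIC SMALL J
"""

table = parse_confusables(SAMPLE.splitlines())
print(skeleton("\U0001d423ava", table))
print(find_lookalikes("a\u2149", table))
```

For the full database you would pass the lines of a downloaded copy instead of `SAMPLE`, for example by iterating over a file opened with `encoding="utf-8-sig"` in case it starts with a byte order mark. Two strings can then be treated as look-alikes when their skeletons are equal, or the skeletons can be fed into an edit distance to get a graded score.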