
Find a distance measure of graphical similarity of two strings

Tags: python, string

I had no luck finding a package for this, ideally in Python. Is there a library that can compare two strings graphically, that is, by how similar they look?

It would, for instance, help fight spam in which someone uses я instead of R or, worse, something like Α (Greek capital alpha, U+0391) instead of A to obfuscate a string.

The interface to such a package could be something like

distance("Foo", "Bar")  # large distance
distance("Αяe", "Are")  # small distance

Thanks!

tobast asked Feb 08 '18 08:02


1 Answer

I'm not aware of a package that does this. However, you could use resources such as the homoglyph attack generator, the Unicode Consortium's confusables data, or the references on Wikipedia's page on the IDN homograph attack to build your own table of look-alikes and compute a score from it.
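As a rough sketch of that idea, one could fold each look-alike to a canonical character and then take a plain Levenshtein edit distance. Everything named here is an assumption for illustration: the two-entry `LOOKALIKES` table is hand-written (я is mapped to R only because the question treats it that way), and `canonical`, `levenshtein`, and `distance` are hypothetical helpers, not part of any existing package.

```python
# Hypothetical, hand-made look-alike table; a real one would be far larger
# and built from a curated source rather than typed in by hand.
LOOKALIKES = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA looks like Latin A
    "\u044f": "R",  # CYRILLIC SMALL LETTER YA, treated as R in the question
}


def canonical(text):
    """Replace every known look-alike with its canonical character."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def distance(a, b):
    """Edit distance after folding look-alikes to canonical characters."""
    return levenshtein(canonical(a), canonical(b))
```

With this table, `distance("Foo", "Bar")` is 3, while `distance("Αяe", "Are")` is 1, because after folding only the case of the second letter differs.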

EDIT: It looks as though the Unicode folks have compiled a large database of characters that look alike. It's available here. If I were you, I'd write a script that reads it into a Python dictionary and then scans your string for matches. An excerpt:

FF4A ;  006A ;  MA  # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J # 
1D423 ; 006A ;  MA  # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J  # 
1D457 ; 006A ;  MA  # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J  # 
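A minimal sketch of such a script, assuming the whole file uses the same layout as the excerpt (semicolon-separated hexadecimal code points, with `#` starting a comment). The names `parse_confusables` and `skeleton` are hypothetical, and `EXCERPT` only restates the four lines above with shortened comments so the example is self-contained.

```python
# The four excerpt lines, with comments shortened to plain ASCII.
EXCERPT = """\
FF4A ;  006A ;  MA  # FULLWIDTH LATIN SMALL LETTER J
2149 ;  006A ;  MA  # DOUBLE-STRUCK ITALIC SMALL J
1D423 ; 006A ;  MA  # MATHEMATICAL BOLD SMALL J
1D457 ; 006A ;  MA  # MATHEMATICAL ITALIC SMALL J
"""


def parse_confusables(lines):
    """Build a dict mapping each look-alike to its target string."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a data line
        # Each field holds one or more space-separated hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(text, table):
    """Replace every character found in the table with its target."""
    return "".join(table.get(ch, ch) for ch in text)


table = parse_confusables(EXCERPT.splitlines())
print(skeleton("\uff4aava\u2149", table))  # both look-alikes become "j"
```

To read the real file instead, the same function could be passed an open file handle, since it only iterates over lines. Two strings can then be compared by checking whether their skeletons are equal, or by feeding the skeletons to an edit distance.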
Richard answered Oct 04 '22 03:10