I’m having a crack at profanity filtering for a web forum written in Python.
As part of that, I’m attempting to write a function that takes a word, and returns all possible mock spellings of that word that use visually similar characters in place of specific letters (e.g. s†å©køv€rƒ|øw).
I expect I’ll have to expand this list over time to cover people’s creativity, but is there a list floating around anywhere on the internet that I could use as a starting point?
In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar.
For example, the characters used by the English language consist of the letters of the alphabet, numerals, punctuation marks and a variety of symbols (e.g., the ampersand, the dollar sign and the arithmetic symbols).
The 33 characters classified as ASCII Punctuation & Symbols are also sometimes referred to as ASCII special characters.
This is probably both far deeper than you need and not wide enough to cover your use case, but the Unicode Consortium have had to deal with attacks against internationalised domain names and came up with this list of homographs (characters with the same or similar rendering):
http://www.unicode.org/Public/security/latest/confusables.txt
Might make a starting point at least.
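For anyone wanting to load that file from Python, here is a rough sketch of a parser. It assumes the layout the file uses at the time of writing (hex source code point, then `;`, then one or more hex target code points, then a type field, with `#` starting a comment). The names `SAMPLE` and `parse_confusables` are made up for this example, and the sample lines are a tiny hand-copied excerpt, not the real file.

```python
from collections import defaultdict

# Tiny excerpt in the confusables.txt layout: source ; target ; type # comment
SAMPLE = """\
# sample excerpt, not the full file
0441 ;	0063 ;	MA	# ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C
0435 ;	0065 ;	MA	# ( е → e ) CYRILLIC SMALL LETTER IE → LATIN SMALL LETTER E
03BF ;	006F ;	MA	# ( ο → o ) GREEK SMALL LETTER OMICRON → LATIN SMALL LETTER O
"""


def parse_confusables(text):
    """Map each target string to the set of characters that resemble it."""
    lookalikes = defaultdict(set)
    for line in text.splitlines():
        # Drop a leading byte order mark, trailing comments and whitespace.
        line = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is a space-separated run of hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        lookalikes[target].add(source)
    return lookalikes


table = parse_confusables(SAMPLE)
print(sorted(table))  # the plain characters that have lookalikes in the sample
```

The file maps each odd character to its plain prototype, so the sketch inverts it: you look up a plain letter and get back the characters that imitate it, which is the direction a filter needs.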
http://en.wikipedia.org/wiki/Letterlike_Symbols
It's much much much less comprehensive but is more comprehensible.
I created a Python class to do exactly this, based on Robin's Unicode link for "confusables":
https://github.com/wanderingstan/Confusables
For example, "Hello" would get expanded into the following set of regexp character classes:
[H\H\ℋ\ℌ\ℍ\𝐇\𝐻\𝑯\𝓗\𝕳\𝖧\𝗛\𝘏\𝙃\𝙷\Η\𝚮\𝛨\𝜢\𝝜\𝞖\Ⲏ\Н\Ꮋ\ᕼ\ꓧ\𐋏\Ⱨ\Ң\Ħ\Ӊ\Ӈ]
[e\℮\e\ℯ\ⅇ\𝐞\𝑒\𝒆\𝓮\𝔢\𝕖\𝖊\𝖾\𝗲\𝘦\𝙚\𝚎\ꬲ\е\ҽ\ɇ\ҿ]
[l\\|\∣\⏽\│1\\۱\𐌠\\𝟏\𝟙\𝟣\𝟭\𝟷I\I\Ⅰ\ℐ\ℑ\𝐈\𝐼\𝑰\𝓘\𝕀\𝕴\𝖨\𝗜\𝘐\𝙄\𝙸\Ɩ\l\ⅼ\ℓ\𝐥\𝑙\𝒍\𝓁\𝓵\𝔩\𝕝\𝖑\𝗅\𝗹\𝘭\𝙡\𝚕\ǀ\Ι\𝚰\𝛪\𝜤\𝝞\𝞘\Ⲓ\І\Ӏ\\\\\\\\\ⵏ\ᛁ\ꓲ\𖼨\𐊊\𐌉\\\ł\ɭ\Ɨ\ƚ\ɫ\\\\\ŀ\Ŀ\ᒷ\🄂\⒈\\⒓\㏫\㋋\㍤\⒔\㏬\㍥\⒕\㏭\㍦\⒖\㏮\㍧\⒗\㏯\㍨\⒘\㏰\㍩\⒙\㏱\㍪\⒚\㏲\㍫\lj\IJ\‖\∥\Ⅱ\ǁ\\𐆙\⒒\Ⅲ\𐆘\㏪\㋊\㍣\Ю\⒑\㏩\㋉\㍢\ʪ\₶\Ⅳ\Ⅸ\ɮ\ʫ\㏠\㋀\㍙]
[l\\|\∣\⏽\│1\\۱\𐌠\\𝟏\𝟙\𝟣\𝟭\𝟷I\I\Ⅰ\ℐ\ℑ\𝐈\𝐼\𝑰\𝓘\𝕀\𝕴\𝖨\𝗜\𝘐\𝙄\𝙸\Ɩ\l\ⅼ\ℓ\𝐥\𝑙\𝒍\𝓁\𝓵\𝔩\𝕝\𝖑\𝗅\𝗹\𝘭\𝙡\𝚕\ǀ\Ι\𝚰\𝛪\𝜤\𝝞\𝞘\Ⲓ\І\Ӏ\\\\\\\\\ⵏ\ᛁ\ꓲ\𖼨\𐊊\𐌉\\\ł\ɭ\Ɨ\ƚ\ɫ\\\\\ŀ\Ŀ\ᒷ\🄂\⒈\\⒓\㏫\㋋\㍤\⒔\㏬\㍥\⒕\㏭\㍦\⒖\㏮\㍧\⒗\㏯\㍨\⒘\㏰\㍩\⒙\㏱\㍪\⒚\㏲\㍫\lj\IJ\‖\∥\Ⅱ\ǁ\\𐆙\⒒\Ⅲ\𐆘\㏪\㋊\㍣\Ю\⒑\㏩\㋉\㍢\ʪ\₶\Ⅳ\Ⅸ\ɮ\ʫ\㏠\㋀\㍙]
[o\ం\ಂ\ം\ං\०\੦\૦\௦\౦\೦\൦\๐\໐\၀\\۵\o\ℴ\𝐨\𝑜\𝒐\𝓸\𝔬\𝕠\𝖔\𝗈\𝗼\𝘰\𝙤\𝚘\ᴏ\ᴑ\ꬽ\ο\𝛐\𝜊\𝝄\𝝾\𝞸\σ\𝛔\𝜎\𝝈\𝞂\𝞼\ⲟ\о\ჿ\օ\\\\\\\\\\\\\\\\\\\\\ഠ\ဝ\𐓪\𑣈\𑣗\𐐬\\ø\ꬾ\ɵ\ꝋ\ө\ѳ\ꮎ\ꮻ\ꭴ\\ơ\œ\ɶ\∞\ꝏ\ꚙ\ൟ\တ]
This regexp will match against "𝓗℮𝐥1೦".
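To show the general idea without depending on that library, here is a minimal sketch that builds one character class per letter and joins them into a single pattern. It is not the code from the linked repository. `LOOKALIKES` is a small hand-picked table and `lookalike_pattern` is a hypothetical helper name, both invented for this example; a real table would be built from confusables.txt.

```python
import re

# Hypothetical, hand-picked lookalikes; a real table would be far larger.
LOOKALIKES = {
    "h": "ℋ𝓗",
    "e": "℮е",
    "l": "1|𝐥",
    "o": "0೦ο",
}


def lookalike_pattern(word):
    """Compile a regex with one character class per letter of `word`."""
    classes = []
    for ch in word:
        # Accept the letter in either case plus any known lookalikes.
        candidates = {ch, ch.lower(), ch.upper()}
        candidates |= set(LOOKALIKES.get(ch.lower(), ""))
        # re.escape keeps characters such as "|" or "]" literal in the class.
        escaped = "".join(re.escape(c) for c in sorted(candidates))
        classes.append("[" + escaped + "]")
    return re.compile("".join(classes))


pattern = lookalike_pattern("Hello")
print(bool(pattern.fullmatch("𝓗℮𝐥1೦")))
```

Matching with a regexp like this avoids generating every possible mock spelling up front, which grows very quickly: a five-letter word with twenty lookalikes per letter already has over three million variants.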