
Canonicalisation of usernames [closed]

What is the best way to get a canonical representation of a username that is idempotent?

I want to avoid having the same issue as Spotify: http://labs.spotify.com/2013/06/18/creative-usernames/

I'm looking for a good library to do this in Python. I would prefer not to do what Spotify ended up doing (running the canonicalisation twice to test whether it is idempotent), and importing Twisted into my project is a tad overkill. Is there a stand-alone library for this?

Would it be preferable to use email addresses instead of usernames? How do major sites/companies deal with this?

X-Istence asked Nov 13 '22 01:11


1 Answer

First, you should read Wikipedia's article on Unicode equivalence. It explains the caveats and the normalization forms (NFC, NFD, NFKC and NFKD) that can be used to represent a Unicode string in its canonical form.

Then you can use Python's built-in unicodedata module to normalize the Unicode string to your preferred normalization form.

A code example:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', u'ffñⅨffi⁵KaÅéᴮᴵᴳᴮᴵᴿᴰ')
'ffñIXffi5KaÅéBIGBIRD'
>>> unicodedata.normalize('NFKC', u'ffñⅨffi⁵KaÅéᴮᴵᴳᴮᴵᴿᴰ').lower()
'ffñixffi5kaåébigbird'
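
Since the question is about idempotency, here is a minimal sketch of how the two steps above could be wrapped into one function. The names canonical_username and is_stable are hypothetical helpers, not part of any library. The sketch swaps lower() for str.casefold(), which is meant for caseless comparison, and normalizes a second time because case folding can produce text that is no longer in NFKC. This is an assumption-laden illustration, not a proof that the result is stable for every possible code point.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Hypothetical helper: fold a username to a caseless, NFKC-normalized key."""
    # Step 1: replace compatibility characters (ligatures, superscripts,
    # Roman numerals, ...) with their plain equivalents.
    normalized = unicodedata.normalize("NFKC", name)
    # Step 2: casefold() is like lower() but intended for caseless
    # comparison (for example, it maps the German sharp s to "ss").
    folded = normalized.casefold()
    # Step 3: case folding can leave text that is no longer in NFKC,
    # so normalize once more.
    return unicodedata.normalize("NFKC", folded)


def is_stable(name: str) -> bool:
    """Hypothetical test helper: True if canonicalising twice changes nothing."""
    once = canonical_username(name)
    return canonical_username(once) == once


# The superscript letters from the example above, written as escapes.
fancy = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"
print(canonical_username(fancy))  # prints: bigbird
print(is_stable(fancy))           # prints: True
```

A check like is_stable probably belongs in a test suite run over sample inputs rather than in the registration path. Restricting usernames to an explicit set of allowed characters after canonicalisation would narrow the cases that need checking further.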
Daniel Jonsson answered Nov 15 '22 02:11
