What is the best way to get a canonical representation of a username that is idempotent?
I want to avoid having the same issue as Spotify: http://labs.spotify.com/2013/06/18/creative-usernames/
I'm looking for a good library to do this in Python. I would prefer not to do what Spotify ended up doing (running the canonicalisation twice to test whether it is idempotent), and importing Twisted into my project is a tad overkill. Is there a stand-alone library for this?
Would it be preferable to use email addresses instead of usernames? How do major sites/companies deal with this?
First you should read Wikipedia's article on Unicode equivalence. It explains the caveats and the normalization forms available (NFC, NFD, NFKC, NFKD) for representing a Unicode string in its canonical form.
Then you can use Python's built-in module unicodedata to do the normalization of the Unicode string to your preferred normalization form.
A code example:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', u'ffñⅨffi⁵KaÅéᴮᴵᴳᴮᴵᴿᴰ')
'ffñIXffi5KaÅéBIGBIRD'
>>> unicodedata.normalize('NFKC', u'ffñⅨffi⁵KaÅéᴮᴵᴳᴮᴵᴿᴰ').lower()
'ffñixffi5kaåébigbird'
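If you want to wrap this up as a function, here is a minimal sketch for Python 3. The names `canonical_username` and `is_stable` are hypothetical helpers, not part of any library. It uses `str.casefold()` instead of `lower()` (case folding is more aggressive, e.g. it maps 'ß' to 'ss') and normalizes once more after folding, since folding can produce characters that are not in normalized form. I am not claiming this is provably idempotent for every possible input, so the second helper lets you check a given name:

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Hypothetical helper: NFKC-normalize, case-fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Case folding may yield text that is no longer normalized,
    # so apply NFKC a second time.
    return unicodedata.normalize("NFKC", folded)


def is_stable(name: str) -> bool:
    """Hypothetical check: canonicalizing the canonical form changes nothing."""
    once = canonical_username(name)
    return canonical_username(once) == once
```

Store and compare only the canonical form, and consider rejecting any sign-up where `is_stable(name)` is false rather than trying to repair the name.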