I am trying to learn Python, so I thought I would start by querying IMDB to check my movie collection against it, which was going well 😊
What I am stuck on is how to handle special characters in names, and encode the name to something a URL will respect.
For example I have the movie Brüno
If I encode the string using urllib.parse.quote
I get - Bru%CC%88no
which means when I query IMDB using OMDBAPI it fails to find the movie. If I do the search via the OMDBAPI site, they encode the name as Br%C3%BCno
and this search works.
I am assuming that the encode is using a different standard, but I can’t work out what I need to do
A URL is composed of a limited set of characters belonging to the US-ASCII character set. Digits (0-9), letters (A-Z, a-z), and a few special characters ( "-" , "." , "_" , "~" ) are unreserved and never need encoding.
Reserved characters have a special role inside a URL, and when they are not used in that role they must be percent-encoded. They are: ':' , '/' , '?' , '#' , '[' , ']' , '@' , '!' , '$' , '&' , "'" , '(' , ')' , '*' , '+' , ',' , ';' , '=' . The '%' character itself must also be encoded. Anything outside US-ASCII, such as "ü", is first encoded as UTF-8 bytes, and each byte is then percent-encoded.
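As a rough illustration of these rules, here is how Python's urllib.parse.quote treats each group. The sample strings are made up for the example, and safe="" is passed only so that "/" is encoded too (by default quote leaves "/" alone).

```python
from urllib.parse import quote

# Unreserved characters pass through unchanged.
unreserved = quote("abc-XYZ_0.9~")

# safe="" asks quote to encode every reserved character, including "/".
reserved = quote("a&b=c/d", safe="")

# Non-ASCII characters become UTF-8 bytes, one %XX per byte.
non_ascii = quote("\u00fc")  # the single precomposed character "ü"

print(unreserved)
print(reserved)
print(non_ascii)
```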
Both results use the same encoding (UTF-8); the two strings differ in their Unicode normalization.
>>> import unicodedata
>>> "Brüno".encode("utf-8")
b'Bru\xcc\x88no'
>>> unicodedata.normalize("NFC", "Brüno").encode("utf-8")
b'Br\xc3\xbcno'
Some graphemes (things you see as one "character"), especially those with diacritics, can be built from different sequences of characters. An "ü" can be either a "u" followed by a combining diaeresis, or the single character "ü" itself (the combined form). Combined forms don't exist for every combination of letter and diacritic, but they do for the commonly used ones (those that occur in common languages).
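To see the two spellings side by side, a small sketch like the following lists the code points behind each one. The variable names are just for illustration, and the strings are written with escapes so that copying the snippet cannot silently change their form.

```python
import unicodedata

decomposed = "u\u0308"   # "u" followed by a combining diaeresis
precomposed = "\u00fc"   # the single character "ü"

for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    # Show how many code points the string holds and what each one is called.
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)

# The two strings render the same but do not compare equal.
print(decomposed == precomposed)
```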
Unicode normalization rewrites the characters that make up each grapheme into either their combined or their separate form. The normalization method "NFC", or Normalization Form Canonical Composition, combines characters as far as possible.
In comparison, the other main form, Normalization Form Canonical Decomposition, or "NFD", will produce your version:
>>> unicodedata.normalize("NFD", "Brüno").encode("utf-8")
b'Bru\xcc\x88no'
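Putting this together for the original question, one possible fix is to normalize the title to NFC before passing it to urllib.parse.quote. This is a minimal sketch: quote_title is a hypothetical helper name, and it assumes the search service expects the composed form, as the working Br%C3%BCno query suggests.

```python
import unicodedata
from urllib.parse import quote


def quote_title(title: str) -> str:
    """Percent-encode a title after composing it to NFC (hypothetical helper)."""
    return quote(unicodedata.normalize("NFC", title))


# The decomposed spelling from the question, written with an explicit escape.
print(quote_title("Bru\u0308no"))  # Br%C3%BCno
```

Titles that are already composed pass through unchanged, so the helper can be applied to every title in the collection.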