Encoding special characters for passing to a URL

I am trying to learn Python, so I thought I would start by checking my movie collection against IMDB, which was going well 😊

What I am stuck on is how to handle special characters in names, and encode the name to something a URL will respect.

For example, I have the movie Brüno.

If I encode the string using urllib.parse.quote, I get Bru%CC%88no, and when I query IMDB through OMDBAPI with that value it fails to find the movie. If I do the search via the OMDBAPI site, they encode the name as Br%C3%BCno, and this search works.

I assume the two are using different encoding standards, but I can't work out what I need to do.

PhilC asked Mar 22 '19 14:03




1 Answer

Both are using the same encoding (UTF-8), but different Unicode normalizations.

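>>> import unicodedata
>>> # "u" + U+0308 COMBINING DIAERESIS (decomposed), as in your string
>>> "Bru\u0308no".encode("utf-8")
b'Bru\xcc\x88no'
>>> # NFC composes the pair into the single code point U+00FC
>>> unicodedata.normalize("NFC", "Bru\u0308no").encode("utf-8")
b'Br\xc3\xbcno'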

Some graphemes (things you see as one "character"), especially those with diacritics, can be built from different sequences of code points. An "ü" can be either a "u" followed by a combining diaeresis, or the single character "ü" itself (the precomposed form). Precomposed forms don't exist for every combination of letter and diacritic, but they do for the commonly used ones (those that occur in widely used languages).
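To make the two spellings of "ü" visible, here is a small sketch that lists the code points behind each one. The helper name `describe` is made up for illustration; `unicodedata.name` and `ord` are standard library calls.

```python
import unicodedata

decomposed = "u\u0308"   # "u" followed by a combining diaeresis
precomposed = "\u00fc"   # the single precomposed character

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(decomposed))       # two code points
print(describe(precomposed))      # one code point
print(decomposed == precomposed)  # False: they render alike but differ as strings
```

Because the two strings are not equal, anything that works on the raw code points, such as percent-encoding, will produce different output for them.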

Unicode normalization rewrites every grapheme consistently as either precomposed or separate characters. The normalization form "NFC", or Normalization Form Canonical Composition, combines characters as far as possible.

By contrast, the other main form, "NFD" (Normalization Form Canonical Decomposition), splits characters apart as far as possible and produces your version:

>>> unicodedata.normalize("NFD", "Brüno").encode("utf-8")
b'Bru\xcc\x88no'
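Applied to the original problem, one option is to normalize the title to NFC before percent-encoding it. This is a minimal sketch: `quote_nfc` is a hypothetical helper name, and the assumption that OMDBAPI matches on the NFC form rests only on the working URL quoted in the question.

```python
import unicodedata
from urllib.parse import quote

def quote_nfc(text):
    """Normalize text to NFC, then percent-encode it as UTF-8."""
    return quote(unicodedata.normalize("NFC", text))

# Decomposed title: "u" followed by U+0308 COMBINING DIAERESIS
title = "Bru\u0308no"

print(quote(title))      # Bru%CC%88no (the failing form from the question)
print(quote_nfc(title))  # Br%C3%BCno (the form the OMDBAPI site produces)
```

Normalizing once, at the point where titles are read in, keeps the rest of the code from having to care which form a given file name or data source happened to use.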
L3viathan answered Oct 17 '22 20:10