
How to decode encodeURIComponent in GAE (python)?

I have a unicode string that was encoded on the client side using JS encodeURIComponent.

If I use the following in Python locally, I get the expected result:

>>> import urllib
>>> urllib.unquote("Foo%E2%84%A2%20Bar").decode("utf-8")
u'Foo\u2122 Bar'

But when I run this in Google App Engine, I get:

Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/_webapp25.py", line 703, in __call__
    handler.post(*groups)
  File "/base/data/home/apps/s~kaon-log/2.357769827131038147/main.py", line 143, in post
    path_uni = urllib.unquote(h.path).decode('utf-8')
  File "/base/python_runtime/python_dist/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)

I'm still using Python 2.5, in case that makes a difference. What am I missing?

Joshua Smith asked Mar 26 '12 21:03



1 Answer

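My guess is that h.path is a unicode object, in which case urllib.unquote returns a unicode object as well. In Python 2, calling decode on a unicode object first converts it to str using the default encoding (ASCII). The unquoted %E2%84%A2 escapes become the non-ASCII characters at positions 3-5, so that implicit conversion fails, and that is where the "'ascii' codec can't encode" exception comes from.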

This reproduces the error:

>>> urllib.unquote(u"Foo%E2%84%A2%20Bar").decode("utf-8")
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)

This should work:

urllib.unquote(h.path.encode('utf-8')).decode("utf-8")

A Stack Overflow thread explains why urllib.unquote does not behave as expected with unicode input: "How to unquote a urlencoded unicode string in python?"
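If this comes up in more than one handler, the same fix could be wrapped in a small helper. This is only a sketch: the name decode_uri_component is hypothetical, and the fallback import of urllib.parse.unquote_to_bytes is an assumption added so the snippet also runs on Python 3, which the question does not use.

```python
try:
    from urllib import unquote  # Python 2, the runtime in the question
except ImportError:
    # Assumption: Python 3 fallback. unquote_to_bytes returns bytes,
    # mirroring what Python 2's unquote returns for a str argument.
    from urllib.parse import unquote_to_bytes as unquote

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def decode_uri_component(value):
    """Hypothetical helper: undo JavaScript's encodeURIComponent."""
    if isinstance(value, TEXT_TYPE):
        # Encode first so unquote only ever sees a byte string.
        value = value.encode("utf-8")
    # The %xx escapes are now raw UTF-8 bytes; decode them explicitly.
    return unquote(value).decode("utf-8")
```

With this, decode_uri_component(u"Foo%E2%84%A2%20Bar") gives u'Foo\u2122 Bar' whether the input arrives as a unicode object or as a byte string.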

Ski answered Sep 21 '22 17:09