JavaScript unescape() vs. Python urllib.unquote()

From reading various posts, it seems like JavaScript's unescape() is equivalent to Python's urllib.unquote(). However, when I test both I get different results:

In browser console:

unescape('%u003c%u0062%u0072%u003e');

output: <br>

In Python interpreter:

import urllib
urllib.unquote('%u003c%u0062%u0072%u003e')

output: %u003c%u0062%u0072%u003e

I would expect Python to also return <br>. Any ideas as to what I'm missing here?

Thanks!


asked Apr 18 '14 17:04

Michael Gradek

1 Answer

%uxxxx is a non-standard URL encoding scheme that is not supported by urllib.parse.unquote() (Python 3) or urllib.unquote() (Python 2).

It was only ever part of ECMAScript ECMA-262 3rd edition; the format was rejected by the W3C and was never a part of an RFC.
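To make the difference concrete, here is a small sketch (Python 3 only) that feeds both a standard percent-encoded string and the question's %uxxxx string to urllib.parse.unquote(). The variable names are illustrative only.

```python
from urllib.parse import unquote  # Python 3; Python 2 exposes this as urllib.unquote

# Standard percent-encoding (%XX) is decoded.
standard = unquote('%3Cbr%3E')

# The non-standard %uXXXX form is not recognised and passes through unchanged.
nonstandard = unquote('%u003c%u0062%u0072%u003e')

print(standard)     # <br>
print(nonstandard)  # %u003c%u0062%u0072%u003e
```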

You could use a regular expression to convert such codepoints:

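import re

try:
    unichr  # only in Python 2
except NameError:
    unichr = chr  # Python 3

quoted = '%u003c%u0062%u0072%u003e'
# Replace each %uxxxx (or %uxx) escape with the character for that hex code point
re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)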

This decodes both the %uxxxx and the %uxx forms that ECMAScript 3rd edition can decode.

Demo:

>>> import re
>>> quoted = '%u003c%u0062%u0072%u003e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), quoted)
'<br>'
>>> altquoted = '%u3c%u0062%u0072%u3e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), altquoted)
'<br>'
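If the input may mix %u escapes with ordinary %XX escapes, the same regular-expression approach can be wrapped in a small function. The following is only a sketch for Python 3: js_unescape and _ESCAPE_RE are hypothetical names, not part of any library, and it assumes each %XX escape names a single code point (as JavaScript's unescape() treats it) rather than one byte of a UTF-8 sequence.

```python
import re

# %uXXXX or %uXX (the forms handled by the regex above), or a plain %XX escape.
_ESCAPE_RE = re.compile(r'%u([0-9a-fA-F]{4}|[0-9a-fA-F]{2})|%([0-9a-fA-F]{2})')


def js_unescape(text):
    """Hypothetical helper: replace each escape with the code point it names."""
    def _replace(match):
        # Exactly one of the two groups is set, depending on which form matched.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(_replace, text)


print(js_unescape('%u003c%u0062%u0072%u003e'))  # <br>
print(js_unescape('%3Cbr%3E'))                  # <br>
```

Anything that does not match one of these forms, such as a lone % sign, is left as it is. Because %XX is treated as a code point here, this differs from urllib.parse.unquote(), which by default decodes %XX sequences as UTF-8 bytes.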

However, you should avoid using this encoding altogether if possible.
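If you control the side that produces the string, standard UTF-8 percent-encoding is the usual alternative (in JavaScript, encodeURIComponent() produces it). As a rough sketch in Python 3, a round trip through urllib.parse.quote() and unquote() looks like this; the sample string is made up for illustration.

```python
from urllib.parse import quote, unquote

original = '<br>é'

# safe='' percent-encodes every reserved character, including '/'.
encoded = quote(original, safe='')
decoded = unquote(encoded)

print(encoded)              # %3Cbr%3E%C3%A9
print(decoded == original)  # True
```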


answered Oct 23 '22 09:10

Martijn Pieters


Tags:

python

javascript

escaping

urllib