
Get non-ASCII filename from S3 notification event in Lambda

Tags:

python-unicode

utf-8

amazon-s3

python-2.7

aws-lambda

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.

This is evident when the filename contains spaces or non-ASCII characters.

For example, I uploaded a file with the following name to S3:

my file řěąλλυ.txt

The notification is received as:

{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}

I've tried to decode using:

key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')

but that yields:

my file ÅÄÄÎ»Î»Ï.txt

Of course, when I then try to get the file from S3 using Boto, I get a 404 error.

asked Sep 13 '16 08:09

Alastair McCormack

2 Answers

tl;dr

You need to encode the URL-encoded Unicode string to a byte str before URL-unquoting it, and then decode the result as UTF-8.

For example, for an S3 object with the filename my file řěąλλυ.txt:

>>> import urllib

# Encode the Unicode string to a UTF-8 [byte] str. The key shouldn't contain
# any non-ASCII characters at this point, but UTF-8 is the safer choice.
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
>>> utf8_urlencoded_key
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

# The URL-escaped sequences are now converted to raw UTF-8 bytes.
# Had you passed a Unicode object to unquote_plus, you'd have got back a
# Unicode object containing UTF-8 encoded bytes!
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
>>> key_utf8
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'

# Decode key_utf8 to a Unicode string. Note the u prefix in the output:
# the UTF-8 bytes have been decoded to Unicode code points.
>>> key = key_utf8.decode('utf-8')
>>> key
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'

>>> type(key)
<type 'unicode'>

>>> print(key)
my file řěąλλυ.txt

Background

AWS have committed the cardinal sin of changing the default encoding; see https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

The error you should've got from your decode() is:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)

The value of key is a unicode object. In Python 2.x you can call decode() on a unicode object, even though it doesn't make sense: Python first tries to encode it to a [byte] str using the default encoding, and only then decodes it using the encoding you gave. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used.

Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.
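
For readers on Python 3, the garbled output from the question can be reproduced with a rough sketch. It assumes that Python 2's unquote_plus, when handed a unicode object, turns each %XX escape into the code point with that value, which amounts to decoding the bytes as Latin-1; the encoding="latin-1" argument below is only there to imitate that behaviour and is not something you would want in real code.

```python
from urllib.parse import unquote_plus

raw_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Imitates the Python 2 result: every escaped byte becomes one code point.
mojibake = unquote_plus(raw_key, encoding="latin-1")

# Treating the escaped bytes as UTF-8 (the Python 3 default) gives the real name.
correct = unquote_plus(raw_key)

# The garbled string still carries the right bytes, just mislabelled as characters.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake))
print(correct)
print(repaired == correct)

# str objects have no decode() method on Python 3.
print(hasattr(correct, "decode"))
```

The garbled string has one character per UTF-8 byte, which is why the name in the question comes out longer and unreadable rather than raising an error.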

answered Oct 24 '22 17:10

Alastair McCormack

For Python 3:

from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)

# prints 'input/пустой.pdf'
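
Applied to a whole notification event, the same call might be used as in the sketch below. The names decoded_keys and sample_event are made up for illustration, and the handler does nothing except return the decoded keys; a real function would go on to fetch the objects.

```python
from urllib.parse import unquote_plus


def decoded_keys(event):
    """Return the URL-decoded object key of every S3 record in the event."""
    keys = []
    for record in event.get("Records", []):
        raw_key = record["s3"]["object"]["key"]
        # unquote_plus turns '+' into a space and decodes %XX escapes as UTF-8.
        keys.append(unquote_plus(raw_key))
    return keys


def lambda_handler(event, context):
    # Hypothetical handler: it only collects the decoded keys.
    return decoded_keys(event)


# Hand-written event shaped like the one in the question (an assumption,
# trimmed to the fields used here).
sample_event = {
    "Records": [
        {"s3": {"object": {"key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"}}}
    ]
}
print(lambda_handler(sample_event, None))
```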

answered Oct 24 '22 18:10

valex
