Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get non-ASCII filename from S3 notification event in Lambda

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.

This is evident when the filename contains spaces or non-ASCII characters.

For example, I have upload the following filename to S3:

my file řěąλλυ.txt

The notification is received as:

{ 
  "Records": [
    "s3": {
        "object": {
            "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
    }
  ]
}

I've tried to decode using:

key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')

but that yields:

my file ÅÄÄλλÏ.txt

Of course, when I then try to get the file from S3 using Boto, I get a 404 error.

like image 359
Alastair McCormack Avatar asked Sep 13 '16 08:09

Alastair McCormack


People also ask

How do I transfer files from S3 bucket to Lambda?

Add below Bucket Access policy to the IAM Role created in Destination account. Lambda function will assume the Role of Destination IAM Role and copy the S3 object from Source bucket to Destination. In the Lambda console, choose Create a Lambda function. Directly move to configure function.

Can Lambda read from S3?

The S3 object key and bucket name are passed into your Lambda function via the event parameter. You can then get the object from S3 and read its contents.


2 Answers

tl;dr

You need to convert the URL encoded Unicode string to a bytes str before un-urlparsing it and decoding as UTF-8.

For example, for an S3 object with the filename: my file řěąλλυ.txt:

>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to utf-8 encoded [byte] string. The key shouldn't contain any non-ASCII at this point, but UTF-8 will be safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previous url-escaped UTF-8 are now converted to UTF-8 bytes
# If you passed a Unicode object to unquote_plus, you'd have got a 
# Unicode with UTF-8 encoded bytes!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'

# Decodes key_utf-8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode points.

>>> type(key)
<type 'unicode'>

>>> print(key)
my file řěąλλυ.txt

Background

AWS have commited the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

The error you should've got from your decode() is:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)

The value of key is a Unicode. In Python 2.x you could decode a Unicode, even though it doesn't make sense. In Python 2.x to decode a Unicode, Python first tries to encode it to a [byte] str first before decoding it using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't contain the characters used.

Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.

like image 139
Alastair McCormack Avatar answered Oct 24 '22 17:10

Alastair McCormack


For python 3:

from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)

# will prints 'input/пустой.pdf'
like image 42
valex Avatar answered Oct 24 '22 18:10

valex