 

How to do multi-part upload with Python requests library AND unicode filename?

I'm trying to upload a file to JIRA using its REST API. Uploads are done as multi-part POST. The code works fine so long as the filenames are ASCII. If they contain any non-ASCII characters, JIRA gives a 500 error.

I've tried to correct that by encoding the filename in the POST operation and the file upload succeeds but JIRA displays the ASCII version of the encoded filename, rather than decoding the filename into the original UTF-8.

Here is a snippet of the Python code:

import urllib2
import requests
encoded_name = urllib2.quote(zd_attachment['file_name'].encode('utf8'))  # percent-encode the UTF-8 bytes of the name
files = {'file': (encoded_name, open('/tmp/%s' % zd_attachment['file_name'], 'rb'), zd_attachment['content_type'])}
mp_header = {'X-Atlassian-Token': 'nocheck'}
r = requests.post("%s/rest/api/2/issue/%s/attachments" % (jira_url, issue_id), headers=mp_header, files=files, auth=jira_auth)

Am I using the wrong method to encode the filename? Have I missed something else out in order to correctly handle the Unicode filename?

Philip Colmer asked Jan 08 '23 23:01



2 Answers

Urllib3 uses the wrong method of encoding for non-ASCII filenames (issue, pull request). A workaround is to create your own PreparedRequest or implement an auth handler and change the body of the request just before sending it. I have some examples in my blog post about this.
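A rough sketch of the "change the body just before sending" idea, in Python 3. It assumes the prepared body is a bytes object and that the offending parameter looks like filename*=utf-8''<percent-encoded name>; the helper name rewrite_filename_params is made up for illustration and is not part of requests or urllib3. Note that it rewrites every match in the body, so a file whose own contents happen to contain that pattern would be altered too.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches the RFC 2231 style parameter, e.g. filename*=utf-8''r%C3%A9sum%C3%A9.txt
_FILENAME_STAR = re.compile(rb"filename\*=utf-8''([^\r\n;]+)", re.IGNORECASE)


def rewrite_filename_params(body):
    """Replace filename*=utf-8''... with filename="<raw UTF-8 bytes>" (hypothetical helper)."""
    def repl(match):
        raw = unquote_to_bytes(match.group(1))  # undo the percent-encoding
        raw = raw.replace(b'"', b"%22")         # keep the quoted string well-formed
        return b'filename="' + raw + b'"'
    return _FILENAME_STAR.sub(repl, body)


# Possible use with requests (not executed here; url, files and session are assumed to exist):
#   prepared = requests.Request("POST", url, files=files).prepare()
#   prepared.body = rewrite_filename_params(prepared.body)
#   prepared.headers["Content-Length"] = str(len(prepared.body))
#   response = session.send(prepared)
```

The Content-Length header has to be recomputed because the rewritten body is usually shorter than the original.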

Sjoerd answered Jan 16 '23 21:01



Am I using the wrong method to encode the filename?

The bad news is, there isn't really any widely accepted method of encoding non-ASCII and other troublesome characters (like quotes) into filenames. The filename parameter of the Content-Disposition header is a thing of woeful unreliability on the web.

requests/urllib3 tries so hard to be good. If a filename contains non-ASCII characters it tries to encode them using the crazy filename*= header parameter scheme put forward for MIME by RFC 2231 (source: urllib3.fields.format_header_param).
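To make the scheme concrete, here is a small Python 3 sketch of what that kind of parameter looks like. It is an approximation written for illustration, not urllib3's actual code, and the function name rfc2231_param is an assumption.

```python
from urllib.parse import quote


def rfc2231_param(name, value, charset="utf-8"):
    """Format a header parameter, falling back to name*=charset''percent-encoded for non-ASCII."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII value: emit the extended form with an explicit charset.
        return "%s*=%s''%s" % (name, charset, quote(value, safe="", encoding=charset))
    # Plain ASCII value: the ordinary quoted form is enough.
    return '%s="%s"' % (name, value)


print(rfc2231_param("filename", "report.txt"))
print(rfc2231_param("filename", "r\u00e9sum\u00e9.txt"))
```

For an ASCII name this yields the ordinary filename="report.txt"; for the accented name it yields filename*=utf-8''r%C3%A9sum%C3%A9.txt, with no plain filename parameter at all.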

This is in fact the right thing to do for the Content-Disposition header in HTTP file download responses (per RFC 5987), and some browsers even support that now, but form submissions aren't covered by this.

Nothing on the server side supports this form of encoding, so I suspect JIRA is probably seeing a file upload field without a filename (because it only has a filename*) and throwing its toys out of the pram.

HTML5, which has taken over definition of the multipart/form-data media type, now disowns this encoding scheme:

User agents must not use the RFC 2231 encoding suggested by RFC 2388.

Instead:

File names included in the generated multipart/form-data resource (as part of file fields) must use the character encoding selected above

So you should be passing in a Unicode string for the filename, and urllib3 should encode that to the same encoding used for the rest of the form (e.g. UTF-8). That you can't is IMO a bug that should be reported against urllib3. For the moment you may have to settle for ASCII-only filenames.
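For illustration, this is roughly what an HTML5-style part would look like if built by hand in Python 3, with the filename written as raw UTF-8 bytes. It is a minimal sketch, not what requests or urllib3 do; build_file_part is a hypothetical helper, and whether JIRA accepts the result is an assumption that would need testing.

```python
def build_file_part(field, filename, content, content_type, boundary):
    """Build a single-file multipart/form-data body with a raw UTF-8 filename (hypothetical helper)."""
    # HTML5 allows quotes in the name to be approximated as %22.
    safe_name = filename.replace('"', "%22")
    head = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        "Content-Type: %s\r\n"
        "\r\n" % (boundary, field, safe_name, content_type)
    ).encode("utf-8")  # same encoding as the rest of the form
    tail = ("\r\n--%s--\r\n" % boundary).encode("utf-8")
    return head + content + tail


body = build_file_part("file", "r\u00e9sum\u00e9.txt", b"hello", "text/plain", "BOUNDARY")
print(body)
```

Such a body could then be posted with data=body and a Content-Type header of multipart/form-data; boundary=BOUNDARY, alongside the X-Atlassian-Token header from the question. In real use the boundary should be a random string that cannot occur in the file contents.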

Aside: unfortunately we are still far from salvation here. As well as legacy browsers and their arbitrary encodings, HTML5 still keeps life nice and woolly by allowing browsers to apply additional arbitrary unrecoverable mangling:

the precise name may be approximated if necessary (e.g. newlines could be removed from file names, quotes could be changed to "%22", and characters not expressible in the selected character encoding could be replaced by other characters).

bobince answered Jan 16 '23 21:01
