Upload a large XML file with Python Requests library

I'm trying to replace curl with Python & the requests library. With curl, I can upload a single XML file to a REST server with the curl -T option. I have been unable to do the same with the requests library.

A basic scenario works:

payload = '<person test="10"><first>Carl</first><last>Sagan</last></person>'
headers = {'content-type': 'application/xml'}
r = requests.put(url, data=payload, headers=headers, auth=HTTPDigestAuth("*", "*"))

When I change payload to a bigger string by opening an XML file, the .put method hangs (I use the codecs library to get a proper unicode string). For example, with a 66KB file:

xmlfile = codecs.open('trb-1996-219.xml', 'r', 'utf-8')
headers = {'content-type': 'application/xml'}
content = xmlfile.read()
r = requests.put(url, data=content, headers=headers, auth=HTTPDigestAuth("*", "*"))

I've been looking into using the multipart option (files), but the server doesn't seem to like that.

So I was wondering if there is a way to simulate curl -T behaviour in Python requests library.

UPDATE 1: The program hangs in TextMate, but raises a UnicodeEncodeError on the command line. That seems to be the real problem. So the question becomes: is there a way to send unicode strings to a server with the requests library?

UPDATE 2: Thanks to Martijn Pieters' comment, the UnicodeEncodeError went away, but a new issue turned up. With a literal (ASCII) XML string, logging shows the following lines:

2012-11-11 15:55:05,154 INFO Starting new HTTP connection (1): my.ip.address
2012-11-11 15:55:05,294 DEBUG "PUT /v1/documents?uri=/example/test.xml HTTP/1.1" 401 211
2012-11-11 15:55:05,430 DEBUG "PUT /v1/documents?uri=/example/test.xml HTTP/1.1" 201 0

Seems the server always bounces the first authentication attempt (?) but then accepts the second one.

With a file object (open('trb-1996-219.xml', 'rb')) passed to data, the logfile shows:

2012-11-11 15:50:54,309 INFO Starting new HTTP connection (1): my.ip.address
2012-11-11 15:50:55,105 DEBUG "PUT /v1/documents?uri=/example/test.xml HTTP/1.1" 401 211
2012-11-11 15:51:25,603 WARNING Retrying (0 attempts remain) after connection broken by 'BadStatusLine("''",)': /v1/documents?uri=/example/test.xml

So, first attempt is blocked as before, but no second attempt is made.

According to Martijn Pieters (below), the second issue can be explained by a faulty server (empty line). I will look into this, but if someone has a workaround (apart from using curl) I wouldn't mind hearing it.

And I am still surprised that the requests library behaves so differently for a small string and a file object. Isn't the file object serialized before it gets to the server anyway?

M_breeb asked Nov 11 '12 13:11


2 Answers

To PUT large files, don't read them into memory. Simply pass the file as the data keyword:

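import requests
from requests.auth import HTTPDigestAuth

headers = {'content-type': 'application/xml'}
with open('trb-1996-219.xml', 'rb') as xmlfile:  # binary mode: raw bytes, closed when done
    r = requests.put(url, data=xmlfile, headers=headers, auth=HTTPDigestAuth("*", "*"))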

Moreover, you were opening the file as unicode (decoding it from UTF-8). Since you are sending it to a remote server, you need raw bytes, not unicode values, so you should open the file in binary mode instead.
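To make the difference between the two modes concrete, here is a minimal sketch in Python 3 syntax. The sample XML and the temporary file are made up for illustration and stand in for the real document; nothing is sent over the network.

```python
import os
import tempfile

# Hypothetical sample document containing a non-ASCII character.
xml_text = u'<person><first>Zo\u00eb</first></person>'

# Write it to a temporary file as UTF-8 so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as tmp:
    tmp.write(xml_text.encode('utf-8'))
    path = tmp.name

# Binary mode returns the bytes exactly as they are stored on disk.
with open(path, 'rb') as f:
    raw_bytes = f.read()

# Text mode decodes those bytes into a unicode string.
with open(path, 'r', encoding='utf-8') as f:
    decoded_text = f.read()

os.remove(path)

# Encoding the unicode string again gives back the bytes from binary mode.
same = decoded_text.encode('utf-8') == raw_bytes
print(type(raw_bytes).__name__, type(decoded_text).__name__, same)
```

If you already hold a unicode string and cannot reopen the file, calling .encode('utf-8') on it before passing it as data should have the same effect.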

Martijn Pieters answered Oct 26 '22 09:10


Digest authentication always requires you to make at least two requests to the server. The first request doesn't contain any authentication data. It fails with a 401 "Authorization required" response code and a digest challenge (called a nonce) to be used for hashing your password etc. (the exact details don't matter here). The nonce is then used to make a second request to the server, containing your credentials hashed with the challenge.
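For the curious, a rough sketch of why the challenge has to arrive first. It uses the simplest variant of the scheme (MD5, no qop, as in the original digest specification), which is an assumption about the server; the credentials, realm, URI and nonce values below are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a text string."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Compute the 'response' value for the simplest digest variant (no qop)."""
    ha1 = md5_hex('%s:%s:%s' % (username, realm, password))  # who you are
    ha2 = md5_hex('%s:%s' % (method, uri))                   # what you request
    # The server's nonce is mixed in, so it must be known before hashing.
    return md5_hex('%s:%s:%s' % (ha1, nonce, ha2))


# Hypothetical values; the nonce would come from the 401 response.
first = digest_response('user', 'example-realm', 'secret',
                        'PUT', '/v1/documents', 'nonce-from-first-401')
second = digest_response('user', 'example-realm', 'secret',
                         'PUT', '/v1/documents', 'a-different-nonce')
print(first != second)
```

The hash depends on the nonce, so the client cannot build a valid Authorization header until it has seen a challenge from the server.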

The problem is in this two-step authentication: your large file was already sent with the first, unauthorized request (sent in vain), but on the second request the file object is already at the EOF position. Since the file size was also sent in the Content-Length header of the second request, this causes the server to wait for a file that will never be sent.
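The EOF effect is easy to reproduce without any server. In this small sketch an in-memory io.BytesIO object stands in for the opened XML file, and the content is a made-up example:

```python
import io

# Stand-in for the opened XML file; the content is a made-up example.
body = io.BytesIO(b'<person><first>Carl</first></person>')

first_attempt = body.read()   # the unauthenticated request consumes the file
second_attempt = body.read()  # the retry starts at EOF and gets nothing

body.seek(0)                  # rewinding makes the content available again
after_rewind = body.read()

print(len(first_attempt), len(second_attempt), len(after_rewind))
```

The second read comes back empty even though the object still "knows" its full size, which matches the mismatch between the Content-Length header and the body described above.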

You could solve it by using a requests Session and first making a simple request for authentication purposes (say, a GET request). Then make a second PUT request containing the actual payload, using the same digest challenge from the first request.

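import requests
from requests.auth import HTTPDigestAuth

sess = requests.Session()
sess.auth = HTTPDigestAuth("*", "*")
sess.get(url)  # cheap first request that absorbs the 401 digest challenge
headers = {'content-type': 'application/xml'}
# Binary mode, so the body is sent as raw bytes rather than decoded unicode
with open('trb-1996-219.xml', 'rb') as xmlfile:
    sess.put(url, data=xmlfile, headers=headers)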

dieterg answered Oct 26 '22 09:10