I have a string (base64 decoded here) that looks like this:
----------------------------212550847697339237761929
Content-Disposition: form-data; name="preferred_name"; filename="file1.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 testing123FILE1}
----------------------------212550847697339237761929
Content-Disposition: form-data; name="to_process"; filename="file2.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 testing123FILE212341234}
----------------------------212550847697339237761929--
I generate this on a simple webpage that uploads a couple of files to an AWS Lambda script through a PUT request with the API Gateway. It should be noted that what I get from the API Gateway is a Base64 string that I then decode into the string above.
The string above is the data that my Lambda script receives from the API Gateway. What I would like to do is parse this string with Python 2.7 in order to retrieve the data it contains. I've experimented with the cgi module and its cgi.parse_multipart() function; however, I cannot find a way to convert a string into the required arguments. Any tips?
The Multipart parser is used for working with the request body in the multipart format. This parser creates a hash table where the names of the request body parameters are the keys and the values of the corresponding parameters are the hash table values.
Multipart form data: The ENCTYPE attribute of the <form> tag specifies how the form data is encoded. multipart/form-data is one of the encodings available for HTML forms, and it is the one required when the form uploads files. It sends the form data to the server in multiple parts, one for each field or file.
Each item in a multipart message is separated by a boundary marker. WebKit-based browsers put "WebKitFormBoundary" in the name of that boundary. The Network tab of the developer tools does not show file data in a multipart message report, because it can be too large.
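To make the boundary idea concrete, here is a minimal sketch that splits a multipart body on its delimiter lines. The function name split_parts and the sample boundary value are made up for illustration, and the sketch assumes CRLF line endings and a well-formed body; it is not a full parser.

```python
def split_parts(body, boundary):
    """Split a multipart body (bytes) into (raw_headers, content) pairs."""
    delimiter = b"--" + boundary  # each delimiter line is "--" + boundary
    parts = []
    # Element 0 is whatever precedes the first delimiter (the preamble); skip it.
    for chunk in body.split(delimiter)[1:]:
        if chunk.startswith(b"--"):
            break  # "--boundary--" marks the end of the message
        # Drop the CRLF that ends the delimiter line and the CRLF before the next delimiter.
        chunk = chunk[2:-2]
        # A blank line separates the part headers from the part content.
        raw_headers, _, content = chunk.partition(b"\r\n\r\n")
        parts.append((raw_headers.decode("ascii"), content))
    return parts


# Hypothetical sample body using a WebKit-style boundary.
boundary = b"----WebKitFormBoundaryABC123"
body = (
    b"------WebKitFormBoundaryABC123\r\n"
    b'Content-Disposition: form-data; name="note"\r\n'
    b"\r\n"
    b"hello\r\n"
    b"------WebKitFormBoundaryABC123--\r\n"
)
for headers, content in split_parts(body, boundary):
    print(headers, content)
```

Working on bytes rather than str keeps file content intact, at the cost of having to decode the header text explicitly.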
Class MultipartRequest. A utility class to handle multipart/form-data requests, the kind of requests that support file uploads. This class emulates the interface of HttpServletRequest, making it familiar to use. It uses a "push" model where any incoming files are read and saved directly to disk in the constructor.
Comment: is it robust and spec compliant?
As long as your data meets these preconditions:
Comment: What if the content is binary like a JPEG stream?
This is likely to break, because string methods are used and the content is read with .readline(), which depends on newlines.
Therefore, decoding from Base64 and then unpacking the multipart data is the wrong approach!
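A small illustration of why binary content is a problem for string-based parsing. The byte sequence below is made up to resemble the start of a JPEG stream; the sketch only shows that such bytes cannot be decoded as UTF-8 text and that line-based reading cuts them at an arbitrary 0x0A byte.

```python
import io

# Made-up bytes resembling the start of a JPEG stream, with a 0x0A byte in the middle.
jpeg_like = b"\xff\xd8\xff\xe0\x00\x10JFIF\n\x00\x01"

# Turning the bytes into text fails, because 0xFF is not valid UTF-8.
try:
    jpeg_like.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Line-based reading splits the binary data wherever a newline byte happens to occur.
chunks = io.BytesIO(jpeg_like).readlines()
print(decoded_ok, len(chunks))
```

The pieces can still be joined back together without loss as long as they stay bytes; the trouble starts once the data is treated as text.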
Comment: If there's a version reusing a common library
If you are able to provide your data as a standard MIME message, you can use the following:
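import email

# mimeHeader is not defined in this snippet: it must hold the MIME headers, including the
# Content-Type line that declares the multipart boundary, followed by a blank line.
msg = email.message_from_string(mimeHeader + data)
print('is_multipart:{}'.format(msg.is_multipart()))

for part in msg.walk():
    # Skip the multipart container itself; only the leaf parts carry payloads.
    if part.get_content_maintype() == 'multipart':
        continue

    filename = part.get_filename()
    payload = part.get_payload(decode=True)
    print('{} filename:{}\n{}'.format(part.get_content_type(), filename, payload))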
Output:
is_multipart:True
application/rtf filename:file1.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
application/rtf filename:file2.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
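The snippet above uses mimeHeader without showing how it is built. Here is one possible sketch that starts from the Base64 string and stays in bytes throughout. The function name parse_multipart_b64 and the header block are my own assumptions, and the boundary is read from the first delimiter line of the body; with API Gateway you would normally take it from the request's Content-Type header instead. It needs Python 3 (email.message_from_bytes), and I have not checked it against every kind of binary payload.

```python
import base64
from email import message_from_bytes


def parse_multipart_b64(encoded):
    """Return (field name, filename, payload bytes) for each part of a Base64-encoded body."""
    body = base64.b64decode(encoded)
    # The first line of the body is the delimiter: "--" followed by the boundary.
    first_line = body.split(b"\n", 1)[0].strip()
    boundary = first_line[2:].decode("ascii")
    # Assumed header block: just enough for the email parser to see a multipart message.
    mime_header = (
        "MIME-Version: 1.0\r\n"
        'Content-Type: multipart/form-data; boundary="{}"\r\n'
        "\r\n"
    ).format(boundary).encode("ascii")
    msg = message_from_bytes(mime_header + body)
    parts = []
    for part in msg.walk():
        # Skip the multipart container itself; only the leaf parts carry payloads.
        if part.get_content_maintype() == "multipart":
            continue
        name = part.get_param("name", header="content-disposition")
        parts.append((name, part.get_filename(), part.get_payload(decode=True)))
    return parts
```

Each tuple carries the form field name next to the filename, so the "preferred_name" and "to_process" fields from the question could be told apart.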
Question: Parse a string of multipart data
Pure Python Solution, for instance:
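import io
import re

# `data` is the decoded multipart string shown in the question.
with io.StringIO(data) as fh:
    parts = []
    part_line = []
    part_fname = None
    new_part = None  # delimiter line, taken from the first line of the data
    robj = re.compile(r'.+filename="(.+)"')

    while True:
        line = fh.readline()
        if not line:
            break

        if not new_part:
            # Strip the line ending (LF or CRLF) so the closing delimiter also matches.
            new_part = line.rstrip('\r\n')

        if line.startswith(new_part):
            # A delimiter line ends the previous part; store what was collected.
            if part_line:
                parts.append({'filename': part_fname, 'content': ''.join(part_line)})
                part_line = []

            # Read the part headers up to the blank line and pick out the filename.
            while line and line.strip():
                _match = robj.match(line)
                if _match:
                    part_fname = _match.groups()[0]
                line = fh.readline()
        else:
            part_line.append(line)

for part in parts:
    print(part)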
Output:
{'filename': 'file1.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
{'filename': 'file2.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
Tested with Python: 3.4.2