
Parse a string of multipart data

I have a string (base64 decoded here) that looks like this:

----------------------------212550847697339237761929
Content-Disposition: form-data; name="preferred_name"; filename="file1.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE1}
----------------------------212550847697339237761929
Content-Disposition: form-data; name="to_process"; filename="file2.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE212341234}
----------------------------212550847697339237761929--

I generate this on a simple web page that uploads a couple of files to an AWS Lambda script through a PUT request via API Gateway. Note that what I get from API Gateway is a Base64 string, which I then decode into the string above.

The string above is the data that my Lambda script receives from API Gateway. I would like to parse this string with Python 2.7 to retrieve the data it contains. I've experimented with the cgi module and its cgi.parse_multipart() function, but I cannot find a way to convert a string into the required arguments. Any tips?

Asked by Tanishq dubey, Jul 11 '17 02:07





1 Answer

Comment: Is it robust and spec compliant?

It works as long as your data meets these preconditions:

  • The first line is the boundary
  • Each header block is terminated by an empty line
  • Each message part is terminated by the boundary
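
The three preconditions above can be checked before parsing. The following is only a rough sketch, assuming Python 3 and data that has already been decoded to a string; the helper name meets_preconditions is made up for this illustration and is not part of any library.

```python
def meets_preconditions(data):
    """Return True if `data` has the layout described by the three preconditions."""
    lines = [line.rstrip("\r") for line in data.split("\n")]
    while lines and not lines[-1]:
        lines.pop()  # ignore trailing blank lines

    # 1. The first line is the boundary.
    if len(lines) < 2 or not lines[0].startswith("--"):
        return False
    boundary = lines[0]

    # 3. Each part, including the last one, is terminated by the boundary.
    if not lines[-1].startswith(boundary):
        return False

    # 2. The headers after each boundary end with an empty line.
    for index, line in enumerate(lines[:-1]):
        if line.startswith(boundary):
            for following in lines[index + 1:]:
                if following == "":
                    break  # found the empty line that ends the headers
                if following.startswith(boundary):
                    return False  # next boundary reached before an empty line
            else:
                return False  # data ended before an empty line
    return True
```

A check like this only tells you whether the simple line-based parser is likely to cope with the input; it does not validate the data against the multipart specification.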

Comment: What if the content is binary, like a JPEG stream?

This is likely to break, because string methods are used and the content is read with .readline(), which depends on newlines.
For binary content, decoding from Base64 to a string and then unpacking the multipart data is therefore the wrong approach!
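
One way around this is to keep the Base64-decoded body as bytes and split it on the boundary without ever converting it to text. The following is a minimal sketch under stated assumptions: Python 3, the first line of the body is the boundary delimiter, and the delimiter does not occur inside any file content. The function name split_multipart_bytes is hypothetical.

```python
import base64
import re


def split_multipart_bytes(body):
    """Split a multipart body (bytes) into (headers, content) pairs.

    Simplification: splits on every occurrence of the delimiter, so it
    assumes the delimiter never appears inside the file content itself.
    """
    delimiter = body.split(b"\n", 1)[0].rstrip(b"\r")  # first line of the body
    parts = []
    for chunk in body.split(delimiter)[1:]:
        if chunk.startswith(b"--"):
            break  # closing delimiter ("<delimiter>--") reached
        # The header block ends at the first empty line.
        match = re.search(rb"\r?\n\r?\n", chunk)
        if match is None:
            continue
        headers = chunk[:match.start()].strip().decode("latin-1")
        content = chunk[match.end():]
        # Drop exactly one line break: it belongs to the next delimiter.
        if content.endswith(b"\r\n"):
            content = content[:-2]
        elif content.endswith(b"\n"):
            content = content[:-1]
        parts.append((headers, content))
    return parts
```

Usage would look like split_multipart_bytes(base64.b64decode(encoded_body)), where encoded_body stands for the Base64 string received from API Gateway. Each content value stays a bytes object, so a JPEG payload is returned unchanged.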


Comment: If there's a version reusing a common library

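If you can provide your data as a standard MIME message, you can use the following:

import email

# mimeHeader must hold the MIME headers, including the Content-Type line that
# names the boundary, followed by an empty line.
# data is the decoded multipart string from the question.
msg = email.message_from_string(mimeHeader + data)
print('is_multipart:{}'.format(msg.is_multipart()))

for part in msg.walk():
    # Skip the multipart container itself; only the leaf parts carry files.
    if part.get_content_maintype() == 'multipart':
        continue

    filename = part.get_filename()
    payload = part.get_payload(decode=True)  # content after transfer decoding
    print('{} filename:{}\n{}'.format(part.get_content_type(), filename, payload))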

Output:

is_multipart:True
application/rtf filename:file1.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
application/rtf filename:file2.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
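
The snippet above uses mimeHeader without defining it. A minimal sketch of one way to build it, assuming the first line of the data is the delimiter line, that is, "--" followed by the boundary. The helper name parse_with_email and the variable mime_header are made up for this example.

```python
import email


def parse_with_email(data):
    """Parse a multipart string with the email package; return (filename, payload) pairs."""
    first_line = data.split("\n", 1)[0].strip()
    boundary = first_line[2:]  # assumption: delimiter line is "--" + boundary

    # Hypothetical header: just enough for the email parser to find the parts.
    mime_header = (
        "MIME-Version: 1.0\n"
        'Content-Type: multipart/form-data; boundary="{}"\n'
        "\n"
    ).format(boundary)

    msg = email.message_from_string(mime_header + data)
    return [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.walk()
        if part.get_content_maintype() != "multipart"
    ]
```

Calling parse_with_email(data) on the string from the question should yield one pair per uploaded file. If the original HTTP Content-Type header is available in the Lambda event, taking the boundary from there is more reliable than deriving it from the body.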

Question: Parse a string of multipart data

A pure Python solution, for instance:

import io
import re

# data is the decoded multipart string shown in the question.
with io.StringIO(data) as fh:
    parts = []
    part_line = []
    part_fname = None
    new_part = None  # boundary line, taken from the first line of the data
    robj = re.compile(r'.+filename="(.+)"')

    while True:
        line = fh.readline()
        if not line:
            break

        if not new_part:
            new_part = line.rstrip('\r\n')

        if line.startswith(new_part):
            # A boundary closes the previous part, if there is one.
            if part_line:
                parts.append({'filename': part_fname, 'content': ''.join(part_line)})
                part_line = []

            # Read the headers up to the empty line and pick out the filename.
            while line and line.strip():
                _match = robj.match(line)
                if _match:
                    part_fname = _match.group(1)
                line = fh.readline()
        else:
            part_line.append(line)

for part in parts:
    print(part)

Output:

{'filename': 'file1.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
{'filename': 'file2.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)

Tested with Python: 3.4.2

Answered by stovfl, Oct 02 '22 20:10
