Parse a string of multipart data

I have a string (base64 decoded here) that looks like this:

----------------------------212550847697339237761929
Content-Disposition: form-data; name="preferred_name"; filename="file1.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE1}
----------------------------212550847697339237761929
Content-Disposition: form-data; name="to_process"; filename="file2.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE212341234}
----------------------------212550847697339237761929--

I generate this on a simple webpage that uploads a couple of files to an AWS Lambda script through a PUT request via the API Gateway. Note that what I get from the API Gateway is a Base64 string, which I then decode into the string above.

The string above is the data that my Lambda script receives from the API Gateway. I would like to parse this string with Python 2.7 to retrieve the data it contains. I've experimented with the cgi module and its cgi.parse_multipart() function, but I cannot find a way to convert a string into the arguments it requires. Any tips?

asked Jul 11 '17 02:07

Tanishq dubey

1 Answer

Comment: Is it robust and spec compliant?

It works as long as your data meets these preconditions:

The first line is the boundary.
The headers that follow are terminated by an empty line.
Each message part is terminated by the boundary.
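
The three preconditions above could be checked before parsing, roughly like this. This is only a sketch: check_preconditions and the boundary XYZ are made-up names for illustration, and the check covers just these three points, not the whole multipart specification.

```python
def check_preconditions(data):
    """Return a list of violated preconditions (empty if all of them hold)."""
    # Split on '\n' and drop a trailing '\r' so CRLF data is handled too.
    lines = [line.rstrip('\r') for line in data.split('\n')]
    if not lines[0].startswith('--'):
        return ['first line is not a boundary']
    boundary = lines[0].rstrip()
    problems = []
    # Indexes of all lines that start with the boundary.
    marks = [i for i, line in enumerate(lines) if line.startswith(boundary)]
    # The last marker must be the closing delimiter: the boundary plus '--'.
    if lines[marks[-1]].rstrip() != boundary + '--':
        problems.append('last part is not terminated with the boundary')
    # Between two markers there must be headers followed by an empty line.
    for start, end in zip(marks, marks[1:]):
        if '' not in lines[start + 1:end]:
            problems.append('no empty line after the header at line {}'.format(start + 1))
    return problems


# Hypothetical sample data, much shorter than the upload in the question.
sample = (
    '--XYZ\n'
    'Content-Disposition: form-data; name="a"; filename="a.txt"\n'
    '\n'
    'hello\n'
    '--XYZ--\n'
)
print(check_preconditions(sample))
```

An empty list means the pure Python parser in this answer should be able to handle the data.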

Comment: What if the content is binary, like a JPEG stream?

This is likely to break, because the code uses string methods and reads the content with .readline(), which depends on newline characters.
Decoding from Base64 into a string and then unpacking the multipart data is therefore the wrong approach for binary content.
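
One possible way around this, as a sketch: keep the Base64-decoded data as bytes and pass it to email.message_from_bytes() from the standard library. That function exists only in Python 3, so this does not cover the Python 2.7 the question asks about. The name parse_multipart_bytes, the boundary XYZ and the sample payload are made up for illustration, and content_type is assumed to be the Content-Type header value of the upload request.

```python
import base64
import email


def parse_multipart_bytes(body, content_type):
    """Parse a multipart body given as bytes and return a list of part dicts."""
    # Prepend a minimal MIME header so the email package sees a full message.
    header = (b'MIME-Version: 1.0\r\nContent-Type: '
              + content_type.encode('ascii') + b'\r\n\r\n')
    msg = email.message_from_bytes(header + body)
    parts = []
    for part in msg.walk():
        # Skip the multipart container itself.
        if part.get_content_maintype() == 'multipart':
            continue
        parts.append({
            'name': part.get_param('name', header='content-disposition'),
            'filename': part.get_filename(),
            'content': part.get_payload(decode=True),  # stays bytes
        })
    return parts


# Hypothetical sample: a few arbitrary bytes, not a real JPEG file.
binary = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\n\x01'
body = (
    b'--XYZ\r\n'
    b'Content-Disposition: form-data; name="photo"; filename="a.jpg"\r\n'
    b'Content-Type: image/jpeg\r\n'
    b'\r\n' + binary + b'\r\n'
    b'--XYZ--\r\n'
)
encoded = base64.b64encode(body)  # stands in for what the API Gateway delivers
parts = parse_multipart_bytes(base64.b64decode(encoded),
                              'multipart/form-data; boundary=XYZ')
for part in parts:
    print(part['name'], part['filename'], len(part['content']))
```

Because the data is never decoded into text, bytes that are not valid characters survive the round trip.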

Comment: Is there a version that reuses a common library?

If you can provide your data as a standard MIME message, you can use the email package from the standard library:

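import email

# Assumption: data holds the decoded multipart body shown in the question.
# The boundary parameter is the first line of data without its leading '--'.
boundary = data.split('\n', 1)[0].strip()[2:]
mimeHeader = 'MIME-Version: 1.0\nContent-Type: multipart/form-data; boundary="{}"\n\n'.format(boundary)
msg = email.message_from_string(mimeHeader+data)
print('is_multipart:{}'.format(msg.is_multipart()))

for part in msg.walk():
    # Skip the multipart container itself.
    if part.get_content_maintype() == 'multipart':
        continue

    filename = part.get_filename()
    payload = part.get_payload(decode=True)
    print('{} filename:{}\n{}'.format(part.get_content_type(), filename, payload))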

Output:

is_multipart:True
application/rtf filename:file1.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
application/rtf filename:file2.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)

Question: Parse a string of multipart data

A pure Python solution, for instance:

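import io
import re

# Assumption: data holds the decoded multipart body as a str.
with io.StringIO(data) as fh:
    parts = []
    part_line = []       # content lines of the part being read
    part_fname = None    # filename taken from the current part's headers
    new_part = None      # boundary line, learned from the first line
    robj = re.compile(r'.+filename="(.+)"')

    while True:
        line = fh.readline()
        if not line: break

        # The first line of the data is the boundary.
        if not new_part:
            new_part = line.rstrip('\r\n')

        if line.startswith(new_part):
            # A boundary closes the previous part, if there was one.
            if part_line:
                parts.append({'filename':part_fname, 'content':''.join(part_line)})
                part_line = []

            # Read the headers up to the empty line (LF or CRLF) and keep the filename.
            while line.strip():
                _match = robj.match(line)
                if _match: part_fname = _match.groups()[0]
                line = fh.readline()
        else:
            part_line.append(line)

for part in parts:
    print(part)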

Output:

{'filename': 'file1.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
{'filename': 'file2.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
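
A note on the \x07 and \x0c in the outputs above: they most likely come from pasting the RTF sample into a normal (non-raw) string literal for testing, where \a and \f are escape sequences. They are not produced by the parsing itself. A minimal sketch of the difference, with the variable names plain and raw chosen only for illustration:

```python
plain = '{\rtf1\ansi'   # \r and \a are interpreted as escape sequences
raw = r'{\rtf1\ansi'    # a raw literal keeps every backslash

print(repr(plain))
print(repr(raw))
```

With data that really comes from decoding the Base64 string, the backslashes arrive unchanged.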

Tested with Python: 3.4.2

answered Oct 02 '22 20:10

stovfl

Tags: python, multipartform-data, aws-api-gateway
