
What is the encoding of the body of Gmail message? How to decode it?

I am using the Python API for Gmail. I am querying for some messages and retrieving them correctly, but the body of each message looks like total nonsense, even when the MIME type is said to be text/plain or text/html.

I have been searching all over the API docs, but they keep saying it's a string, when it obviously must be some encoding... I thought it could be base64 encoding, but trying to decode it with Python base64 gives me TypeError: Incorrect padding, so either it's not base64 or I'm decoding badly.

I'd love to provide a good example, but since I'm handling sensitive information I'll have to obfuscate it a bit...

{
 "payload": {
  "mimeType": "multipart/mixed",
  "filename": "",
  "headers": [
   ...
  ],
  "body": {
   "size": 0
  },
  "parts": [
   {
    "mimeType": "multipart/alternative",
    "filename": "",
    "headers": [
     {
      "name": "Content-Type",
      "value": "multipart/alternative; boundary=001a1140b160adc309053bd7ec57"
     }
    ],
    "body": {
    "size": 0
    },
    "parts": [
     {
      "partId": "0.0",
      "mimeType": "text/plain",
      "filename": "",
      "headers": [
       {
        "name": "Content-Type",
        "value": "text/plain; charset=UTF-8"
       },
       {
        "name": "Content-Transfer-Encoding",
        "value": "quoted-printable"
       }
      ],
      "body": {
           "size": 4067,
           "data": "LS0tLS0tLS0tLSBGb3J3YXJkZWQgbWVzc2FnZSAtLS0tLS0tLS0tDQpGcm9tOiBMaW5rZWRJbiA8am9iLWFwcHNAbGlua2VkaW4uY29tPg0KRGF0ZTogU2F0LCBTZXAgMywgMjAxNiBhdCA5OjMwIEFNDQpTdWJqZWN0OiBBcHBsaWNhdGlvbiBmb3IgU2VuaW9yIEJhY2tlbmQgRGV2ZWxvcG..."
      }

The field that I'm talking about is payload.parts[0].parts[0].body.data. I have truncated it at a random point, so I doubt it is decodable like that, but you get the point... What is that encoding?

Also, it wouldn't hurt to know where in the docs they explicitly say it's base64 (unless it's the standard encoding for MIME?).

UPDATE: So in the end there was some bad luck involved. I have 5 mails like this, and it turns out that the first one is malformed, for some unknown reason. After moving on to the other ones, I am able to decode all of them with the approaches suggested in the answers. Thank you all!

houcros asked Sep 07 '16 14:09


4 Answers

An important distinction: it is web-safe base64 encoded (aka "base64url"). The docs are not very good on this; MessagePartBody is best documented here: https://developers.google.com/gmail/api/v1/reference/users/messages/attachments

It says the type is "bytes" (which obviously isn't safe to transmit over JSON as-is), but I agree with you: it doesn't clearly specify that the field is base64url encoded like the other "bytes" fields in the API.

As for the padding issue, is it because you're truncating the data? If not, check that "len(data) % 4 == 0". If that check fails, the API is returning unpadded data, which would be unexpected.
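To make the two points above concrete, here is a minimal sketch of how base64url differs from standard base64 and how missing padding could be restored before decoding. The helper name decode_base64url is hypothetical (it is not part of the Gmail API or the Python standard library), and the padding repair is a defensive assumption, since the API is expected to return padded data.

```python
import base64


def decode_base64url(data: str) -> bytes:
    """Decode base64url text, restoring any '=' padding that was stripped.

    Hypothetical helper for illustration only.
    """
    # -len(data) % 4 is the number of '=' characters needed to reach
    # a length that is a multiple of 4.
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded)


# The two alphabets differ only in their last two symbols:
# standard base64 uses '+' and '/', base64url uses '-' and '_'.
raw = b"\xfb\xff\xfe"
standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard, url_safe)
print(decode_base64url(url_safe) == raw)
```

Data that contains neither of those two symbols decodes the same way with either alphabet, which is why plain base64 decoding can appear to work on some message bodies.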

Eric D answered Nov 19 '22 23:11


This is base64.

Your truncated message is:

---------- Forwarded message ----------
From: LinkedIn <job-apps@linkedin.com>
Date: Sat, Sep 3, 2016 at 9:30 AM
Subject: Application for Senior Backend Develop

I had to remove the last 3 characters from your truncated message because I was getting the same padding error as you. You probably have some garbage in the message you're trying to decode.

Here's some sample code:

import base64

body = "LS0tLS0tLS0tLSBGb3J3YXJkZWQgbWVzc2FnZSAtLS0tLS0tLS0tDQpGcm9tOiBMaW5rZWRJbiA8am9iLWFwcHNAbGlua2VkaW4uY29tPg0KRGF0ZTogU2F0LCBTZXAgMywgMjAxNiBhdCA5OjMwIEFNDQpTdWJqZWN0OiBBcHBsaWNhdGlvbiBmb3IgU2VuaW9yIEJhY2tlbmQgRGV2ZWxv"

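# b64decode returns bytes; decode them as UTF-8 (the part's declared charset) to get text.
result = base64.b64decode(body).decode("utf-8")

print(result)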

UPDATE

Here's a snippet for getting and decoding the message body. The decoding part was taken from the Gmail API documentation:

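  import base64
  import email

  # `service` is an authorized Gmail API client; `msg_id` is the ID of the message to fetch.
  message = service.users().messages().get(userId='me', id=msg_id, format='full').execute()
  # body.data is base64url text. For multipart messages it is absent at the top level
  # (size 0) and the data sits in the nested payload['parts'] instead.
  msg_bytes = base64.urlsafe_b64decode(message['payload']['body']['data'])
  mime_msg = email.message_from_bytes(msg_bytes)

  print(msg_bytes.decode('utf-8'))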

Reference doc: https://developers.google.com/gmail/api/v1/reference/users/messages/get#python
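For a multipart message like the one in the question, the data lives in nested parts rather than in the top-level body. The following is a rough sketch of walking that structure to find the first text/plain part. The helper names iter_parts and first_text are hypothetical, the sample payload is made up to mirror the shape of the question's JSON, and the code assumes the part is UTF-8 (as the Content-Type header in the question states) rather than reading the charset from the headers.

```python
import base64


def iter_parts(part):
    """Yield a message part and all of its nested parts, depth first."""
    yield part
    for child in part.get("parts", []):
        yield from iter_parts(child)


def first_text(payload, mime_type="text/plain"):
    """Return the decoded text of the first part with the given MIME type, or None."""
    for part in iter_parts(payload):
        data = part.get("body", {}).get("data")
        if part.get("mimeType") == mime_type and data:
            # body.data is base64url text; assume UTF-8 for the decoded bytes.
            return base64.urlsafe_b64decode(data).decode("utf-8")
    return None


# Made-up payload with the same nesting as the question's example.
sample_payload = {
    "mimeType": "multipart/mixed",
    "body": {"size": 0},
    "parts": [
        {
            "mimeType": "multipart/alternative",
            "body": {"size": 0},
            "parts": [
                {
                    "partId": "0.0",
                    "mimeType": "text/plain",
                    "body": {
                        "size": 13,
                        "data": base64.urlsafe_b64encode(b"Hello, world!").decode("ascii"),
                    },
                }
            ],
        }
    ],
}

print(first_text(sample_payload))
```

With a real response, the same function would be called as first_text(message['payload']).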

DallaRosa answered Nov 20 '22 00:11


The following worked for me:

base64.urlsafe_b64decode(body).decode("utf-8")
Pablo Guerrero answered Nov 20 '22 00:11


It's base64. You can use base64.decodestring to read it (in Python 3 use base64.decodebytes instead; decodestring was removed in Python 3.9). The part of the message that you attached is: '---------- Forwarded message ----------\r\nFrom: LinkedIn <job-apps@linkedin.com>\r\nDate: Sat, Sep 3, 2016 at 9:30 AM\r\nSubject: Application for Senior Backend Develo'

The incorrect padding error means that you're decoding an incorrect number of characters. You're probably trying to decode a truncated message.
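As a small illustration of that point, the sketch below truncates a made-up encoded string and shows the resulting error, then decodes only the longest prefix whose length is a multiple of 4. The helper name decode_truncated is hypothetical. In Python 3 the exception is binascii.Error rather than the TypeError shown in the question, which comes from Python 2.

```python
import base64
import binascii

# 18 bytes encode to exactly 24 characters, with no '=' padding.
encoded = base64.urlsafe_b64encode(b"Forwarded message!").decode("ascii")
truncated = encoded[:-1]  # 23 characters: no longer a multiple of 4

error = None
try:
    base64.urlsafe_b64decode(truncated)
except binascii.Error as exc:
    error = exc  # e.g. "Incorrect padding"


def decode_truncated(data: str) -> bytes:
    """Decode the longest prefix of `data` that forms whole 4-character groups."""
    usable = len(data) - len(data) % 4
    return base64.urlsafe_b64decode(data[:usable])


print(error)
print(decode_truncated(truncated))
```

Trimming this way only recovers the start of a body that was cut off; it does not repair data that is corrupted in the middle.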

SurDin answered Nov 19 '22 23:11