PHP: Gmail's messages contain invalid HTML and random jargon

I'm creating an email-based CMS with PHP, and I'm required to use Gmail as the email service. The script is insanely simple for now, and the only problem I'm having is dealing with Gmail's email syntax.

I was expecting something a bit more manageable, like this, when getting an email:

<u>asfasfasf</u> <u style="font-style: italic;">asdfaf</u> <u style="font-style: italic; font-weight: bold;">asfsaf</u> asfasf <a href="http://asfasfafs">asfasf</a>
<br />
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent sodales mauris quis nisl pellentesque eleifend. Sed convallis turpis quis turpis malesuada feugiat. Fusce sed metus non orci convallis congue. Integer egestas vulputate ipsum, sed fringilla velit elementum scelerisque. Pellentesque convallis metus sit amet enim faucibus adipiscing.

But I'm getting this instead (duck and cover):

<u>asfasfasf </u><u style=3D"font-style: italic; ">asdfaf =A0</u><u style=
=3D"font-style: italic; font-weight: bold; ">asfsaf </u>asfasf <a href=3D"h=
ttp://asfasfafs">asfasf</a><div><br></div><div><meta http-equiv=3D"content-=
type" content=3D"text/html; charset=3Dutf-8"><span class=3D"Apple-style-spa=
n" style=3D"font-family: Arial, Helvetica, sans; font-size: 11px; "><p styl=
e=3D"text-align: justify; font-size: 11px; line-height: 14px; margin-top: 0=
px; margin-right: 0px; margin-bottom: 14px; margin-left: 0px; padding-top: =
0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; ">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent sodales m=
auris quis nisl pellentesque eleifend. Sed convallis turpis quis turpis mal=
esuada feugiat. Fusce sed metus non orci convallis congue. Integer egestas =
vulputate ipsum, sed fringilla velit elementum scelerisque. Pellentesque co=
nvallis metus sit amet enim faucibus adipiscing.</p>
</span>

I tried Tidy, but it can't deal with Gmail's links and 'line breaks'. The breaks are just a = at the end of each line, which completely messes up Tidy, and the links are sometimes (at random, I think) split like this: <a href=3D"h=\nttp://asfasfafs">asfasf</a>, with that =\n right in the middle!

How would I train Tidy to deal with this sort of blasphemous HTML and output something I can pipe directly into a <div> inside of a website?

Thanks!

Blender asked Dec 06 '10 18:12


1 Answer

That looks like quoted-printable encoding: =3D is an encoded "=", =A0 is the byte 0xA0, and a lone = at the end of a line is a soft line break that should be removed when decoding. Check the message's "Content-Transfer-Encoding:" header to see whether any encoding is present (such as base64 or quoted-printable), and decode the body before trying to parse the content.
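
As a rough illustration of that advice, here is a minimal sketch that decodes a body based on the value of its Content-Transfer-Encoding header. It assumes you have already extracted the header value and the raw body of the HTML part yourself. The function name decode_body and the sample string are made up for this example. quoted_printable_decode() and base64_decode() are PHP built-ins.

```php
<?php
// Hypothetical helper (not from the question): undo the transfer encoding
// named in the part's Content-Transfer-Encoding header.
function decode_body(string $body, string $encoding): string
{
    switch (strtolower(trim($encoding))) {
        case 'quoted-printable':
            // Turns =3D back into "=", and drops the "=" + newline soft breaks.
            return quoted_printable_decode($body);
        case 'base64':
            $decoded = base64_decode($body);
            // base64_decode() can return false; fall back to the raw body.
            return $decoded === false ? $body : $decoded;
        default:
            // 7bit, 8bit, binary, or no header: nothing to undo.
            return $body;
    }
}

// Sample fragment modelled on the message in the question.
$raw = "<a href=3D\"h=\r\nttp://asfasfafs\">asfasf</a>";
echo decode_body($raw, 'quoted-printable'), "\n";
```

The decoded bytes are still in whatever charset the part declares, so a byte such as the one produced by =A0 may need converting to UTF-8 before the HTML is handed to Tidy or echoed into a page.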

David Gelhar answered Nov 05 '22 18:11