Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What application does google use to show PDF attachments in gmail

Tags:

I watched the traffic when google displays PDF attachments in gmail in a new window. The content is served as PNG images for each PDF page. And its text can be selected. What does google use on server side to generate a PNG file for a particular page in a pdf file? How does the selection of text on a png file work? Any ideas?

like image 385
varun Avatar asked Apr 25 '09 18:04

varun


People also ask

How do I view PDF files in Gmail?

Type "filename:[file type]" (without the quotes) into the box. For example, if you want to find all the PDFs in your mailbox, type "filename:pdf" in the search area.

How do I change my default PDF viewer in Gmail?

Go to Settings. Go to Apps. Select the other PDF app, that always open up automatically. Scroll down to "Launch By Default" or "Open by default".

Why can I not preview PDF files in Gmail?

The document is not a valid attachment – Google Drive or Gmail might refuse to preview the document because it doesn't consider it a valid attachment. If the attachment is not considered secure, Google will not make it available for preview.


1 Answers

By default attachments are viewed securely using https://docs.google.com/gview, however it turns out you are allowed to request files over plain HTTP. This makes it a little bit easier to figure out what is going on using Wireshark.

As you indicated it was already clear that the PDF is converted on the server side to a PNG (ImageMagick is indeed a reasonable solution for this purpose), the obvious reason for this is to preserve the exact layout while still being able to view the file without requiring a PDF viewer.

However, from looking at the traffic I found out that the entire PDF is also converted to a custom XML format when calling /gview?a=gt&docid=&chan=&thid= (this is done as soon as you request the document). As I couldn't use Wireshark to copy the XML I resorted to the Firefox extension Live HTTP Headers. Here's an excerpt:

<pdf2xml>
    <meta name="Author" content="Bruce van der Kooij"/>
    <meta name="Creator" content="Writer"/>
    <meta name="Producer" content="OpenOffice.org 3.0"/>
    <meta name="CreationDate" content="20090218171300+01'00'"/>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>

I'm not quite sure yet what all the attributes on the text element stand for (with the exception of w and h) but they're obviously the coordinates of the text and possibly length. As the JavaScript Google uses is minimized (or possibly obsfuscated, but this is not likely) figuring out precisely how the client-side selection function works is not quite that easy. But most likely it uses this XML file to figure out what text the user is looking at and then copies that to the user's clipboard.

Note that there is an open source (GPL licensed) tool called pdf2xml which has similar but not quite the same output. Here's the example from their homepage:

<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
  <title>My Title</title>
  <page width="780" height="1152">
    <font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
      <text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
      <img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
      <link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
    </font>
    <font size="12" face="AGaramond-Regular" italic="true" bold="true">
      <text x="509" y="68" width="121" height="12">This is a test PDF file</text>
      <link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
    </font>
  </page>
</pdf2xml>

Hope this information is in any way useful, however like one of the other posters mentioned the only way to be sure what Google does is by asking them. It's a shame Google doesn't have an official IRC channel but they do have a forum for Google Docs support questions.

Good luck.

like image 197
Bruce van der Kooij Avatar answered Oct 11 '22 17:10

Bruce van der Kooij