I watched the traffic when google displays PDF attachments in gmail in a new window. The content is served as PNG images for each PDF page. And its text can be selected. What does google use on server side to generate a PNG file for a particular page in a pdf file? How does the selection of text on a png file work? Any ideas?
Type "filename:[file type]" (without the quotes) into the box. For example, if you want to find all the PDFs in your mailbox, type "filename:pdf" in the search area.
Go to Settings. Go to Apps. Select the other PDF app, that always open up automatically. Scroll down to "Launch By Default" or "Open by default".
The document is not a valid attachment – Google Drive or Gmail might refuse to preview the document because it doesn't consider it a valid attachment. If the attachment is not considered secure, Google will not make it available for preview.
By default attachments are viewed securely using https://docs.google.com/gview, however it turns out you are allowed to request files over plain HTTP. This makes it a little bit easier to figure out what is going on using Wireshark.
As you indicated it was already clear that the PDF is converted on the server side to a PNG (ImageMagick is indeed a reasonable solution for this purpose), the obvious reason for this is to preserve the exact layout while still being able to view the file without requiring a PDF viewer.
However, from looking at the traffic I found out that the entire PDF is also converted to a custom XML format when calling /gview?a=gt&docid=&chan=&thid= (this is done as soon as you request the document). As I couldn't use Wireshark to copy the XML I resorted to the Firefox extension Live HTTP Headers. Here's an excerpt:
<pdf2xml>
<meta name="Author" content="Bruce van der Kooij"/>
<meta name="Creator" content="Writer"/>
<meta name="Producer" content="OpenOffice.org 3.0"/>
<meta name="CreationDate" content="20090218171300+01'00'"/>
<page t="0" l="0" w="595" h="842">
<text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
<text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
</page>
</pdf2xml>
I'm not quite sure yet what all the attributes on the text element stand for (with the exception of w and h) but they're obviously the coordinates of the text and possibly length. As the JavaScript Google uses is minimized (or possibly obsfuscated, but this is not likely) figuring out precisely how the client-side selection function works is not quite that easy. But most likely it uses this XML file to figure out what text the user is looking at and then copies that to the user's clipboard.
Note that there is an open source (GPL licensed) tool called pdf2xml which has similar but not quite the same output. Here's the example from their homepage:
<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
<title>My Title</title>
<page width="780" height="1152">
<font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
<text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
<img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
<link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
</font>
<font size="12" face="AGaramond-Regular" italic="true" bold="true">
<text x="509" y="68" width="121" height="12">This is a test PDF file</text>
<link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
</font>
</page>
</pdf2xml>
Hope this information is in any way useful, however like one of the other posters mentioned the only way to be sure what Google does is by asking them. It's a shame Google doesn't have an official IRC channel but they do have a forum for Google Docs support questions.
Good luck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With