
Google webpage thumbnails absolute URI

How can I get a list of either the absolute URIs or the base64 encodings of the thumbnails Google shows for page URLs in its search results?

Goal:

Iterate through URL array:

pages["pinelakedesign.com"];
pages["pinelakedesign.com/about"];
pages["pinelakedesign.com/contact"];

Output:

  • Google thumbnail 1
  • Google thumbnail 2
  • Google thumbnail N

Google serves the thumbnail JPG images in its visual search results as base64-encoded strings. In 2011 this thumbnail service replaced the previous system, which used the magnifying glass and the absolute URI construction described in this question: https://stackoverflow.com/questions/6881319/google-web-thumbnails

I just want to tile out a list of the pages in a website as Google thumbnails, so I can see at a glance which pages have been indexed and thumbnailed and what those thumbnails look like.


Edit Nov 5, 2011

I identified that a call to this URL returns JSONP with the base64 encoding, Google search result title, description and URL.

https://clients1.google.com/webpagethumbnail?r=4&f=3&s=400:585&query=pine+lake+design&hl=en&gl=us&c=29&d=http%3A%2F%2Fwww.pinelakedesign.com%2F&b=1&j=google.nyc.c.j_pVK1Tu_gAbODsAKH0ZTuAw_3787232970_3&expi=17291,27615,28936,30049,30316,31215,32035,32271,32410,32940,33104,33194,33627,33788,33854,33907,33975,34103&a=2NT

The query= parameter is the text that was searched in Google. d= is the destination of the link, and possibly the source of the thumbnail. s=400:585 gives the thumbnail dimensions. I am not sure what r=4 and f=3 do. Modifying any of these parameters results in a 404 error. My hunch is that expi= is some kind of checksum or expiration value derived from the other parameter values, but I don't know.
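To make the parameters easier to inspect, here is a minimal Python sketch that splits the URL above into its query parameters using only the standard library. The helper name thumbnail_params is made up for illustration, and the code only parses the string; it makes no request and says nothing about what the server accepts.

```python
from urllib.parse import urlsplit, parse_qs

# The thumbnail URL quoted in the question, split across lines for readability.
THUMB_URL = (
    "https://clients1.google.com/webpagethumbnail?r=4&f=3&s=400:585"
    "&query=pine+lake+design&hl=en&gl=us&c=29"
    "&d=http%3A%2F%2Fwww.pinelakedesign.com%2F&b=1"
    "&j=google.nyc.c.j_pVK1Tu_gAbODsAKH0ZTuAw_3787232970_3"
    "&expi=17291,27615,28936,30049,30316,31215,32035,32271,32410,32940,"
    "33104,33194,33627,33788,33854,33907,33975,34103&a=2NT"
)


def thumbnail_params(url):
    """Return the query parameters of a URL as a flat dict (hypothetical helper)."""
    query = urlsplit(url).query
    # parse_qs decodes percent-escapes and '+' and returns a list per key;
    # every key appears once in this URL, so keep the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = thumbnail_params(THUMB_URL)
for name, value in params.items():
    print(f"{name:>5} = {value}")
```

One thing the parsed output makes visible: the j= value is the same string as the callback name that wraps the returned JSONP.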

Returned JSONP:

google.nyc.c.j_pVK1Tu_gAbODsAKH0ZTuAw_3787232970_3({"s":"b","b":1,"quality":100,"shards":[{"heights":[300,131],"imgs":["data:image/jpeg;base64,/9j/4AAQSkZ ...THIS IS THE LONG BASE64 ENCODING ...pa5r61f/9k="],"tbts":[{"box":{"h":15,"l":0,"t":39,"w":224},"txt":"<em>Pine Lake</em> specializes in small business website <em>design</em>, redesign and hosting. We have developed the Sungem content management system which allows our <b>...</b>","txtBox":{"h":57,"l":0,"t":58,"w":400}}]}],"url":"http://www.pinelakedesign.com/"}
)
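A response of this shape can be unpacked with the Python standard library. The sketch below strips the JSONP callback wrapper, parses the JSON, and decodes each data: URI under shards[].imgs into raw bytes. The function names are made up for illustration, the field names are taken from the single response above, and the sample payload uses stand-in bytes rather than a real thumbnail.

```python
import base64
import json
import re


def parse_jsonp(text):
    """Strip the callback wrapper and return (callback_name, parsed_json)."""
    match = re.match(r"\s*([\w.$]+)\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return match.group(1), json.loads(match.group(2))


def decode_data_uri(uri):
    """Return (mime_type, raw_bytes) for a base64 data: URI."""
    header, _, encoded = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(encoded)


def thumbnail_images(payload):
    """Yield (mime_type, raw_bytes) for every image in every shard."""
    for shard in payload.get("shards", []):
        for uri in shard.get("imgs", []):
            yield decode_data_uri(uri)


# Stand-in bytes, NOT a real JPEG; the real response carries the full image.
fake_jpeg = b"\xff\xd8\xff\xe0stand-in, not a real image"
sample = (
    'google.nyc.c.j_example({"shards":[{"heights":[300,131],'
    '"imgs":["data:image/jpeg;base64,'
    + base64.b64encode(fake_jpeg).decode("ascii")
    + '"]}],"url":"http://www.pinelakedesign.com/"}\n)'
)

callback, payload = parse_jsonp(sample)
images = list(thumbnail_images(payload))
print(callback, payload["url"], [(mime, len(data)) for mime, data in images])
```

With a real response, writing each decoded byte string to a .jpg file would give the thumbnail pieces on disk.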

Update Nov 8, 2011

I am looking for a solution like Embedly's Preview for viewing Google thumbnails.

Update Feb 9, 2012

PhantomJS looks like a good way to take remote snapshots server-side, but it does not help identify how to get at Google's images.

Update Mar 26, 2012

I believe Google's search spider is a headless version of desktop Chrome running at a 1024px-wide resolution. A Chrome-based spider could execute JavaScript, use @font-face and CSS3 selectors, view Flash (even waiting for a preloader to reach 100%) and take accurate snapshots of the rendered pages after all assets have loaded and DOM manipulation has finished. Would anybody from Google please weigh in to confirm or deny any of this?

Dylan Valade asked Oct 27 '11 21:10



1 Answer

Basically, they first make a curl request to the query URL and then extract the missing "a" parameter from the HTML response. They then use it to construct the correct URL and call the Google API to get the image. After that there is more complex work, like merging the resulting images with ImageMagick to get a full preview, but that's a plus...
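One possible reading of those steps, sketched in Python with only the standard library. Everything here is an assumption: the answer does not say where in the HTML the "a" value lives, so find_a_param simply looks for an embedded webpagethumbnail URL, and build_thumbnail_url merges the value into parameters the caller already has. Both function names are hypothetical, fetching the results page is left out, and nothing here has been checked against the live service.

```python
import re
from html import unescape
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the URL quoted in the question.
THUMB_ENDPOINT = "https://clients1.google.com/webpagethumbnail"


def find_a_param(html):
    """Return the 'a' value of the first webpagethumbnail URL in html, or None.

    ASSUMPTION: the results page embeds a complete thumbnail URL somewhere.
    """
    pattern = r"https?://[^\s\"'<>]*webpagethumbnail\?[^\s\"'<>]+"
    for raw_url in re.findall(pattern, html):
        # Attribute values in HTML usually escape '&' as '&amp;'.
        query = parse_qs(urlsplit(unescape(raw_url)).query)
        if "a" in query:
            return query["a"][0]
    return None


def build_thumbnail_url(base_params, a_value):
    """Combine already-known parameters with the extracted 'a' value."""
    params = dict(base_params)
    params["a"] = a_value
    # urlencode percent-escapes ':' and '/', which decodes to the same values.
    return THUMB_ENDPOINT + "?" + urlencode(params)


# Made-up HTML fragment, used only to exercise the helpers.
sample_html = (
    '<a href="https://clients1.google.com/webpagethumbnail'
    '?r=4&amp;f=3&amp;a=2NT">preview</a>'
)
a_value = find_a_param(sample_html)
url = build_thumbnail_url(
    {"r": "4", "f": "3", "s": "400:585", "d": "http://www.pinelakedesign.com/"},
    a_value,
)
print(a_value)
print(url)
```

Because the question reports that changing any parameter returns a 404, the other parameters (j, expi and so on) would presumably have to come from the same results page as well.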

brunobar79 answered Oct 09 '22 20:10
