
Get all images from a board from a Pinterest web address

This question sounds easy, but it is not as simple as it sounds.

Brief summary of what's wrong

As an example, take this board: http://pinterest.com/dodo/web-designui-and-mobile/

Examining the HTML for the board itself (inside the div with the class GridItems) at the top of the page yields:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 0px; left: 0px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 3343px; left: 1000px; visibility: visible;">..</div>
</div>

Yet at the bottom of the page, after activating the infinite scroll a couple of times, we get this as the HTML:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 12431px; left: 750px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 19944px; left: 750px; visibility: visible;">..</div>
</div>

As you can see, some of the containers for the images higher up on the page have disappeared, and not all of the containers for the images load upon first loading the page.
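
Because containers are dropped as the page scrolls, no single snapshot of the HTML holds every image. One way to picture a workaround is to merge several snapshots and keep each image URL once. The following is only a minimal Python sketch using the standard library; it assumes each `div` with class `item` wraps an `<img>` whose `src` is the image URL, and the names `ItemImageParser` and `merge_snapshots` are made up for illustration.

```python
from html.parser import HTMLParser


class ItemImageParser(HTMLParser):
    """Collects the src of every <img> nested inside a <div class="item">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # <div> nesting depth inside the current item (0 = outside)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "item" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "img" and self.depth > 0 and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def merge_snapshots(snapshots):
    """Return image URLs from all snapshots, de-duplicated, in first-seen order."""
    seen = {}
    for html in snapshots:
        parser = ItemImageParser()
        parser.feed(html)
        for src in parser.sources:
            seen.setdefault(src, None)
    return list(seen)
```

Each snapshot would be the grid's HTML captured at a different scroll position; how those snapshots are obtained is exactly the open question here.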


What I want to do

I want to create a C# script (or one in any server-side language, for now) that downloads the page's full HTML, i.e. the markup for every image on the board, and then downloads each image from its URL. Downloading the web page and applying an appropriate XPath is easy; the real challenge is getting the full HTML that covers every image.

Is there a way to emulate scrolling to the bottom of the page, or an even easier way to retrieve every image? I imagine that Pinterest uses AJAX to change the HTML; is there a way to programmatically trigger those events so that I receive all of the HTML? Thank you in advance for suggestions and solutions, and kudos for even reading this very long question if you do not have any!

Pseudo code

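using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string pinterestUrl = "http://www.pinterest.com/...";
        string xPath = ".../img";

        // HtmlWeb fetches the page over HTTP; HtmlDocument.Load only reads files and streams.
        HtmlDocument doc = new HtmlWeb().Load(pinterestUrl);

        var imageLinks = new List<string>();

        // Currently only finds the first 25 images.
        // SelectNodes returns null when nothing matches the XPath.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
        if (nodes != null)
        {
            foreach (HtmlNode img in nodes)
            {
                imageLinks.Add(img.GetAttributeValue("src", ""));
            }
        }

        // Use image links
    }
}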
Nick Bull asked Aug 19 '13 11:08




2 Answers

Okay, so I think this may be (with a few alterations) what you need.

Caveats:

  1. This is PHP, not C# (but you said you were interested in any server-side language).
  2. This code hooks into (unofficial) Pinterest search endpoints. You'll need to change $data and $search_res to reflect the appropriate endpoints (e.g. BoardFeedResource) for your tasks. Note: at least for search, Pinterest currently uses two endpoints, one for the initial page load and another for the infinite scroll actions. Each has its own expected param structure.
  3. Pinterest has no official public API, so expect this to break without warning whenever they change anything.
  4. You may find pinterestapi.co.uk easier to implement and acceptable for what you're doing.
  5. I have some demo/debug code beneath the class that shouldn't be there once you're getting the data you want, and a default page fetch limit that you may want to change.

Points of interest:

  1. The underscore _ parameter takes a timestamp in JavaScript format, i.e. Unix time in milliseconds rather than seconds. It's not actually used for pagination.
  2. Pagination uses the bookmarks property. You make the first request to the 'new' endpoint, which doesn't require it, then take the bookmarks value from the result and pass it in your next request to get the next 'page' of results, and so on until you run out of results or reach your pre-set limit (or hit the server's maximum script execution time). I'd be curious to know exactly what the bookmarks field encodes; I would like to think there's some fun secret sauce beyond just a pin ID or some other page marker.
  3. I'm skipping the HTML and dealing with JSON instead, as it's easier (for me) than using a DOM manipulation solution or a bunch of regex.
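
To illustrate the timestamp and JSON points in isolation, here is a rough Python sketch of how such a request URL could be assembled. It reuses the endpoint names and URL layout from the PHP code in this answer, but the payload is deliberately trimmed, so treat it as an assumption rather than what Pinterest actually expects; `build_search_url` and its `now` parameter are hypothetical names.

```python
import json
import time
from urllib.parse import urlencode


def build_search_url(query, bookmark=None, now=None):
    """Build a request URL for the (unofficial) search endpoints named above."""
    # The first page and the "scroll" pages use different endpoints.
    resource = "BaseSearchResource" if bookmark is None else "SearchResource"

    options = {"query": query, "scope": "pins"}
    if bookmark is not None:
        options["bookmarks"] = [bookmark]
    # Trimmed payload: the real requests carry more fields (assumption).
    data = {"options": options, "context": {"app_version": "2f83a7e"}}

    # JavaScript-style timestamp: Unix time in milliseconds.
    timestamp = int((time.time() if now is None else now) * 1000)

    params = urlencode({
        "source_url": "/search/pins/?q=" + query,
        "data": json.dumps(data, separators=(",", ":")),
        "_": timestamp,
    })
    return "http://pinterest.com/resource/" + resource + "/get/?" + params
```

Serialising with `json.dumps` and encoding with `urlencode` keeps spaces and quotes in the search string intact, which hand-built JSON strings do not guarantee.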
<?php

if(!class_exists('Skrivener_Pins')) {

  class Skrivener_Pins {

    /**
     * Constructor
     */
    public function __construct() {
    }

    /**
     * Pinterest search function. Uses Pinterest's "internal" page APIs, so likely to break if they change.
     * @author [@skrivener] Philip Tillsley
     * @param $search_str     The string used to search for matching pins.
     * @param $limit          Max number of pages to get, defaults to 1 to avoid excessively large queries. Use care when passing in a value.
     * @param $bookmarks_str  Used internally for recursive fetches.
     * @param $page           Used internally to limit recursion.
     * @return array|false    Array of pins (int['id'], obj['image'], str['pin_link'], str['orig_link'], bool['video_flag']), or false if none were found.
     */
    public function get_tagged_pins($search_str, $limit = 1, $bookmarks_str = null, $page = 1) {

      // limit depth of recursion, ie. number of pages of 25 returned, otherwise we can hang on huge queries
      if( $page > $limit ) return false;

      // are we getting a next page of pins or not
      $next_page = false;
      if( isset($bookmarks_str) ) $next_page = true;

      // build url components
      if( !$next_page ) {

        // 1st time
        $search_res = 'BaseSearchResource'; // end point
        $path = '&module_path=' . urlencode('SearchInfoBar(query=' . $search_str . ', scope=boards)');
        $data = preg_replace("'[\n\r\s\t]'","",'{
          "options":{
            "scope":"pins",
            "show_scope_selector":true,
            "query":"' . $search_str . '"
          },
          "context":{
            "app_version":"2f83a7e"
          },
          "module":{
            "name":"SearchPage",
            "options":{
              "scope":"pins",
              "query":"' . $search_str . '"
            }
          },
          "append":false,
          "error_strategy":0
          }');
      } else {

        // this is a fetch for 'scrolling', what changes is the bookmarks reference, 
        // so pass the previous bookmarks value to this function and it is included
        // in query
        $search_res = 'SearchResource'; // different end point from 1st time search
        $path = '';
        $data = preg_replace("'[\n\r\s\t]'","",'{
          "options":{
            "query":"' . $search_str . '",
            "bookmarks":["' . $bookmarks_str . '"],
            "show_scope_selector":null,
            "scope":"pins"
          },
          "context":{
            "app_version":"2f83a7e"
          },
            "module":{
              "name":"GridItems",
            "options":{
              "scrollable":true,
              "show_grid_footer":true,
              "centered":true,
              "reflow_all":true,
              "virtualize":true,
              "item_options":{
                "show_pinner":true,
                "show_pinned_from":false,
                "show_board":true
              },
              "layout":"variable_height"
            }
          },
          "append":true,
          "error_strategy":2
        }');
      }
      $data = urlencode($data);
      $timestamp = time() * 1000; // unix time but in JS format (ie. has ms vs normal server time in secs), * 1000 to add ms (ie. 0ms)

      // build url
      $url = 'http://pinterest.com/resource/' . $search_res . '/get/?source_url=/search/pins/?q=' . urlencode($search_str)
          . '&data=' . $data
          . $path
          . '&_=' . $timestamp;//'1378150472669';

      // setup curl
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL, $url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));

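      // get result
      $curl_result = curl_exec($ch); // returns the response body as a string because CURLOPT_RETURNTRANSFER is set
      curl_close($ch);
      if( $curl_result === false ) return false; // request failed
      $curl_result = json_decode($curl_result);
      if( !is_object($curl_result) ) return false; // response was not the expected JSON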

      // clear html to make var_dumps easier to see when debugging
      // $curl_result->module->html = '';

      // isolate the pin data, different end points have different data structures
      if(!$next_page) $pin_array = $curl_result->module->tree->children[1]->children[0]->children[0]->children;
      else $pin_array = $curl_result->module->tree->children;

      // map the pin data into desired format
      $pin_data_array = array();
      $bookmarks = null;
      if(is_array($pin_array)) {
        if(count($pin_array)) {

          foreach ($pin_array as $pin) {

            //setup data
            $image_id = $pin->options->pin_id;
            $image_data = ( isset($pin->data->images->originals) ) ? $pin->data->images->originals : $pin->data->images->orig;
            $pin_url = 'http://pinterest.com/pin/' . $image_id . '/';
            $original_url = $pin->data->link;
            $video = $pin->data->is_video;

            array_push($pin_data_array, array(
              "id"          => $image_id,
              "image"       => $image_data,
              "pin_link"    => $pin_url,
              "orig_link"   => $original_url,
              "video_flag"  => $video,
              ));
          }
          $bookmarks = reset($curl_result->module->tree->resource->options->bookmarks);

        } else {
          $pin_data_array = false;
        }
      }

      // recurse until we're done
      if( !($pin_data_array === false) && !is_null($bookmarks) ) {

        // more pins to get
        $more_pins = $this->get_tagged_pins($search_str, $limit, $bookmarks, ++$page);
        if( !($more_pins === false) ) $pin_data_array = array_merge($pin_data_array, $more_pins);
        return $pin_data_array;
      }

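      // end of recursion: return the pins from this page if there are any, otherwise false
      return ( is_array($pin_data_array) && count($pin_data_array) ) ? $pin_data_array : false;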
    }

  } // end class Skrivener_Pins
} // end if



/**
 * Debug/Demo Code
 * delete or comment this section for production
 */

// output headers to control how the content displays
// header("Content-Type: application/json");
header("Content-Type: text/plain");
// header("Content-Type: text/html");

// define search term
// $tag = "vader";
$tag = "haemolytic";
// $tag = "qjkjgjerbjjkrekhjk";

if(class_exists('Skrivener_Pins')) {

  // instantiate the class
  $pin_handler = new Skrivener_Pins();

  // get pins, pinterest returns 25 per batch, function pages through this recursively, pass in limit to 
  // override default limit on number of pages to retrieve, avoid high limits (eg. limit of 20 * 25 pins/page = 500 pins to pull 
  // and 20 separate calls to Pinterest)
  $pins1 = $pin_handler->get_tagged_pins($tag, 2);

  // display the pins for demo purposes
  echo '<h1>Images on Pinterest mentioning "' . $tag . '"</h1>' . "\n";
  if( $pins1 != false ) {
    echo '<p><em>' . count($pins1) . ' images found.</em></p>' . "\n";
    skrivener_dump_images($pins1, 5);
  } else {
    echo '<p><em>No images found.</em></p>' . "\n";
  }
}

// demo function, dumps images in array to html img tags, can pass limit to only display part of array
function skrivener_dump_images($pin_array, $limit = false) {
  if(is_array($pin_array)) {
    if($limit) $pin_array = array_slice($pin_array, -($limit));
    foreach ($pin_array as $pin) {
      echo '<img src="' . $pin['image']->url . '" width="' . $pin['image']->width . '" height="' . $pin['image']->height . '" >' . "\n";
    }
  }
}

?>
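
The bookmark-driven paging that the class performs recursively can also be written as a plain loop. The Python sketch below shows only that control flow; `fetch_page` is a hypothetical callable standing in for the HTTP request, and it is assumed to return a `(pins, next_bookmark)` pair with `None` as the bookmark once nothing is left.

```python
def fetch_all_pins(fetch_page, limit=2):
    """Follow bookmarks page by page until the limit or the end of results.

    fetch_page(bookmark) is a hypothetical callable: it takes the previous
    bookmark (None for the first page) and returns (pins, next_bookmark).
    """
    pins = []
    bookmark = None
    for _ in range(limit):
        page, bookmark = fetch_page(bookmark)
        pins.extend(page)
        if not page or bookmark is None:
            break  # ran out of results
    return pins
```

A loop avoids growing the call stack with each page and makes the page limit explicit in one place.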

Let me know if you run into problems getting this adapted to your particular endpoints. Apologies for any sloppiness in the code; it didn't make it to production originally.

Skrivener answered Sep 29 '22 14:09



Probably a bit late, but with the open source py3-pinterest project you can do this easily:

First, get all pins from the board as objects; they also include the original image URL.

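# `pinterest` is assumed to be an already logged-in py3-pinterest client,
# and `target_board` a board dict with 'id' and 'url' keys.
# get all pins for the board
board_pins = []
pin_batch = pinterest.board_feed(board_id=target_board['id'], board_url=target_board['url'])

# keep requesting batches until an empty one signals the end of the board
while len(pin_batch) > 0:
    board_pins += pin_batch
    pin_batch = pinterest.board_feed(board_id=target_board['id'], board_url=target_board['url'])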

Then you can obtain the image URLs and download them, or do whatever you like with them:

for pin in board_pins:
    url = pin['image']
    # process image url..
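
For the download step itself, a minimal sketch using only the Python standard library might look like this. The helper names `filename_for` and `download_images` are made up for illustration, and the sketch assumes every URL points directly at an image file.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_for(url, index):
    """Derive a local file name from an image URL, falling back to a numbered name."""
    name = os.path.basename(urlparse(url).path)
    return name or "image_{}.jpg".format(index)


def download_images(urls, target_dir):
    """Save each URL into target_dir and return the paths written."""
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(urls):
        path = os.path.join(target_dir, filename_for(url, index))
        urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```

It could then be called as `download_images([pin['image'] for pin in board_pins], "board_images")`, where the directory name is arbitrary.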

Full code example: https://github.com/bstoilov/py3-pinterest/blob/master/download_board_images.py

Yes, it's Python, but if you still insist on C#, it should be easy to port :)

Borislav Stoilov answered Sep 29 '22 15:09
