Python

Question

Note: Any solution is welcome; Selenium just seems like the most likely tool for this.

Imgur has albums, and an album's image links are stored in (a React element?) GalleryPost.album_image_store._.posts.{ALBUM_ID}.images (thanks to this guy for figuring this out).

Using the React DevTools extension for Chrome, I can see this array of image links, but I want to access it from a Python script.

Any ideas how?

P.S. I don't know much about React, so please excuse me if this is a stupid question or if I'm using incorrect terminology.

Here's the album I've been working with: https://i.sstatic.net/545pu.jpg

Implemented Solution:

Huge thanks to Eduard Florinescu for working with me to figure all this out. I knew hardly anything about Selenium, how to run JavaScript in it, or which commands I could use.

Modifying some of his code, I ended up with the following.

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver  
from selenium.webdriver.chrome.options import Options


# Snagged from: https://stackoverflow.com/a/480227
def rmdupe(seq):
    # Removes duplicates from list
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]


chrome_options = Options()  
chrome_options.add_argument("--headless")  

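# Value 2 blocks image loading; only the links are needed.
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

# Newer Selenium releases take "options="; "chrome_options=" is deprecated.
driver = webdriver.Chrome(options=chrome_options)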
driver.set_window_size(1920, 10000)
driver.get("https://i.sstatic.net/545pu.jpg")


links = []
for i in range(0, 10):  # Tune as needed
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for div in soup.find_all('div', {'class': 'image post-image'}):
        imgs = div.find_all('img')
        for img in imgs:
            srcs = img.get_attribute_list('src')
            links.extend(srcs)
        sources = div.find_all('source')
        for s in sources:
            srcs = s.get_attribute_list('src')
            links.extend(srcs)
    links = rmdupe(links)  # Remove duplicates
    driver.execute_script('window.scrollBy(0, 750)')
    sleep(.2)

>>> len(links)
# 36 -- Huzzah! Got all the album links!
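
As a side note on the rmdupe helper above: on Python 3.7+ dictionaries keep insertion order, so the same order-preserving dedupe can be sketched with dict.fromkeys. The function name dedupe_keep_order and the sample links are made up for illustration.

```python
def dedupe_keep_order(seq):
    """Return the items of seq with duplicates removed, first occurrence kept."""
    # dict keys are unique and (on Python 3.7+) keep insertion order.
    return list(dict.fromkeys(seq))


if __name__ == "__main__":
    sample = ["a.jpg", "b.mp4", "a.jpg", "c.jpg", "b.mp4"]  # hypothetical links
    print(dedupe_keep_order(sample))  # ['a.jpg', 'b.mp4', 'c.jpg']
```

This only works for hashable items, which is also true of the set-based rmdupe.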

Notes:

Creates a headless chrome instance, so the code can be implemented in a script or potentially a larger project.
I used BeautifulSoup because it's a bit easier to work with and I was having some issues with finding elements and accessing their values using selenium (likely due to inexperience).
Set the display size to be really "tall" so more image links are loaded at once.
Disabled images in chrome browser settings to stop the browser from actually downloading the images (all I need are the links).
Some links are .mp4 files, which are rendered in HTML as video elements with the link in a nested <source> tag. The portion of code starting with sources = div.find_all('source') makes sure those album links are not lost.
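
To make the img/source handling from the notes concrete, here is a small sketch that runs on a hard-coded HTML fragment instead of a live page. It uses the standard library's html.parser rather than BeautifulSoup so it runs without extra installs. Unlike the script above, it does not restrict itself to div.image.post-image containers. The fragment, its URLs, and the names SrcCollector and collect_srcs are invented for illustration; real Imgur markup may differ.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect the src attribute of every <img> and <source> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "source"):
            src = dict(attrs).get("src")
            # Skip tags without a src and links that were already seen.
            if src and src not in self.links:
                self.links.append(src)


def collect_srcs(html):
    """Return the unique src values of <img>/<source> tags, in page order."""
    parser = SrcCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up fragment: two images (one repeated) and one video.
SAMPLE_HTML = """
<div class="image post-image"><img src="//example.com/aaa.jpg"></div>
<div class="image post-image">
  <video><source src="//example.com/bbb.mp4" type="video/mp4"></video>
</div>
<div class="image post-image"><img src="//example.com/aaa.jpg"></div>
"""

if __name__ == "__main__":
    print(collect_srcs(SAMPLE_HTML))
```

With Selenium, the same function could be fed driver.page_source on each scroll step.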

Eduard Florinescu · Accepted Answer

You don't need to know a framework to automate a page. You just need to access the DOM, which you can do with Selenium and Python. Sometimes a little vanilla JavaScript helps, too.

To get those links, try pasting this into the browser console:

images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} console.log(images_links)

The same JS snippet, run through Selenium with Python:

from time import sleep

from selenium import webdriver

driver = webdriver.Chrome()

driver.get("https://imgur.com/a/JNzjB")
for i in range(0, 7):  # tune this to the number of scrolls the album needs
    driver.execute_script('window.scrollBy(0, 2000)')

sleep(2)  # give lazily loaded images time to appear
# Collect the src of every <img> currently in the DOM
list_of_images_links = driver.execute_script('images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} return images_links;')
list_of_images_links
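
The fixed number of scrolls has to be tuned per album. One possible alternative, sketched here under assumptions, is to keep scrolling until a few rounds in a row yield no new links. The function only needs an object with an execute_script method, so a Selenium driver should fit. The names collect_image_links, max_scrolls, patience, and pause are hypothetical, and FakeDriver only simulates lazy loading so the sketch runs without a browser.

```python
from time import sleep

# JavaScript returning the src of every <img> currently in the DOM.
JS_IMG_SRCS = (
    "return Array.from(document.querySelectorAll('img'))"
    ".map(function (img) { return img.src; });"
)


def collect_image_links(driver, max_scrolls=50, patience=3, pause=0.5):
    """Scroll down until `patience` rounds in a row add no new links."""
    links = []
    idle_rounds = 0
    for _ in range(max_scrolls):
        found = driver.execute_script(JS_IMG_SRCS) or []
        added = 0
        for src in found:
            if src and src not in links:  # keep first occurrence only
                links.append(src)
                added += 1
        idle_rounds = 0 if added else idle_rounds + 1
        if idle_rounds >= patience:
            break
        driver.execute_script("window.scrollBy(0, 2000)")
        sleep(pause)  # give lazily loaded images time to appear
    return links


class FakeDriver:
    """Stand-in for a Selenium driver: reveals two more images per scroll."""

    def __init__(self, total):
        self.total = total
        self.visible = 2

    def execute_script(self, script):
        if script.startswith("return"):
            shown = min(self.visible, self.total)
            return ["img%d.jpg" % i for i in range(shown)]
        self.visible += 2  # any other script is treated as a scroll


if __name__ == "__main__":
    print(len(collect_image_links(FakeDriver(total=7), pause=0)))  # 7
```

With a real browser you would call driver.get(...) first and then pass the driver in place of FakeDriver; patience and pause would still need tuning for slow connections.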


Update:

You don't need Selenium: just paste this into an Opera console (make sure multiple downloads are enabled) and voilà:

document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement('a'); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }

The same thing, formatted for readability:

document.body.style.zoom=0.1;
images = document.querySelectorAll("img");
for (i of images) {
    var a = document.createElement('a');
    a.href = i.src;
    console.log(i);
    a.download = i.src;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
}
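
If the links are already available in Python (for example, the list returned by execute_script), they could also be downloaded there instead of clicking generated anchors in the browser. This is a sketch using only the standard library; filename_from_url and download_all are hypothetical helper names, and it assumes each link is a direct, publicly fetchable file URL.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    return os.path.basename(urlsplit(url).path)


def download_all(links, dest_dir="."):
    """Download every link into dest_dir and return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in links:
        name = filename_from_url(url)
        if not name:  # skip URLs that do not end in a file name
            continue
        target = os.path.join(dest_dir, name)
        urlretrieve(url, target)  # performs the actual HTTP request
        saved.append(target)
    return saved
```

Files that share a name would overwrite each other here, so a real script may want to add a counter or use the full path in the file name.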

Update 2 Opera webdriver

import os
from time import sleep
from selenium import webdriver
from selenium.webdriver.common import desired_capabilities
from selenium.webdriver.opera import options

# Raw strings keep the backslashes literal (otherwise "\51" is read as an octal escape).
_operaDriverLoc = os.path.abspath(r'c:\Python27\Scripts\operadriver.exe')  # Replace this path with the actual path on your machine.
_operaExeLoc = os.path.abspath(r'c:\Program Files\Opera\51.0.2830.34\opera.exe')   # Replace this path with the actual path on your machine.

_remoteExecutor = 'http://127.0.0.1:9515'
_operaCaps = desired_capabilities.DesiredCapabilities.OPERA.copy()

_operaOpts = options.ChromeOptions()
_operaOpts.binary_location = _operaExeLoc  # public property instead of the private attribute

# Use the below argument if you want the Opera browser to be in the maximized state when launching.
# The full list of supported arguments can be found on http://peter.sh/experiments/chromium-command-line-switches/
_operaOpts.add_argument('--start-maximized')

driver = webdriver.Chrome(executable_path = _operaDriverLoc, chrome_options = _operaOpts, desired_capabilities = _operaCaps)


driver.get("https://imgur.com/a/JNzjB")
for i in range(0,7): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 2000)')

sleep(4)
driver.execute_script("document.body.style.zoom=0.1")
list_of_images_links=driver.execute_script('images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} return images_links;')
list_of_images_links
driver.execute_script('document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement("a"); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }')

Python - Access React Props using Selenium

Tags:

javascript

reactjs

selenium

Bobs Burgers
