
How to return a Selenium browser (or how to import a function that returns a Selenium browser)

I would like to start a Selenium browser with a particular setup (Privoxy, Tor, random user agent...) in a function and then call this function from my code. I have created a Python script, mybrowser.py, with this inside:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from fake_useragent import UserAgent
from stem import Signal
from stem.control import Controller

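class MyBrowserClass:
    @staticmethod
    def start_browser():
        # Route PhantomJS through Privoxy (which forwards to Tor).
        service_args = [
            '--proxy=127.0.0.1:8118',
            '--proxy-type=http',
        ]
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        # Use a random user agent for each new browser.
        dcap["phantomjs.page.settings.userAgent"] = UserAgent().random

        browser = webdriver.PhantomJS(service_args=service_args, desired_capabilities=dcap)
        return browser

    @staticmethod
    def set_new_ip():
        # Ask Tor for a new circuit through its control port.
        with Controller.from_port(port=9051) as controller:
            # `password` (the Tor control password) must be defined elsewhere in this module.
            controller.authenticate(password=password)
            controller.signal(Signal.NEWNYM)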

Then I import it into another script myscraping.py with this inside:

import mybrowser
import time

browser = mybrowser.MyBrowserClass.start_browser()
browser.get("https://canihazip.com/s")
print(browser.page_source)
mybrowser.MyBrowserClass.set_new_ip()
time.sleep(12) 
browser.get("https://canihazip.com/s")
print(browser.page_source)

The browser is working - I can access the page and retrieve it with .page_source.

But the IP doesn't change between the first and the second print. If I move the content of the function into myscraping.py (and remove the import and the function call), then the IP changes.

Why? Is it a problem with returning the browser? How can I fix this?


Actually, the situation is a bit more complex. When I load https://check.torproject.org before and after calling mybrowser.set_new_ip() and waiting 12 seconds (see the lines below), the IP reported by that page changes between the first and the second call. So my IP does change (according to Tor), but neither https://httpbin.org/ip nor icanhazip.com detects the change.

...
browser.get("https://canihazip.com/s")
print(browser.page_source)
browser.get("https://check.torproject.org/")
print(browser.find_element_by_xpath('//div[@class="content"]').text)
mybrowser.set_new_ip()
time.sleep(12)
browser.get("https://check.torproject.org/")
print(browser.find_element_by_xpath('//div[@class="content"]').text)
browser.get("https://canihazip.com/s")
print(browser.page_source)

The printed IPs look like this:

42.38.215.198 (canihazip before mybrowser.set_new_ip())
42.38.215.198 (check.torproject before mybrowser.set_new_ip())
106.184.130.30 (check.torproject after mybrowser.set_new_ip())
42.38.215.198 (canihazip after mybrowser.set_new_ip())

Privoxy configuration: in C:\Program Files (x86)\Privoxy\config.txt, I have uncommented this line (9050 is the port Tor uses):

forward-socks5t   /               127.0.0.1:9050 

Tor configuration: in torrc, I have this:

ControlPort 9051
HashedControlPassword : xxxx
MagTun asked Dec 25 '17 14:12


1 Answer

This is probably because PhantomJS keeps a memory cache of requested content. Your first visit with a PhantomJS browser can return a dynamic result, but that result is then cached and each subsequent visit to the same page is served from the cache.

This memory cache has caused issues such as CSRF tokens not changing on refresh, and I believe it is the root cause of your problem. The issue was reported and resolved in 2013, but the solution is a method, clearMemoryCache, found in PhantomJS's WebPage class. Sadly, we are dealing with a Selenium webdriver.PhantomJS instance.

So, unless I am overlooking something, it would be tough to access this method through Selenium's abstraction.
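
If you want to check the caching hypothesis before switching drivers, one rough test is to request a URL the browser has not seen yet, for example by appending a throwaway query parameter. The sketch below is only a diagnostic aid: the helper name cache_busted and the parameter name _nocache are made up for illustration, and it assumes the IP-echo site ignores unknown query parameters, which I have not verified.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url, param="_nocache"):
    """Return `url` with a unique throwaway query parameter appended.

    `param` is a hypothetical name; any parameter the server ignores would do.
    """
    parts = urlsplit(url)
    # Keep any query parameters that are already present.
    query = parse_qsl(parts.query, keep_blank_values=True)
    # A random value makes every generated URL different from the last one.
    query.append((param, uuid.uuid4().hex))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Possible usage with the browser from the question:
#   browser.get(cache_busted("https://canihazip.com/s"))
#   print(browser.page_source)
```

If the address printed after set_new_ip() changes when the URL is made unique like this, that would point towards the cache rather than towards Tor or Privoxy.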

The only solution I see is to use another webdriver that doesn't have a memory cache like PhantomJS's. I tested it with Chrome and it works perfectly:

103.***.**.***
72.***.***.***
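
For reference, a Chrome-based start_browser could look roughly like the sketch below. It is not the exact code I ran: it assumes Chrome and chromedriver are installed and on the PATH, a Selenium version that accepts the options keyword, and the same Privoxy address as in the question (127.0.0.1:8118). The helper build_chrome_arguments is a name invented here so the flag list can be inspected without launching a browser.

```python
def build_chrome_arguments(proxy="127.0.0.1:8118", user_agent=None):
    """Build the Chrome command-line flags for a headless, proxied browser."""
    args = [
        "--headless",                       # no visible window, like PhantomJS
        "--proxy-server=http://" + proxy,   # send traffic through Privoxy
    ]
    if user_agent:
        args.append("--user-agent=" + user_agent)
    return args


def start_browser(proxy="127.0.0.1:8118", user_agent=None):
    """Start headless Chrome through the proxy (assumes chromedriver on PATH)."""
    # Imported here so the flag helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_arguments(proxy, user_agent):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

In mybrowser.py you would pass UserAgent().random as user_agent to keep the random user agent from the original setup; set_new_ip() can stay unchanged.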

Also, as a side note, Selenium is phasing out PhantomJS:

UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead

Cole answered Oct 21 '22 22:10