I would like to start a Selenium browser with a particular setup (Privoxy, Tor, random user agent...) in a function and then call this function in my code. I have created a Python script, mybrowser.py, with this inside:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from fake_useragent import UserAgent
from stem import Signal
from stem.control import Controller
class MyBrowserClass:
    @staticmethod
    def start_browser():
        # Route PhantomJS traffic through Privoxy, which forwards to Tor.
        service_args = [
            '--proxy=127.0.0.1:8118',
            '--proxy-type=http',
        ]
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = UserAgent().random
        browser = webdriver.PhantomJS(service_args=service_args, desired_capabilities=dcap)
        return browser

    @staticmethod
    def set_new_ip():
        # Ask Tor for a new identity; `password` is the Tor control password
        # and must be defined elsewhere in this module.
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password=password)
            controller.signal(Signal.NEWNYM)
Then I import it into another script, myscraping.py, with this inside:
import mybrowser
import time
browser = mybrowser.MyBrowserClass.start_browser()
browser.get("https://canihazip.com/s")
print(browser.page_source)
mybrowser.MyBrowserClass.set_new_ip()
time.sleep(12)
browser.get("https://canihazip.com/s")
print(browser.page_source)
The browser is working: I can access the page and retrieve it with .page_source.
But the IP doesn't change between the first and the second print. If I move the content of the function inside myscraping.py (and remove the import and the function call), then the IP changes.
Why? Is it a problem with returning the browser? How can I fix this?
Actually, the situation is a bit more complex. When I connect to https://check.torproject.org before and after the call to mybrowser.set_new_ip() and the 12-second wait (see the lines below), the IP given by the webpage changes between the first and the second call. So my IP is changed (according to Tor), but neither https://httpbin.org/ip nor icanhazip.com detects the change.
...
browser.get("https://canihazip.com/s")
print(browser.page_source)
browser.get("https://check.torproject.org/")
print(browser.find_element_by_xpath('//div[@class="content"]').text)
mybrowser.MyBrowserClass.set_new_ip()
time.sleep(12)
browser.get("https://check.torproject.org/")
print(browser.find_element_by_xpath('//div[@class="content"]').text)
browser.get("https://canihazip.com/s")
print(browser.page_source)
So the IPs that are printed look like this:
42.38.215.198 (canihazip before mybrowser.set_new_ip() )
42.38.215.198 (check.torproject before mybrowser.set_new_ip() )
106.184.130.30 (check.torproject after mybrowser.set_new_ip() )
42.38.215.198 (canihazip after mybrowser.set_new_ip())
Privoxy configuration: in C:\Program Files (x86)\Privoxy\config.txt, I have uncommented this line (9050 is the port Tor uses):
forward-socks5t / 127.0.0.1:9050
Tor configuration: in torrc, I have this:
ControlPort 9051
HashedControlPassword : xxxx
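For reference, the ControlPort settings above are what set_new_ip() talks to. Here is a minimal sketch of that call, assuming the control password is passed in as an argument rather than read from a global, and assuming it is acceptable to wait out Tor's own NEWNYM rate limit instead of using a fixed sleep. The helper signal_newnym and its parameters are names made up for this sketch. Controller.get_newnym_wait() comes from stem.

```python
import time


def signal_newnym(controller, newnym_signal, sleep=time.sleep):
    """Send NEWNYM once Tor is ready to honour it; return the seconds waited."""
    # Tor rate-limits NEWNYM; stem reports how long is left, in seconds.
    wait = controller.get_newnym_wait()
    if wait > 0:
        sleep(wait)
    controller.signal(newnym_signal)
    return wait


def set_new_ip(password, control_port=9051):
    """Authenticate on the ControlPort and request a new Tor identity."""
    # Imported here so signal_newnym can be exercised without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        return signal_newnym(controller, Signal.NEWNYM)
```

Note that NEWNYM only asks Tor to use fresh circuits for new connections. It says nothing about what the browser or the proxy in front of Tor does with connections or pages they already hold.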
The IP probably appears unchanged because PhantomJS keeps a memory cache of requested content. Your first visit using a PhantomJS browser can return a dynamic result, but that result is then cached, and each subsequent visit to the same URL is served from the cached page.
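A quick way to probe this hypothesis is to make every request URL unique, so a cached copy of an earlier page cannot match. The sketch below is only a diagnostic. It assumes the cache is keyed by the full URL and that the target site ignores unknown query parameters. The function name with_cache_buster and the parameter name _nocache are made up for illustration.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url, token=None, param="_nocache"):
    """Return url with an extra query parameter so it differs on every call."""
    if token is None:
        # Nanosecond timestamp: different on every call in practice.
        token = str(time.time_ns())
    parts = urlsplit(url)
    # Keep any query parameters the URL already has, then append ours.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With this, browser.get(with_cache_buster("https://canihazip.com/s")) would request a fresh URL each time. If the printed IP still does not change after the NEWNYM signal, the cache is probably not the whole story.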
This memory cache has caused issues such as CSRF tokens not changing on refresh, and I believe it is the root cause of your problem. The issue was reported and resolved in 2013, but the solution is a method, clearMemoryCache, found in PhantomJS's WebPage class. Sadly, we are dealing with a Selenium webdriver.PhantomJS instance.
So, unless I am overlooking something, it would be tough to access this method through Selenium's abstraction.
The only solution I see fit is to use another webdriver that doesn't have a memory cache like PhantomJS's. I have tested it using Chrome and it works perfectly:
103.***.**.***
72.***.***.***
Also, as a side note, Selenium is phasing out PhantomJS:
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
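As a rough sketch of what start_browser() might look like with headless Chrome, assuming chromedriver is installed and on the PATH, Privoxy still listens on 127.0.0.1:8118, and your Selenium version accepts the options= keyword (older releases used chrome_options=). The helper chrome_arguments is a name invented here. --proxy-server, --user-agent and --headless are standard Chrome command-line switches.

```python
def chrome_arguments(proxy="127.0.0.1:8118", user_agent=None, headless=True):
    """Build the Chrome command-line switches for the proxied browser."""
    args = ["--proxy-server=http://" + proxy]
    if user_agent:
        args.append("--user-agent=" + user_agent)
    if headless:
        args.append("--headless")
    return args


def start_browser(proxy="127.0.0.1:8118", user_agent=None):
    """Start headless Chrome routed through the local Privoxy instance."""
    # Imported here so chrome_arguments can be checked without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(proxy, user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

To keep the random user agent from the question, you could call start_browser(user_agent=UserAgent().random) with fake_useragent imported as before.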