I'm using Selenium with Python bindings to scrape AJAX content from a web page with headless Firefox. It works perfectly when run on my local machine. When I run the exact same script on my VPS, errors get thrown on seemingly random (yet consistent) lines. My local and remote systems have the exact same OS and architecture, so I'm guessing the difference is VPS-related.
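For reference, the setup looks roughly like this. This is a sketch rather than the actual script; on Selenium of this vintage under Python 2.7, pyvirtualdisplay/Xvfb is one common way to run Firefox headlessly, and the URL is illustrative:
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))  # virtual framebuffer, so no real X server is needed
display.start()
driver = webdriver.Firefox()
driver.get("https://www.google.com/search?q=example")  # illustrative URL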
For each of these tracebacks, the line is run 4 times before an error is thrown.
I most often get this URLError when executing JavaScript to scroll an element into view.
File "google_scrape.py", line 18, in _get_data
driver.execute_script("arguments[0].scrollIntoView(true);", e)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
{'script': script, 'args':converted_args})['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
Occasionally I'll get this BadStatusLine when reading text from an element.
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
A couple of times I've gotten a socket error:
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer
I'm scraping from Google without a proxy, so my first thought was that my IP address is recognized as a VPS and put under something like a five-page-manipulation limit. But my initial research indicates that these errors would not arise from being blocked.
Any insight into what these errors mean collectively, or into the necessary considerations when making HTTP requests from a VPS, would be much appreciated.
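In the meantime, wrapping the WebDriver calls defensively at least makes the failures recoverable. This is a sketch of my own rather than code from google_scrape.py; the helper name, retry count, and pause are illustrative:
import socket
import time
import httplib
import urllib2

def with_retries(action, attempts=4, pause=1):
    # Retry a Selenium call that intermittently dies with one of the
    # transient connection errors above; re-raise after the last attempt.
    for attempt in range(attempts):
        try:
            return action()
        except (urllib2.URLError, httplib.BadStatusLine, socket.error):
            if attempt == attempts - 1:
                raise
            time.sleep(pause)

text = with_retries(lambda: e.text)  # e is a WebElement, as in the tracebacks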
After a little thinking and looking into what a webdriver really is -- automated browser input -- I should have been confused about why remote_connection.py is making urllib2 requests at all. It would seem that the text method of the WebElement class is an "extra" feature of the Python bindings that isn't part of the Selenium core. That doesn't explain the above errors, but it may indicate that the text method shouldn't be used for scraping.
I realized that, for my purposes, Selenium's only function is getting the AJAX content to load. So after the page loads, I'm parsing the source with lxml rather than getting elements with Selenium, i.e.:
html = lxml.html.fromstring(driver.page_source)
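From there, the extraction happens in-process with no further WebDriver traffic; something like this, where the XPath selector is a hypothetical placeholder:
for block in html.xpath('//div[@class="result"]'):  # hypothetical selector
    text = block.text_content().strip()
    if text:
        print(text)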
However, page_source is yet another method that results in a call to urllib2, and I consistently get the BadStatusLine error the second time I use it. Minimizing urllib2 requests is definitely a step in the right direction.
Eliminating urllib2 requests by grabbing the source with JavaScript is better yet:
html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))
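This helps because each WebElement property access (text included, and page_source as well) is its own HTTP request from the Python bindings to the driver, whereas execute_script fetches the entire rendered DOM in a single request and leaves the parsing to lxml on a local string.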
These errors can be avoided by doing a time.sleep(10) between every few requests. The best explanation I've come up with is that Google's firewall recognizes my IP as a VPS and therefore puts it under a stricter set of blocking rules.
This was my initial thought, but I still find it hard to believe because my web searches return no indication that the above errors could be caused by a firewall.
If this is the case, though, I would think the stricter rules could be circumvented with a proxy, although that proxy might need to be a local system or Tor to avoid the same restrictions.
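For what it's worth, the pacing that avoids the errors looks something like this; queries and scrape are illustrative placeholders for whatever is being fetched:
import time

for i, query in enumerate(queries):
    scrape(query)       # hypothetical per-page scrape
    if i % 3 == 2:      # pause between every few requests
        time.sleep(10)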
As per our conversation, you discovered that even for a small number of daily scrapes, Google has anti-scraping blocking in place. The solution is to put a delay of a few seconds between each fetch.
In the general case, since you are technically transferring non-recoverable costs to a third party, it is always good practice to try to reduce the extra resource load you are placing upon the remote server. Without pauses between HTTP fetches, a fast server and connection can cause a remote denial of service, especially to scrape targets that do not have Google's server resources.