I am creating a web scraping script and divided it into four pieces. Separately they all work perfect, however when I put them all together I get the following error : urlopen error [Errno 111] Connection refused. I have looked at similar questions to mine and have tried to catch the error with try-except but even that doesn`t work. My all in one code is :
from selenium import webdriver
import re
import urllib2
site = ""
def phone():
global site
site = "https://www." + site
if "spokeo" in site:
browser = webdriver.Firefox()
browser.get(site)
content = browser.page_source
browser.quit()
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
print m_obj.group(0)
elif "addresses" in site:
usock = urllib2.urlopen(site)
data = usock.read()
usock.close()
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\d{4})", data)
if m_obj:
print m_obj.group(0)
else :
usock = urllib2.urlopen(site)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\s\d{3}-\d{4})", data)
if m_obj:
print m_obj.group(0)
def pipl():
global site
url = "https://pipl.com/search/?q=tom+jones&l=Phoenix%2C+AZ%2C+US&sloc=US|AZ|Phoenix&in=6"
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
r_list = [#re.compile("spokeo.com/[^\s]+"),
re.compile("addresses.com/[^\s]+"),
re.compile("10digits.us/[^\s]+")]
for r in r_list:
match = re.findall(r,data)
for site in match:
site = site[:-6]
print site
phone()
pipl()
Here is my traceback:
Traceback (most recent call last):
File "/home/lazarov/.spyder2/.temp.py", line 48, in <module>
pipl()
File "/home/lazarov/.spyder2/.temp.py", line 46, in pipl
phone()
File "/home/lazarov/.spyder2/.temp.py", line 25, in phone
usock = urllib2.urlopen(site)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
After manually debugging the code I found that the error comes from the function phone(), so I tried to run just that piece :
import re
import urllib2
url = 'http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\d{3}-\d{4})", data)
if m_obj:
print m_obj.group(0)
And it worked. Which, I believe, shows it`s not that the firewall is actively denying the connection or the respective service is not started on the other site or is overloaded. Any help would be apreciated.
Usually the devil is in the detail.
according to your traceback...
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
and your source code...
site = "https://www." + site
...I may suppose that in your code you are trying to access https://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7
whereas in your test you are connecting to http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7
.
try to replace the https
with http
(at least for www.10digits.us
): probably the website you are trying to scraping does not respond to the port 443 but only to the port 80 (you can check it even with your browser)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With