
Why is the connection refused?

I am creating a web scraping script and divided it into four pieces. Separately they all work perfectly; however, when I put them all together I get the following error: urlopen error [Errno 111] Connection refused. I have looked at similar questions to mine and have tried to catch the error with try-except (a sketch of that attempt follows the code below), but even that doesn't work. My all-in-one code is:

from selenium import webdriver
import re
import urllib2
site = ""

def phone():
    global site
    site = "https://www." + site
    if "spokeo" in site:
        # spokeo: load the page in a real browser via Selenium
        browser = webdriver.Firefox()
        browser.get(site)
        content = browser.page_source
        browser.quit()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
        if m_obj:
            print m_obj.group(0)
    elif "addresses" in site:
        # addresses.com: plain urllib2 fetch, match (xxx) xxx-xxxx
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)
    else:
        # any other site (10digits.us): plain urllib2 fetch
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\d{3}-\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)

def pipl():
    global site
    url = "https://pipl.com/search/?q=tom+jones&l=Phoenix%2C+AZ%2C+US&sloc=US|AZ|Phoenix&in=6"
    # fetch the pipl search results page
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    # patterns for the profile links to follow on each site
    r_list = [#re.compile("spokeo.com/[^\s]+"),
             re.compile("addresses.com/[^\s]+"),
             re.compile("10digits.us/[^\s]+")]
    for r in r_list:
        match = re.findall(r, data)
        for site in match:
            site = site[:-6]  # drop the last 6 characters of the match
            print site
            phone()

pipl()
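
For reference, the try-except I mentioned looked roughly like this (a minimal sketch; fetch is just an illustrative name, and in the real script the try-except wrapped the urlopen calls inside phone()):

import urllib2

def fetch(url):
    # Report a refused connection instead of letting it
    # crash the whole run.
    try:
        usock = urllib2.urlopen(url)
        data = usock.read()
        usock.close()
        return data
    except urllib2.URLError as e:
        print "could not open %s: %s" % (url, e)
        return None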

Here is my traceback:

Traceback (most recent call last):
  File "/home/lazarov/.spyder2/.temp.py", line 48, in <module>
    pipl()
  File "/home/lazarov/.spyder2/.temp.py", line 46, in pipl
    phone()
  File "/home/lazarov/.spyder2/.temp.py", line 25, in phone
    usock = urllib2.urlopen(site)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

After manually debugging the code I found that the error comes from the function phone(), so I tried to run just that piece:

import re
import urllib2
url = 'http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\d{3}-\d{4})", data)
if m_obj:
    print m_obj.group(0)

And it worked. Which, I believe, shows that the firewall is not actively denying the connection, and that the service on the other end is neither down nor overloaded. Any help would be appreciated.


1 Answer

Usually the devil is in the detail.

According to your traceback...

File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)

and your source code...

site = "https://www." + site

...it looks like your code is trying to access https://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7, whereas your test connects to http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7.
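
One quick way to confirm that nothing is answering on port 443 is to probe both ports directly (a minimal sketch using the standard socket module; the host name is taken from your URL):

import socket

for port in (80, 443):
    s = socket.socket()
    s.settimeout(5)
    try:
        s.connect(("www.10digits.us", port))
        print "port %d: connection accepted" % port
    except socket.error as e:
        print "port %d: %s" % (port, e)
    finally:
        s.close()

A refused connection on 443 alongside a successful connect on 80 would confirm the diagnosis.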

Try replacing https with http (at least for www.10digits.us): the website you are trying to scrape probably does not respond on port 443 but only on port 80 (you can check this in your browser).
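
If you are not sure which scheme each site accepts, one option is a small fallback helper (a sketch; open_with_fallback is a hypothetical name, not something from your code or the standard library):

import urllib2

def open_with_fallback(bare_url):
    # Try https first; if that fails (for example, nothing is
    # listening on port 443), retry the same URL over plain http.
    last_err = None
    for scheme in ("https://www.", "http://www."):
        try:
            return urllib2.urlopen(scheme + bare_url)
        except urllib2.URLError as e:
            last_err = e
    raise last_err

phone() could then call open_with_fallback(site) with the bare site string instead of hard-coding the https prefix.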
