Python Web Scraping - urlopen error [Errno -2] Name or service not known

I am trying to extract data from the Civic Commons Apps link for my project. I am able to obtain the links of the pages that I need, but when I try to open those links I get "urlopen error [Errno -2] Name or service not known".

The web scraping Python code:

from bs4 import BeautifulSoup
from urlparse import urlparse, parse_qs
import re
import urllib2

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/apps"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = []

# Collect every category link on the apps page.
for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)

# Remove duplicate links.
list_of_links = list(set(list_of_links))

list_of_next_pages = []
for categorized_apps_url in list_of_links:
    categorized_apps_page = urllib2.urlopen(categorized_apps_url)
    categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())

    # If the category is paginated, read the page number out of the
    # "Go to last page" link and build a URL for every page up to it.
    last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
    if last_page_tag:
        last_page_url = base_url + last_page_tag.get('href')
        index_value = last_page_url.find("page=") + 5
        base_url_for_next_page = last_page_url[:index_value]
        last_page_no = int(parse_qs(urlparse(last_page_url).query)['page'][0])
        for pageno in xrange(0, last_page_no + 1):
            list_of_next_pages.append(base_url_for_next_page + str(pageno))
    else:
        list_of_next_pages.append(categorized_apps_url)

I get the following error:

urllib2.urlopen(categorized_apps_url)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>

Is there anything specific I should take care of when I call urlopen? I ask because I don't see a problem with the HTTP links that I get.

[edit] On a second run I got the following error:

 File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)

The same code runs fine on my friend's Mac, but fails on my Ubuntu 12.04 machine.

I also tried running the code on ScraperWiki and it finished successfully, but a few URLs were missing (compared to the Mac run). Is there any reason for this behavior?

asked Jul 23 '12 by SyncMaster


2 Answers

The code works on my Mac and on your friend's Mac, and it runs fine from a virtual machine instance of Ubuntu 12.04 server. There is obviously something in your particular environment, either your OS (Ubuntu Desktop?) or your network, that is causing it to crap out. For example, my home router's default settings throttle the number of calls to the same domain within x seconds, which could cause this kind of issue if I didn't turn it off. It could be a number of things.

At this stage I would suggest refactoring your code to catch the URLError and set problematic URLs aside for a retry, and to log/print the errors if they still fail after several retries. You could even throw in some code to time your calls between errors. That is better than having your script simply fail outright, and you will get feedback on whether the problem is particular URLs or a timing issue (i.e. does it fail after x urlopen calls, or after x urlopen calls in x amount of micro/seconds?). If it is a timing issue, a simple time.sleep(1) inserted into your loops might do the trick.
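
For illustration, a minimal sketch of that retry-and-log approach, using the same urllib2 calls as the question. fetch_with_retry, the retry count, and the delay are my own names and arbitrary values, not anything from the original code:

import time
import urllib2

def fetch_with_retry(url, retries=3, delay=1):
    # Try the URL a few times, sleeping between attempts so a throttling
    # router or a flaky resolver gets a chance to recover. Returns None
    # if every attempt fails, so the caller can set the URL aside.
    for attempt in xrange(retries):
        try:
            return urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print "attempt %d failed for %s: %s" % (attempt + 1, url, e)
            time.sleep(delay)
    return None

failed_urls = []
for categorized_apps_url in list_of_links:
    html = fetch_with_retry(categorized_apps_url)
    if html is None:
        failed_urls.append(categorized_apps_url)  # retry or inspect later
        continue
    categorized_apps_soup = BeautifulSoup(html)
    # ... process categorized_apps_soup as in the original loop ...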

answered Nov 14 '22 by Mark Gemmill


SyncMaster,

I ran into the same issue recently after jumping onto an old Ubuntu box I hadn't played with in a while. This issue is actually caused by the DNS settings on your machine. I would highly recommend that you check your DNS settings (edit /etc/resolv.conf and add nameserver 8.8.8.8) and then try again; you should meet success.
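
As a quick check, here is a small diagnostic sketch (my own addition, not part of the answer) that confirms from Python whether name resolution is the culprit:

import socket

try:
    # urllib2 performs this same name lookup internally; [Errno -2]
    # "Name or service not known" is a getaddrinfo/DNS failure, so if
    # this raises socket.gaierror the problem is DNS, not your code.
    print socket.gethostbyname("civiccommons.org")
except socket.gaierror as e:
    print "DNS lookup failed:", e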

answered Nov 14 '22 by Sly D