I am trying to extract data from the Civic Commons Apps link for my project. I am able to obtain the links of the pages that I need, but when I try to open those links I get "urlopen error [Errno -2] Name or service not known".
The web-scraping Python code:
from bs4 import BeautifulSoup
from urlparse import urlparse, parse_qs
import re
import urllib2
import pdb

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/apps"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))

list_of_next_pages = []
for categorized_apps_url in list_of_links:
    categorized_apps_page = urllib2.urlopen(categorized_apps_url)
    categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())
    last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
    if last_page_tag:
        last_page_url = base_url + last_page_tag.get('href')
        index_value = last_page_url.find("page=") + 5
        base_url_for_next_page = last_page_url[:index_value]
        for pageno in xrange(0, int(parse_qs(urlparse(last_page_url).query)['page'][0]) + 1):
            list_of_next_pages.append(base_url_for_next_page + str(pageno))
    else:
        list_of_next_pages.append(categorized_apps_url)
I get the following error:
urllib2.urlopen(categorized_apps_url)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
Should I take care of anything specific when I call urlopen? I don't see a problem with the HTTP links that I get.
[edit] On a second run I got the same URLError traceback as above.
The same code runs fine on my friend's Mac, but fails on my Ubuntu 12.04 machine.
I also tried running the code on ScraperWiki and it finished successfully, but a few URLs were missing (compared to the Mac run). Is there any reason for this behavior?
The code works on my Mac and on your friend's Mac. It also runs fine from a virtual-machine instance of Ubuntu 12.04 Server. There is obviously something in your particular environment, either your OS (Ubuntu Desktop?) or your network, that is causing it to fail. For example, my home router's default settings throttle the number of calls to the same domain within x seconds, which could cause this kind of issue if I didn't turn it off. It could be a number of things.
At this stage I would suggest refactoring your code to catch the URLError and set aside problematic URLs for a retry. Also log/print the errors if they still fail after several retries. Maybe even throw in some code to time your calls between errors. That is better than having your script fail outright, and you'll get feedback as to whether it is just particular URLs causing the problem or a timing issue (i.e. does it fail after x number of urlopen calls, or does it fail after x number of urlopen calls in x amount of micro/seconds?). If it's a timing issue, a simple time.sleep(1) inserted into your loops might do the trick.
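A minimal sketch of the catch-and-retry idea might look like this. The helper name `fetch_with_retry` and its parameters are my own; the callable passed in would be `urllib2.urlopen` in your code, but any function that raises on failure works:

```python
import time

def fetch_with_retry(open_func, url, retries=3, delay=1.0):
    """Try open_func(url), retrying on failure with a pause between attempts.

    open_func would be urllib2.urlopen in the question's code; catching
    urllib2.URLError specifically would be tighter than Exception here.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return open_func(url)
        except Exception as err:
            last_error = err
            print("attempt %d failed for %s: %s" % (attempt + 1, url, err))
            time.sleep(delay)
    # All retries failed: re-raise so the caller can set this URL aside
    raise last_error
```

In your loop you would then collect the URLs that still fail after all retries into a separate list instead of letting the script crash, and inspect that list afterwards.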
SyncMaster,
I ran into the same issue recently after jumping onto an old Ubuntu box I hadn't played with in a while. This issue is actually caused by the DNS settings on your machine. I would highly recommend that you check your DNS settings (open /etc/resolv.conf and add nameserver 8.8.8.8) and then try again; you should meet success.
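For reference, a quick way to check the resolver on Ubuntu is shown below. 8.8.8.8 is Google's public DNS server, used here only as a well-known example; note that on Ubuntu 12.04 the resolvconf tool may overwrite /etc/resolv.conf, so a manual edit is a temporary fix:

```shell
# Show the nameservers the system is currently using
cat /etc/resolv.conf

# Test whether DNS resolution works at all (nslookup is in the dnsutils package)
nslookup civiccommons.org

# Temporarily add a public nameserver (requires root)
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf
```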