 

Checking if a website exists with Python 3

Tags:

python-3.x

Sorry if this is a duplicate; I've been looking for answers for about an hour and can't seem to find any. Anyway, I have a text file full of URLs and I want to check each one to see whether it exists or not. I need some help understanding the error message, and whether there are ways to fix it or different methods I can use.

Here's my code

import requests

filepath = 'url.txt'
with open(filepath) as fp:
    url = fp.readline()
    count = 1
    while count != 677: # Runs through each line of my txt file
        print(url)
        request = requests.get(url) # Here is where I'm getting the error
        if request.status_code == 200:
            print('Web site exists')
        else:
            print('Web site does not exist')
        url = url.strip()
        count += 1

And this is the output

http://www.pastaia.co

Traceback (most recent call last):
  File "python", line 9, in <module>
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.pastaia.co%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fca82769e10>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Asked by Ayden Hubenak

1 Answer

I'll throw in some ideas to get you started; whole careers are built around spidering :) By the way, http://www.pastaia.co seems to just be down, and that's a big part of the trick: how to handle the unexpected when crawling the web. Also notice the %0a in the host name in your traceback: that's the URL-encoded newline still attached to each line you read from the file, so you need to strip() the URL before requesting it, not after. Ready? Here we go...

import requests

filepath = 'url.txt'
with open(filepath) as fp:
    for url in fp:
        url = url.strip() # drop the trailing newline that caused the %0a error
        print(url)
        try:
            request = requests.get(url, timeout=10)
            if request.status_code == 200:
                print('Web site exists')
            else:
                print('Web site returned', request.status_code)
        except requests.exceptions.RequestException:
            print('Web site does not exist')
  • Make it a for loop; you just want to loop over the whole file, right? That also gets rid of the hard-coded count of 677.
  • Wrap the request in try/except so that if it blows up for whatever reason (and there can be lots: bad DNS, a timeout, an SSL error; the web is the wild wild west) the code won't crash, and you can move on to the next site in the list and record the error however you'd like.
  • You can add other kinds of conditions in there too; perhaps the page needs to be a certain length? Just because the response code is 200 doesn't always mean the page is valid, only that the site returned success, but it's a good place to start. (There's a sketch of this after the list.)
  • Consider adding a user-agent to your request; you may want to mimic a browser, or perhaps have your program identify itself as super bot 9000. (Also shown in the first sketch below.)
  • If you want to get further into spidering and parsing the text, look at BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/ (see the second sketch below.)
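For example, here's a minimal sketch of a more defensive check that combines a few of those ideas: a custom User-Agent, a timeout, and a sanity check on the body. The header value and the length threshold are purely illustrative, not anything canonical:

import requests

HEADERS = {'User-Agent': 'super-bot-9000/1.0'} # illustrative; identify your crawler however you like

def site_looks_alive(url, min_length=100):
    """Return True if url answers 200 with a body of at least min_length bytes."""
    try:
        response = requests.get(url.strip(), headers=HEADERS, timeout=10)
    except requests.exceptions.RequestException:
        return False # covers DNS failures, timeouts, SSL errors, refused connections, ...
    # A 200 alone doesn't prove the page is valid, so also sanity-check the body size.
    return response.status_code == 200 and len(response.content) >= min_length

print(site_looks_alive('http://www.pastaia.co'))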
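And if you do reach for BeautifulSoup, a minimal sketch of parsing a fetched page might look like this (assuming the beautifulsoup4 package is installed; the URL and the tags pulled out are just examples):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.crummy.com/software/BeautifulSoup/', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Print the page title and every link the page contains.
print(soup.title.string if soup.title else 'no title found')
for link in soup.find_all('a'):
    print(link.get('href'))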
Answered by sniperd