
Python requests.get showing 404 while url does exists

I'm trying to open this URL:

http://www.leboncoin.fr/montres_bijoux/671762293.htm

import requests
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36'
s.headers['Host'] = 'www.leboncoin.fr'
url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm'
r = s.get(url)
print(r.text)

When I run this script, it prints this error in my terminal:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /montres_bijoux/671762293.htm  was not found on this server.</p>
</body></html>

However, I can open the same URL in my browser and see the content.

What could be the issue?

user3810188 asked Jul 23 '14 18:07


1 Answer

Without even waiting for your test, I'm pretty confident I know what your bug is.

You wrote: "If I put this URL manually in the function call, it works fine, but if I read the file and call the function directly with that URL, it gives me an error. I have put 3-4 checks in place while reading the file, and the URL is coming from the file correctly. I even tried printing the URL inside the called function, and I'm receiving it there too. I still have no clue what is happening."

Most likely you're reading the URL with something like for line in file:, file.readline(), or some other call that preserves newlines. So what you actually end up with is not this:

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm'

… but this:

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'

The latter will be escaped by requests into something that's a perfectly good URL for a resource that doesn't exist, hence the 404 error.
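To make that concrete, here is a minimal sketch, assuming Python 3. The in-memory url_file is a hypothetical stand-in for your real file, and urllib.parse.quote is used only to illustrate percent-encoding; it is not necessarily the exact code path requests takes internally.

```python
import io
from urllib.parse import quote  # Python 3; stand-in for the escaping step

# Hypothetical stand-in for a file with one URL per line.
url_file = io.StringIO("http://www.leboncoin.fr/montres_bijoux/671762293.htm\n")

# Iterating over a file yields each line with its trailing newline intact.
url_from_file = next(iter(url_file))
print(url_from_file.endswith("\n"))  # True

# Percent-encoding the path turns the newline into %0A, which names a
# different (nonexistent) resource on the server.
path = "/montres_bijoux/671762293.htm\n"
escaped_path = quote(path)
print(escaped_path)  # /montres_bijoux/671762293.htm%0A
```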

The best way to check this is to print(repr(url)) instead of print(url). This will also find other possible problems, like embedded nonprintable characters. It won't find everything, like Unicode characters that look like . but actually aren't, but it's a good first test. (And if that doesn't find it, for a second test, copy and paste from the output, quotes and all, into your test script.)
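A rough sketch of that check, assuming Python 3. The helper suspicious_chars is a hypothetical name introduced here for illustration; it just lists whitespace and nonprintable characters so they are easier to spot than in the raw repr.

```python
url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'

# print(url) hides the trailing newline; repr makes it visible.
print(repr(url))  # 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'


def suspicious_chars(url):
    """Return (index, repr(char)) pairs for whitespace or nonprintable characters."""
    return [
        (i, repr(ch))
        for i, ch in enumerate(url)
        if ch.isspace() or not ch.isprintable()
    ]


# A clean URL gives an empty list; this one reports the final character.
print(suspicious_chars(url))
```

Like repr itself, this will not flag printable look-alike Unicode characters.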

If this is the problem, the fix is simple:

url = url.rstrip()
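
If the URLs come from a file, the cleanup can happen once at read time. This is only a sketch under assumptions: read_urls is a hypothetical helper, the in-memory file stands in for your real one, and it uses strip() rather than rstrip() so that leading whitespace is dropped as well.

```python
import io


def read_urls(lines):
    """Yield cleaned URLs from an iterable of lines, skipping blank lines."""
    for line in lines:
        url = line.strip()  # removes the trailing newline and stray spaces
        if url:
            yield url


# Hypothetical stand-in for open('urls.txt'); any iterable of lines works.
url_file = io.StringIO(
    "http://www.leboncoin.fr/montres_bijoux/671762293.htm\n"
    "\n"
)

for url in read_urls(url_file):
    print(repr(url))  # no trailing \n; safe to pass to s.get(url)
```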

abarnert answered Oct 20 '22 10:10