 

Python Requests getting ('Connection aborted.', BadStatusLine("''",)) error

def download_torrent(url):
    fname = os.getcwd() + '/' + url.split('title=')[-1] + '.torrent'
    try:
        schema = ('http:')
        r = requests.get(schema + url, stream=True)
        with open(fname, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
    except requests.exceptions.RequestException as e:
        print('\n' + OutColors.LR + str(e))
        sys.exit(1)

    return fname

In that block of code I am getting an error when I run the full script. When I go to actually download the torrent, I get:

('Connection aborted.', BadStatusLine("''",))

I've only posted the block of code I think is relevant above; the entire script is linked below. It's from pantuts, but I don't think it's maintained any longer, and I'm trying to get it running with Python 3. From my research, the error might mean I'm using http instead of https, but I have tried both.

Original script

asked Oct 16 '15 by eurabilis



3 Answers

The error you get indicates the host isn't responding in the expected manner. In this case, it's because the host detects that you're trying to scrape it and deliberately disconnects you.

If you try your requests code with this URL from a test website: http://mirror.internode.on.net/pub/test/5meg.test1, you'll see that it downloads normally.

To get around this, fake your user agent. Your user agent identifies your web browser, and web hosts commonly check it to detect bots.

Use the headers field to set your user agent. Here's an example that tells the web host you're Firefox:

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
r = requests.get(url, headers=headers)
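
Applied to the download_torrent function from the question, the fix might look like the sketch below (the OutColors class here is just a stand-in for the colour constants defined elsewhere in the original script):

import os
import sys

import requests

class OutColors:
    LR = '\033[91m'  # stand-in for the colour constant defined in the original script

def download_torrent(url):
    fname = os.getcwd() + '/' + url.split('title=')[-1] + '.torrent'
    # Pretend to be Firefox so the host doesn't drop the connection.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
    try:
        r = requests.get('http:' + url, headers=headers, stream=True)
        r.raise_for_status()  # turn 4xx/5xx responses into exceptions
        with open(fname, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
    except requests.exceptions.RequestException as e:
        print('\n' + OutColors.LR + str(e))
        sys.exit(1)

    return fname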

There are lots of other discrepancies[1] between bots and human-operated browsers that web hosts can check for, but the user agent is one of the easiest and most common ones.

If you want your scraper to be harder to detect, you'll want to use a headless browser like headless Chrome[2] (or ghost.py if you want to stick with Python), which you can trust will behave like a real browser (because it is!).
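
For instance, a minimal Selenium sketch driving headless Chrome might look like this (it assumes the selenium package and a matching ChromeDriver are installed; the URL is a placeholder):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('http://example.com/some-page')  # placeholder URL
html = driver.page_source                   # the fully rendered page, as a real browser sees it
driver.quit()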


Footnotes:

[1] Possible other checks include whether images are downloaded, whether page resources are fetched in the normal order, whether pages are requested faster than a human could read them, and whether cookies are set properly. Google even flags mouse movements it deems insufficiently human-like.

[2] Headless Chrome is the most capable headless browser as of 2018, but if its weight is a problem for you, its slightly outdated predecessors, PhantomJS and ghost.py, are lighter and still usable.

answered Oct 09 '22 by sorbet

Try this:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'Referer': 'https://www.google.com/'
}

r = requests.get("http://yourdomain.com/", headers=headers)
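
Continuing the snippet above, you could confirm the host actually accepted the request (a small sketch):

r.raise_for_status()                  # raises requests.exceptions.HTTPError for 4xx/5xx responses
print(r.status_code, len(r.content))  # e.g. 200 and the size of the body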
answered Oct 09 '22 by Mkurbanov


In my case, I had to remove the User-Agent field from the headers:

url = 'https://...'
headers = {}  # no custom User-Agent; requests falls back to its default
requests.get(url, headers=headers)

As soon as I set a 'User-Agent', I got ('Connection aborted.', BadStatusLine("''",)), and the error occurred only with that particular site. This is my first post; I've gotten a lot of help from this site and hope this helps others who end up here.
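
If you need to handle both kinds of site in one script, a small per-site fallback might look like this (a sketch; the URL is a placeholder):

import requests

url = 'https://example.com/page'  # placeholder for a site that rejects a spoofed User-Agent
browser_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}

try:
    r = requests.get(url, headers=browser_headers)
except requests.exceptions.ConnectionError:
    # Some hosts abort the connection when they see the spoofed User-Agent;
    # retry with requests' default headers instead.
    r = requests.get(url)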

answered Oct 09 '22 by M.ison