
HTTP 403 Responses when using Python Scrapy

I am using Python 2.7 (64-bit, from Python.org) on 64-bit Windows Vista. I have been testing the following Scrapy code, which recursively scrapes all the pages at www.whoscored.com, a football statistics site:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()), 
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]
    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        # Grab the normalised page title, then dump the text of every <p> element
        titles = response.selector.xpath("normalize-space(//title)")
        for title in titles:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')


execute(['scrapy','crawl','goal3'])

The code executes without any errors, however of the 4623 pages scraped, 217 got an HTTP response code of 200, 2 got a code of 302 and 4404 got a 403 response. Can anyone see anything immediately obvious in the code as to why this might be? Could this be an anti-scraping measure by the site? Is it usual practice to throttle the rate of requests to stop this happening?

Thanks

asked Jul 17 '14 by gdogg371

People also ask

How do I fix HTTP Error 403 Forbidden in Python?

This error is caused by mod_security (or a similar server-side module) detecting the default urllib user agent of the scraping bot and blocking it. Therefore, in order to resolve it, we have to include a browser-like user-agent in our scraper.
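As a rough sketch of that fix in plain urllib2 (Python 2, to match the question; the User-Agent string is just an example value):

import urllib2

url = 'http://www.whoscored.com/'
# Supply a browser-style User-Agent so the server does not block the default urllib one
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; Win64; x64)'})
html = urllib2.urlopen(request).read()

In Scrapy itself, the equivalent is the USER_AGENT setting in the project's settings.py.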

How do I bypass 403 Forbidden requests in Python?

If you still get a 403 Forbidden after adding a User-Agent, you may need to add more headers, such as Referer: headers = { 'User-Agent': '...', 'Referer': 'https://...' }. The header values can be found under Network > Headers > Request Headers in the browser's Developer Tools (press F12 to open them).
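A minimal sketch of that with the requests library, where both header values are placeholders you would copy from the browser's Network tab:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; Win64; x64)',  # placeholder browser UA
    'Referer': 'http://www.whoscored.com/',                    # placeholder referer
}
response = requests.get('http://www.whoscored.com/', headers=headers)
print response.status_code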

What does Scrapy request return?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
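For illustration, a minimal spider sketch using the same old-style imports as the question (the class name and follow-up URL are placeholders):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MiniSpider(BaseSpider):
    name = 'mini'
    start_urls = ['http://www.whoscored.com/']

    def parse(self, response):
        # The Downloader has already fetched the page; the Response carries url, status and body
        self.log('Got %s with status %d' % (response.url, response.status))
        # Queue a further Request; its Response will be handed to parse_page
        yield Request('http://www.whoscored.com/Statistics', callback=self.parse_page)

    def parse_page(self, response):
        self.log('Follow-up response from %s' % response.url)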


1 Answer

HTTP status code 403 means Forbidden / Access Denied.
HTTP status code 302 is a redirect; no need to worry about those.
Nothing appears to be wrong with your code.

Yes, it's definitely an anti-scraping measure implemented by the site.

Refer to these guidelines from the Scrapy docs: Avoiding getting banned
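A few of those guidelines map directly onto project settings; a rough sketch of a settings.py along those lines (the values are illustrative, not taken from the docs):

# settings.py -- illustrative values only
BOT_NAME = 'goal3'

# Use a browser-like User-Agent instead of the Scrapy default
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.0; Win64; x64)'

# Slow the crawl down and add some jitter between requests
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Some sites use cookies to spot bots
COOKIES_ENABLED = False

# Let the AutoThrottle extension adapt the delay to server load
AUTOTHROTTLE_ENABLED = True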

Also, you should consider pausing and resuming crawls.
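Since the question already launches the spider via scrapy.cmdline.execute, the same call can pass a JOBDIR setting so the crawl state is persisted and can be resumed later (the directory name below is arbitrary):

from scrapy.cmdline import execute

# Scheduler state is saved under crawls/goal3-1; re-running the same command resumes the crawl
execute(['scrapy', 'crawl', 'goal3', '-s', 'JOBDIR=crawls/goal3-1'])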

answered Oct 19 '22 by Girish