I am using Python 2.7 64-bit from python.org on Windows Vista 64-bit. I have been testing the following Scrapy code to recursively scrape all the pages at www.whoscored.com, a football statistics site:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags

class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]

    # Follow every link on the site; matched pages go to parse_item.
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy', 'crawl', 'goal3'])
The code executes without any errors; however, of the 4623 pages scraped, 217 returned an HTTP response code of 200, 2 returned 302, and 4404 returned 403. Can anyone see anything immediately obvious in the code as to why this might be? Could this be an anti-scraping measure by the site? Is it usual practice to slow the rate of requests to stop this happening?
Thanks
This error is caused by mod_security (or a similar server-side protection) detecting the default user agent that Scrapy/urllib sends and blocking it. To resolve it, include a browser-like user agent in your scraper.
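In a Scrapy project, the simplest place to do that is the USER_AGENT setting; a minimal sketch, assuming the standard settings.py layout (the user-agent string below is only an illustration, so substitute a current one from your own browser):

    # settings.py -- send a browser-like User-Agent instead of Scrapy's default.
    # The UA string is only an example; copy a current one from your browser.
    USER_AGENT = ('Mozilla/5.0 (Windows NT 6.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/35.0.1916.153 Safari/537.36')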
If you still get a 403 Forbidden after adding a user agent, you may need to add more headers, such as Referer: headers = { 'User-Agent': '...', 'Referer': 'https://...' }. The headers can be found under Network > Headers > Request Headers in the browser's developer tools (press F12 to toggle them).
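A sketch of sending such headers with every request, assuming a reasonably recent Scrapy where scrapy.Spider is available; the spider name and header values are placeholders to be replaced with the real ones from your developer tools:

    import scrapy

    class HeadersSpider(scrapy.Spider):
        # Hypothetical example spider; header values are placeholders.
        name = "headers_example"
        start_urls = ["http://www.whoscored.com/"]

        def start_requests(self):
            headers = {
                'User-Agent': 'Mozilla/5.0 ...',          # copy from dev tools
                'Referer': 'http://www.whoscored.com/',   # copy from dev tools
            }
            for url in self.start_urls:
                yield scrapy.Request(url, headers=headers, callback=self.parse)

        def parse(self, response):
            self.log('Got HTTP %d from %s' % (response.status, response.url))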
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
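A minimal sketch of that cycle (the URL is reused from the question; the spider itself is illustrative):

    import urlparse
    import scrapy

    class FlowSpider(scrapy.Spider):
        # Each Request yielded here travels through the engine to the
        # Downloader; the resulting Response comes back to the callback
        # named in the Request.
        name = "flow_example"
        start_urls = ["http://www.whoscored.com/"]

        def parse(self, response):
            self.log('Downloader returned HTTP %d for %s'
                     % (response.status, response.url))
            # Yielding new Requests from a callback keeps the crawl going.
            for href in response.xpath('//a/@href').extract():
                yield scrapy.Request(urlparse.urljoin(response.url, href),
                                     callback=self.parse)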
HTTP Status Code 403 definitely means Forbidden / Access Denied.
HTTP Status Code 302 is for redirection of requests. No need to worry about them.
Nothing seems to be wrong in your code.
Yes, it's definitely an anti-scraping measure implemented by the site.
Refer to these guidelines from the Scrapy docs: Avoid Getting Banned
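Among other things, those guidelines suggest slowing down the crawl, which also answers your question about throttling requests. A settings.py sketch with illustrative values, not numbers tuned for this particular site:

    # settings.py -- throttle the crawl; the values are starting points only.
    DOWNLOAD_DELAY = 2.0                  # seconds between requests to one site
    CONCURRENT_REQUESTS_PER_DOMAIN = 1    # no parallel hammering of one domain
    AUTOTHROTTLE_ENABLED = True           # adapt the delay to server latency
    ROBOTSTXT_OBEY = True                 # respect the site's robots.txt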
Also, you should consider pausing and resuming crawls.
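Scrapy supports this through a job directory that persists the scheduler state; a sketch (the directory name is arbitrary):

    # settings.py -- persist crawl state so an interrupted crawl can resume.
    JOBDIR = 'crawls/goal3-1'

Equivalently, pass it on the command line with scrapy crawl goal3 -s JOBDIR=crawls/goal3-1; stop the crawl with Ctrl-C and re-run the same command to resume where it left off.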