How can I debug Scrapy?

I'm 99% sure something is wrong with my hxs.select calls on this website: I cannot extract anything. When I run the following code I get no error feedback, but neither title nor link gets populated. Any help?

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items

Is there a way I can debug this? I also tried the scrapy shell command with a URL, but when I enter view(response) in the shell it simply returns True and a text file opens instead of my web browser.

>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'

>>> hxs.select('//div')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

>>> view(response)
True

>>> hxs.select('//body')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'
Gio asked Oct 20 '25 10:10


2 Answers

You can run Scrapy under pdb from the command line and set a breakpoint in your spider file. It takes a few steps.

(The steps may differ slightly on Windows.)

  1. Locate your scrapy executable:

    $ whereis scrapy
    /usr/local/bin/scrapy
    
  2. Run it as a Python script under pdb:

    $ python -m pdb /usr/local/bin/scrapy crawl quotes
    
  3. Once in the debugger shell, open another shell instance and find the absolute path to your spider script (it lives in your Scrapy project):

    $ realpath path/to/your/spider.py
    /absolute/spider/file/path.py
    

This will output the absolute path. Copy it to your clipboard.

  4. In the pdb shell, type:

    b /absolute/spider/file/path.py:line_number
    

...where line_number is the line at which you want execution to break in that file.

  5. Type c (continue) in the debugger to run until the breakpoint is hit.

Now go do some PythonFu :)
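
If you'd rather not type those commands by hand each time, steps 1 to 4 can be roughly sketched in Python. This is only an illustration, not part of Scrapy: the helper names pdb_launch_command and pdb_break_command are made up here, and the line number 12 is an arbitrary placeholder.

```python
import shutil
import sys
from pathlib import Path


def pdb_launch_command(spider_name, scrapy_path=None):
    """Build the argv for `python -m pdb <scrapy> crawl <spider>` (steps 1-2)."""
    # Fall back to searching PATH, much like `whereis`/`which` would.
    scrapy_path = scrapy_path or shutil.which("scrapy")
    if scrapy_path is None:
        raise FileNotFoundError("scrapy executable not found on PATH")
    return [sys.executable, "-m", "pdb", scrapy_path, "crawl", spider_name]


def pdb_break_command(spider_file, line_number):
    """Build the `b <absolute path>:<line>` command to type at the pdb prompt (steps 3-4)."""
    # resolve() turns a relative path into an absolute one, like `realpath`.
    return f"b {Path(spider_file).resolve()}:{line_number}"


if __name__ == "__main__":
    # Hypothetical values taken from the steps above.
    print(" ".join(pdb_launch_command("quotes", "/usr/local/bin/scrapy")))
    print(pdb_break_command("path/to/your/spider.py", 12))
```

Run the first printed command in a terminal, paste the second one at the (Pdb) prompt, then type c.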

deostroll answered Oct 22 '25 01:10


Using VSCode:

1. Locate where your scrapy executable is:

$ which scrapy
/Users/whatever/tutorial/tutorial/env/bin/scrapy

For me it was at /Users/whatever/tutorial/tutorial/env/bin/scrapy; copy that path.

2. Create a launch.json file

Go to the debug tab in VSCode and click "Add configuration".

3. Paste the following template into the launch.json

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "args": ["crawl", "NAME_OF_SPIDER"],
            "type": "python",
            "request": "launch",
            "program": "PATH_TO_SCRAPY_FILE",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}

In that template, replace NAME_OF_SPIDER with the name of your spider (in my case, datasets) and PATH_TO_SCRAPY_FILE with the output from step 1 (in my case, /Users/whatever/tutorial/tutorial/env/bin/scrapy).

4. Check that VSCode was opened at the root of your scrapy project

5. Set a breakpoint and click debug!
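
If you set this up for several spiders, the template from step 3 could also be generated instead of edited by hand. A minimal sketch, assuming the same fields as the template above; build_launch_config is a hypothetical helper name and the "name" label is just an illustrative choice:

```python
import json


def build_launch_config(scrapy_path, spider_name):
    """Return a launch.json-style dict that runs `scrapy crawl <spider>` under the debugger."""
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": f"Scrapy: crawl {spider_name}",  # illustrative label only
                "type": "python",
                "request": "launch",
                "program": scrapy_path,  # output of `which scrapy` from step 1
                "args": ["crawl", spider_name],
                "console": "integratedTerminal",
                "justMyCode": False,
            }
        ],
    }


if __name__ == "__main__":
    config = build_launch_config(
        "/Users/whatever/tutorial/tutorial/env/bin/scrapy", "datasets"
    )
    # Paste the printed JSON into .vscode/launch.json.
    print(json.dumps(config, indent=4))
```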

Z0B answered Oct 22 '25 01:10