I'm 99% sure something is going wrong with my hxs.select on this website: I cannot extract anything. When I run the following code, I don't get any error feedback, but title and link never get populated. Any help?
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items
Is there a way I can debug this? I also tried the scrapy shell command with a URL, but when I enter view(response) in the shell it simply returns True, and a text file opens instead of my web browser.
>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
>>> hxs.select('//div')
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'NoneType' object has no attribute 'select'
>>> view(response)
True
>>> hxs.select('//body')
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'NoneType' object has no attribute 'select'
You can run Scrapy under pdb from the command line and set a breakpoint in your spider file. It takes a few steps, and they may differ slightly when debugging on Windows.
Locate your scrapy executable:
$ whereis scrapy
/usr/local/bin/scrapy
Run it as a Python script under pdb:
$ python -m pdb /usr/local/bin/scrapy crawl quotes
Once in the debugger shell, open another shell instance and find the absolute path to your spider script (which lives in your Scrapy project):
$ realpath path/to/your/spider.py
/absolute/spider/file/path.py
This will output the absolute path. Copy it to your clipboard.
In the pdb shell type:
b /absolute/spider/file/path.py:line_number
...where line_number is the line at which you want execution to stop in that file.
Then enter c in the debugger to continue running until the breakpoint is hit. Now go do some PythonFu :)
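The steps above can be sketched as a small Python helper that assembles the same two commands. This is only an illustration: the function names build_pdb_command and breakpoint_command are made up for this example and are not part of Scrapy or pdb, and the spider name, path and line number are placeholders.

```python
import os
import shlex
import shutil
import sys


def build_pdb_command(spider_name, scrapy_path=None):
    """Return the argv list that runs `scrapy crawl <spider>` under pdb."""
    # Fall back to searching PATH, much like `whereis scrapy` / `which scrapy`.
    scrapy_path = scrapy_path or shutil.which("scrapy")
    if scrapy_path is None:
        raise FileNotFoundError("scrapy executable not found on PATH")
    return [sys.executable, "-m", "pdb", scrapy_path, "crawl", spider_name]


def breakpoint_command(spider_file, line_number):
    """Return the pdb `b` command for the absolute path of a spider file."""
    # os.path.realpath plays the role of the `realpath` shell command.
    return "b {}:{}".format(os.path.realpath(spider_file), line_number)


if __name__ == "__main__":
    # Placeholder values; substitute your own executable path, spider and line.
    print(shlex.join(build_pdb_command("quotes", "/usr/local/bin/scrapy")))
    print(breakpoint_command("path/to/your/spider.py", 12))
```

Run the first printed line in a terminal, then paste the second at the (Pdb) prompt and enter c.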
Step 1: find the path to your scrapy executable:
$ which scrapy
/Users/whatever/tutorial/tutorial/env/bin/scrapy
For me it was at /Users/whatever/tutorial/tutorial/env/bin/scrapy. Copy that path.
Go to the Debug tab in VS Code, click "Add configuration", and use the following template:
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "args": ["crawl", "NAME_OF_SPIDER"],
            "type": "python",
            "request": "launch",
            "program": "PATH_TO_SCRAPY_FILE",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}
In that template, replace NAME_OF_SPIDER with the name of your spider (in my case datasets) and PATH_TO_SCRAPY_FILE with the output you got in step 1 (in my case /Users/whatever/tutorial/tutorial/env/bin/scrapy).
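If you would rather not edit the template by hand, the same configuration can be generated with a few lines of Python. This is a rough sketch: make_launch_config is a hypothetical helper name, and the spider name and path below are just the example values from this answer.

```python
import json
import shutil


def make_launch_config(spider_name, scrapy_path=None):
    """Build the launch.json structure shown above as a Python dict."""
    # Look up scrapy on PATH (like `which scrapy`); if that fails, keep the
    # template placeholder so it is obvious the value must still be filled in.
    scrapy_path = scrapy_path or shutil.which("scrapy") or "PATH_TO_SCRAPY_FILE"
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Python: Current File",
                "args": ["crawl", spider_name],
                "type": "python",
                "request": "launch",
                "program": scrapy_path,
                "console": "integratedTerminal",
                "justMyCode": False,
            }
        ],
    }


if __name__ == "__main__":
    config = make_launch_config(
        "datasets", "/Users/whatever/tutorial/tutorial/env/bin/scrapy"
    )
    # Paste the printed JSON into .vscode/launch.json.
    print(json.dumps(config, indent=4))
```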