I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code here.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from dmoz2.items import DmozItem
class DmozSpider(BaseSpider):
name = "namastecopy2"
allowed_domains = ["namastefoods.com"]
start_urls = [
"http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
"http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12",
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('/html/body/div/div[2]/table/tr/td[2]/table/tr')
items = []
for site in sites:
item = DmozItem()
item['manufacturer'] = 'Namaste Foods'
item['productname'] = site.select('td/h1/text()').extract()
item['description'] = site.select('//*[@id="info-col"]/p[7]/strong/text()').extract()
item['ingredients'] = site.select('td[1]/table/tr/td[2]/text()').extract()
item['ninfo'] = site.select('td[2]/ul/li[3]/img/@src').extract()
#insert code that will save the above image path for ninfo as an absolute path
base_url = get_base_url(response)
relative_url = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = urljoin_rfc(base_url, relative_url)
items.append(item)
return items
My items.py looks like this:
from scrapy.item import Item, Field
class DmozItem(Item):
# define the fields for your item here like:
productid = Field()
manufacturer = Field()
productname = Field()
description = Field()
ingredients = Field()
ninfo = Field()
imagename = Field()
image_paths = Field()
relative_images = Field()
image_urls = Field()
pass
I need the relative paths that the spider is getting for items['relative_images'] converted to absolute paths & saved in items['image_urls'] so that I can download the images from within this spider itself. For example, the relative_images path that the spider fetches is '../../files/images/small/8270-BrowniesHiResClip.jpg', this should be converted to 'http://namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg', & stored in items['image_urls']
I also will need the items['ninfo'] path to be stores as an absolute path.
Error when running the above code:
2011-06-28 17:18:11-0400 [scrapy] INFO: Scrapy 0.12.0.2541 started (bot: dmoz2)
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled item pipelines: MyImagesPipeline
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-06-28 17:18:11-0400 [namastecopy2] INFO: Spider opened
2011-06-28 17:18:12-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: None)
2011-06-28 17:18:12-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: <None>)
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
self.runUntilCurrent()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
self._startRunCallbacks(result)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
item['image_urls'] = urljoin_rfc(base_url, relative_url)
File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
unicode_to_str(ref, encoding))
File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list
2011-06-28 17:18:15-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: None)
2011-06-28 17:18:15-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: <None>)
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
self.runUntilCurrent()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
self._startRunCallbacks(result)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
item['image_urls'] = urljoin_rfc(base_url, relative_url)
File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
unicode_to_str(ref, encoding))
File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list
2 011-06-28 17:18:15-0400 [namastecopy2] INFO: Closing spider (finished)
2011-06-28 17:18:15-0400 [namastecopy2] INFO: Spider closed (finished)
Thanks.-TM
Use abspath() to Get the Absolute Path in Pythonpath property. To get the absolute path using this module, call path. abspath() with the given path to get the absolute path. The output of the abspath() function will return a string value of the absolute path relative to the current working directory.
The absolutePath function works by beginning at the starting folder and moving up one level for each "../" in the relative path. Then it concatenates the changed starting folder with the relative path to produce the equivalent absolute path.
Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. Absolute links are good when you want to send the user to a page that is outside of your server.
isabs() method in Python is used to check whether the specified path is an absolute path or not. On Unix platforms, an absolute path begins with a forward slash ('/') and on Windows it begins with a backward slash ('\') after removing any potential drive letter.
From Scrapy docs:
def parse(self, response):
# ... code ommited
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, self.parse)
that is, response
object has a method to do exactly this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With