Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Scrapy: Convert relative paths to absolute paths

Tags:

I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code here.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from dmoz2.items import DmozItem

class DmozSpider(BaseSpider):
   name = "namastecopy2"
   allowed_domains = ["namastefoods.com"]
   start_urls = [
    "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
    "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12",    

]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('/html/body/div/div[2]/table/tr/td[2]/table/tr')
    items = []
    for site in sites:
        item = DmozItem()
        item['manufacturer'] = 'Namaste Foods'
        item['productname'] = site.select('td/h1/text()').extract()
        item['description'] = site.select('//*[@id="info-col"]/p[7]/strong/text()').extract()
        item['ingredients'] = site.select('td[1]/table/tr/td[2]/text()').extract()
        item['ninfo'] = site.select('td[2]/ul/li[3]/img/@src').extract()
        #insert code that will save the above image path for ninfo as an absolute path
        base_url = get_base_url(response)
        relative_url = site.select('//*[@id="showImage"]/@src').extract()
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
        items.append(item)
    return items

My items.py looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    productid = Field()
    manufacturer = Field()
    productname = Field()
    description = Field()
    ingredients = Field()
    ninfo = Field()
    imagename = Field()
    image_paths = Field()
    relative_images = Field()
    image_urls = Field()
    pass

I need the relative paths that the spider is getting for items['relative_images'] converted to absolute paths & saved in items['image_urls'] so that I can download the images from within this spider itself. For example, the relative_images path that the spider fetches is '../../files/images/small/8270-BrowniesHiResClip.jpg', this should be converted to 'http://namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg', & stored in items['image_urls']

I also will need the items['ninfo'] path to be stores as an absolute path.

Error when running the above code:

2011-06-28 17:18:11-0400 [scrapy] INFO: Scrapy 0.12.0.2541 started (bot: dmoz2)
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled item pipelines: MyImagesPipeline
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-06-28 17:18:11-0400 [namastecopy2] INFO: Spider opened
2011-06-28 17:18:12-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: None)
2011-06-28 17:18:12-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: None)
2011-06-28 17:18:15-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2    011-06-28 17:18:15-0400 [namastecopy2] INFO: Closing spider (finished)
2011-06-28 17:18:15-0400 [namastecopy2] INFO: Spider closed (finished)

Thanks.-TM

like image 435
user818190 Avatar asked Jun 27 '11 22:06

user818190


People also ask

How do you convert relative path to absolute path in Python?

Use abspath() to Get the Absolute Path in Pythonpath property. To get the absolute path using this module, call path. abspath() with the given path to get the absolute path. The output of the abspath() function will return a string value of the absolute path relative to the current working directory.

How do you convert to absolute path?

The absolutePath function works by beginning at the starting folder and moving up one level for each "../" in the relative path. Then it concatenates the changed starting folder with the relative path to produce the equivalent absolute path.

Why is it better to use relative paths instead of absolute paths?

Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. Absolute links are good when you want to send the user to a page that is outside of your server.

How do you know if a path is absolute or relative in Python?

isabs() method in Python is used to check whether the specified path is an absolute path or not. On Unix platforms, an absolute path begins with a forward slash ('/') and on Windows it begins with a backward slash ('\') after removing any potential drive letter.


1 Answers

From Scrapy docs:

def parse(self, response):
    # ... code ommited
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)

that is, response object has a method to do exactly this.

like image 93
NefariousOctopus Avatar answered Nov 04 '22 08:11

NefariousOctopus