Django relations with Scrapy: how are items saved?

Question

I just need to understand how I can detect, from within the spider, whether Scrapy has saved an item. I'm fetching items from a site and then fetching the comments on each item, so I have to save the item first and save its comments afterwards. But when I write code after the yield, I get this error:

save() prohibited to prevent data loss due to unsaved related object ''.

And this is my code

def parseProductComments(self, response):

        name = response.css('h1.product-name::text').extract_first()
        price = response.css('span[id=offering-price] > span::text').extract_first()
        node = response.xpath("//script[contains(text(),'var utagData = ')]/text()")
        data = node.re('= (\{.+\})')[0]  #data = xpath.re(" = (\{.+\})")
        data = json.loads(data)

        barcode = data['product_barcode']

        objectImages = []
        for imageThumDiv in response.css('div[id=productThumbnailsCarousel]'):
            images = imageThumDiv.xpath('img/@data-src').extract()
            for image in images:
                imageQuality = image.replace('/80/', '/500/')
                objectImages.append(imageQuality)
        company = Company.objects.get(pk=3)
        comments = []
        item = ProductItem(name=name, price=price, barcode=barcode, file_urls=objectImages, product_url=response.url,product_company=company, comments = comments)
        yield item
        print item["pk"]
        for commentUl in response.css('ul.chevron-list-container'):

            url = commentUl.css('span.link-more-results::attr(href)').extract_first()
            if url is not None:
                for commentLi in commentUl.css('li.review-item'):
                    comment = commentLi.css('p::text').extract_first()
                    commentItem = CommentItem(comment=comment, product=item.instance)

                    yield commentItem
            else:

                yield scrapy.Request(response.urljoin(url), callback=self.parseCommentsPages, meta={'item': item.instance})

And this is my pipeline.

def comment_to_model(item):
    model_class = getattr(item, 'Comment')
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

def get_comment_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product=model.product, comment=model.comment)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
        obj.save()

    return (obj, created)

def get_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product_company=model.product_company, barcode=model.barcode)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
        obj.save()

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()
    return destination


class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']
            item_model = item.instance
            model, created = get_or_create(item_model)
            #update_model(model, item_model)

            if created:
                for image in item['files']:
                    imageItem = ProductImageItem(image=image['path'], product=item.instance)
                    imageItem.save()
                # for comment in item['comments']:
                #     commentItem = CommentItem(comment=comment, product= item.instance)
                #     commentItem.save()
            return item
        if isinstance(item, CommentItem):
            comment_to_model = item.instance
            model, created = get_comment_or_create(comment_to_model)
            if created:
                print model
            else:
                print created
            return item

e4c5 · Accepted Answer

Get or Create

A large part of your code seems to deal with an apparent weakness of get_or_create:

# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.

Fortunately, this apparent shortcoming is easily overcome thanks to the defaults parameter of get_or_create. As the Django documentation puts it:

Any keyword arguments passed to get_or_create() — except an optional one called defaults — will be used in a get() call. If an object is found, get_or_create() returns a tuple of that object and False. If multiple objects are found, get_or_create raises MultipleObjectsReturned. If an object is not found, get_or_create() will instantiate and save a new object, returning a tuple of the new object and True.
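To make the split between lookup arguments and defaults concrete, here is a minimal framework-free sketch that mimics the behaviour quoted above. It is not Django's implementation: the rows list, the exception class and the field values are all made up for illustration.

```python
# Framework-free sketch of the get_or_create() semantics quoted above.
# `rows` stands in for a database table; this is NOT Django's implementation.


class MultipleObjectsReturned(Exception):
    """Raised when the lookup matches more than one row."""


def get_or_create(rows, defaults=None, **lookup):
    # Only the lookup kwargs are used to find an existing row.
    matches = [row for row in rows
               if all(row.get(key) == value for key, value in lookup.items())]
    if len(matches) > 1:
        raise MultipleObjectsReturned(lookup)
    if matches:
        return matches[0], False
    # Nothing found: build the new row from the lookup plus the defaults.
    new_row = dict(lookup)
    new_row.update(defaults or {})
    rows.append(new_row)
    return new_row, True


products = []  # hypothetical in-memory "table"
first, created_first = get_or_create(
    products, company=3, barcode="123", defaults={"price": "9.99"})
# Same lookup, different price: the existing row is returned unchanged.
second, created_second = get_or_create(
    products, company=3, barcode="123", defaults={"price": "12.50"})
```

The second call finds the row by company and barcode alone, so a changed price does not create a duplicate. It does not update the stored price either.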

Update or Create

Still not convinced that get_or_create is the right tool for the job? Neither am I. There is something even better: update_or_create.

A convenience method for updating an object with the given kwargs, creating a new one if necessary. The defaults is a dictionary of (field, value) pairs used to update the object.
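The difference from get_or_create can be sketched the same way. Again, this is only a framework-free imitation of the documented behaviour, with a hypothetical rows list and invented field values, not Django's own code.

```python
# Framework-free sketch of the update_or_create() semantics quoted above.
# `rows` stands in for a database table; this is NOT Django's implementation.


def update_or_create(rows, defaults=None, **lookup):
    defaults = defaults or {}
    # As with get_or_create, only the lookup kwargs identify the row.
    matches = [row for row in rows
               if all(row.get(key) == value for key, value in lookup.items())]
    if len(matches) > 1:
        # Simplification: this sketch just raises ValueError here.
        raise ValueError("lookup matched more than one row")
    if matches:
        # Found: overwrite the fields listed in defaults.
        matches[0].update(defaults)
        return matches[0], False
    # Not found: create a row from the lookup plus the defaults.
    new_row = dict(lookup)
    new_row.update(defaults)
    rows.append(new_row)
    return new_row, True


products = [{"company": 3, "barcode": "123", "price": "9.99"}]
row, created = update_or_create(
    products, company=3, barcode="123", defaults={"price": "12.50"})
```

Here the existing row is kept and its price is overwritten, which is the "update an existing object" behaviour the comment in your code was asking for.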

But I am not going to dwell on the use of update_or_create, because the lines in your code that attempt to update your model have been commented out and you have not clearly stated what you want to update.

The new pipeline

Using the standard API methods, the module that contains your pipeline reduces to just the ProductItemPipeline class, which can be modified as follows:

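class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']

            # Look up by company and barcode only. The values in `defaults`
            # are used only when a new row has to be created; the fields
            # listed here are assumptions, adjust them to your model.
            model, created = ProductItem.django_model.objects.get_or_create(
                product_company=item['product_company'],
                barcode=item['barcode'],
                defaults={'name': item['name'], 'price': item['price']})

            if created:
                for image in item['files']:
                    # Point the image at the saved model, not at item.instance.
                    imageItem = ProductImageItem(image=image['path'], product=model)
                    imageItem.save()
            return item

        if isinstance(item, CommentItem):
            # item['product'] has to be a saved product at this point.
            model, created = CommentItem.django_model.objects.get_or_create(
                product=item['product'], comment=item['comment'])

            if created:
                print(model)
            else:
                print(created)
            return item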

Bug in the original code

I believe this is where the bug was:

  obj = model_class.objects.get(product=model.product, comment=model.comment)

We are no longer using that call, so the bug should disappear. If you still have problems, please paste the full traceback.
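For context, the message in the question is the one Django raises from save() when a foreign key points at an instance that has no primary key yet, and the product passed to CommentItem via item.instance appears to be such an unsaved instance. The sketch below imitates that check without Django. FakeModel and its fields are hypothetical stand-ins, not real Django or Scrapy classes.

```python
# Framework-free imitation of Django's "unsaved related object" check.
# FakeModel is a hypothetical stand-in, not a real Django class.


class FakeModel(object):
    """Minimal model-like object; pk stays None until save() is called."""
    _next_pk = 1

    def __init__(self, **fields):
        self.pk = None
        self.fields = fields

    def save(self):
        # Refuse to save while any related object has no primary key.
        for name, value in self.fields.items():
            if isinstance(value, FakeModel) and value.pk is None:
                raise ValueError(
                    "save() prohibited to prevent data loss due to "
                    "unsaved related object '%s'." % name)
        if self.pk is None:
            self.pk = FakeModel._next_pk
            FakeModel._next_pk += 1


product = FakeModel(barcode="123")
comment = FakeModel(comment="Great", product=product)

try:
    comment.save()  # fails: product has no pk yet
    error = None
except ValueError as exc:
    error = str(exc)

product.save()  # save the parent first ...
comment.save()  # ... then the child saves cleanly
```

In other words, whichever way the comment is stored, the product it refers to has to be saved first.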

Tags: python, django, scrapy, scrapy-spider, scrapy-pipeline

Murat Kaya
