I just need to understand How can I detect whether scrapy saved and item in spider ? I'm fetching items from a site and after that I'm fetching comments on that item. So first I have to save the item after that I'll save comments. But when I'm writing code after yield it's giving me this error.
save() prohibited to prevent data loss due to unsaved related object ''.
And this is my code
def parseProductComments(self, response):
name = response.css('h1.product-name::text').extract_first()
price = response.css('span[id=offering-price] > span::text').extract_first()
node = response.xpath("//script[contains(text(),'var utagData = ')]/text()")
data = node.re('= (\{.+\})')[0] #data = xpath.re(" = (\{.+\})")
data = json.loads(data)
barcode = data['product_barcode']
objectImages = []
for imageThumDiv in response.css('div[id=productThumbnailsCarousel]'):
images = imageThumDiv.xpath('img/@data-src').extract()
for image in images:
imageQuality = image.replace('/80/', '/500/')
objectImages.append(imageQuality)
company = Company.objects.get(pk=3)
comments = []
item = ProductItem(name=name, price=price, barcode=barcode, file_urls=objectImages, product_url=response.url,product_company=company, comments = comments)
yield item
print item["pk"]
for commentUl in response.css('ul.chevron-list-container'):
url = commentUl.css('span.link-more-results::attr(href)').extract_first()
if url is not None:
for commentLi in commentUl.css('li.review-item'):
comment = commentLi.css('p::text').extract_first()
commentItem = CommentItem(comment=comment, product=item.instance)
yield commentItem
else:
yield scrapy.Request(response.urljoin(url), callback=self.parseCommentsPages, meta={'item': item.instance})
And this is my pipeline.
def comment_to_model(item):
model_class = getattr(item, 'Comment')
if not model_class:
raise TypeError("Item is not a `DjangoItem` or is misconfigured")
def get_comment_or_create(model):
model_class = type(model)
created = False
# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.
#
# Instead, we do the two steps separately
try:
# We have no unique identifier at the moment; use the name for now.
obj = model_class.objects.get(product=model.product, comment=model.comment)
except model_class.DoesNotExist:
created = True
obj = model # DjangoItem created a model for us.
obj.save()
return (obj, created)
def get_or_create(model):
model_class = type(model)
created = False
# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.
#
# Instead, we do the two steps separately
try:
# We have no unique identifier at the moment; use the name for now.
obj = model_class.objects.get(product_company=model.product_company, barcode=model.barcode)
except model_class.DoesNotExist:
created = True
obj = model # DjangoItem created a model for us.
obj.save()
return (obj, created)
def update_model(destination, source, commit=True):
pk = destination.pk
source_dict = model_to_dict(source)
for (key, value) in source_dict.items():
setattr(destination, key, value)
setattr(destination, 'pk', pk)
if commit:
destination.save()
return destination
class ProductItemPipeline(object):
def process_item(self, item, spider):
if isinstance(item, ProductItem):
item['cover_photo'] = item['files'][0]['path']
item_model = item.instance
model, created = get_or_create(item_model)
#update_model(model, item_model)
if created:
for image in item['files']:
imageItem = ProductImageItem(image=image['path'], product=item.instance)
imageItem.save()
# for comment in item['comments']:
# commentItem = CommentItem(comment=comment, product= item.instance)
# commentItem.save()
return item
if isinstance(item, CommentItem):
comment_to_model = item.instance
model, created = get_comment_or_create(comment_to_model)
if created:
print model
else:
print created
return item
A large part of your code seems to deal with an apparent weakness of get_or_create
# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.
Fortunately this apparent short coming can be easily overcome. Thanks to the default parameter of get_or_create
Any keyword arguments passed to get_or_create() — except an optional one called defaults — will be used in a get() call. If an object is found, get_or_create() returns a tuple of that object and False. If multiple objects are found, get_or_create raises MultipleObjectsReturned. If an object is not found, get_or_create() will instantiate and save a new object, returning a tuple of the new object and True.
Still not convinced that get_or_create is the right man for the job? I am not either. There is something even better. update_or_create!!
A convenience method for updating an object with the given kwargs, creating a new one if necessary. The defaults is a dictionary of (field, value) pairs used to update the object.
But I am not going to dwell on the user of update_or_create because lines in your code that attempt to update your model have been commented out and you have not clearly state what you want to update.
Using the standard API methods, your module that contains your pipeline just reduces to the ProductItemPipeline class. And that can be modified
class ProductItemPipeline(object):
def process_item(self, item, spider):
if isinstance(item, ProductItem):
item['cover_photo'] = item['files'][0]['path']
model, created = ProductItem.get_or_create(product_company=item['product_company'], barcode=item['bar_code'],
defaults={'Other_field1': value1, 'Other_field2': value2})
if created:
for image in item['files']:
imageItem = ProductImageItem(image=image['path'], product=item.instance)
imageItem.save()
return item
if isinstance(item, CommentItem):
model, created = CommentItem.get_or_create(field1=value1, defaults={ other fields go in here'})
if created:
print model
else:
print created
return item
I do believe this to be the place where the bug existed.
obj = model_class.objects.get(product=model.product, comment=model.comment)
Now we are not using that so the bug should disappear. If you still have problems please paste the full traceback.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With