I am trying to build a Django app that functions something like a store. Items are scraped from around the internet and continuously update the Django project's database over time (say, every few days). I am using the Scrapy framework to perform the scraping, and while there is an experimental DjangoItem feature, I would rather stay away from it because it is unstable.
Right now my plan is to create XML files of crawled items with Scrapy's XmlItemExporter (docs here), then import them into the Django project with loaddata as XML fixtures (docs here). This seems reasonable because, if one of the two processes breaks, there is a file intermediary between them. Modularizing the application as a whole also doesn't seem like a bad idea.
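As a rough sketch of the export half, a pipeline built on Scrapy's exporter could look like the following; this is an illustration under assumptions (the output filename items.xml is a placeholder), not a tested setup:

    # A minimal Scrapy pipeline that serializes every crawled item to one XML file.
    # In older Scrapy versions the import path is scrapy.contrib.exporter instead.
    from scrapy.exporters import XmlItemExporter

    class XmlExportPipeline(object):
        def open_spider(self, spider):
            self.file = open('items.xml', 'wb')
            self.exporter = XmlItemExporter(self.file)
            self.exporter.start_exporting()

        def close_spider(self, spider):
            self.exporter.finish_exporting()
            self.file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

One caveat worth checking: loaddata expects Django's own fixture schema (<django-objects> / <object> elements), not arbitrary XML, so the exporter output would have to be shaped or post-processed into that format before

    python manage.py loaddata items.xml

will actually import anything.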
Some concerns are:

- loaddata
- The existence of the experimental DjangoItem suggests that Scrapy + Django is a popular enough choice for there to be a good solution here.
I would greatly appreciate any solutions, advice, or wisdom on this matter.
This question is a bit old already, but I'm currently dealing with proper integration of Django + Scrapy as well. My workflow is the following: I set up Scrapy as a Django management command, as described in this answer. Then I write a simple Scrapy pipeline that saves each scraped item into Django's database using Django's QuerySet methods. That's all. I'm currently using SQLite for the database, and it works like a charm. Maybe this is still helpful for someone.
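For reference, a minimal sketch of that workflow, assuming the Scrapy project is importable from the Django project and the settings are already wired up; the spider name "myspider", the app "myapp", and the Product model with its fields are placeholders, not part of the original answer. The management command just hands control to Scrapy:

    # myapp/management/commands/crawl.py
    from django.core.management.base import BaseCommand
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class Command(BaseCommand):
        help = "Run the Scrapy spider from within Django."

        def handle(self, *args, **options):
            process = CrawlerProcess(get_project_settings())
            process.crawl('myspider')  # spider name as registered in the Scrapy project
            process.start()            # blocks until the crawl is finished

and a pipeline then writes each item through the ORM; update_or_create is one way to keep a crawl that runs every few days from inserting duplicate rows:

    # Scrapy pipeline persisting items with Django's QuerySet methods.
    # Django is already configured here because the crawl runs inside manage.py.
    from myapp.models import Product

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            Product.objects.update_or_create(
                url=item['url'],  # natural key for deduplication (assumed field)
                defaults={'name': item['name'], 'price': item['price']},
            )
            return item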