What is the best way to continuously export information from a Scrapy crawler to a Django application database? [duplicate]

I am trying to build a Django app that functions sort of like a store. Items are scraped from around the internet and used to update the Django project's database continuously over time (say, every few days). I am using the Scrapy framework to perform the scraping, and while there is an experimental DjangoItem feature, I would rather stay away from it because it is unstable.

Right now my plan is to create XML files of crawled items with Scrapy's XmlItemExporter (docs here) and load them into the Django project as XML fixtures with loaddata (docs here). This seems okay because if either of the two processes fails, there is a file intermediary between them. Modularizing the application as a whole also doesn't seem like a bad idea.
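For reference, the exporter side could be a pipeline along these lines; this is only a minimal sketch, the items.xml path is a placeholder, and recent Scrapy exposes the class as scrapy.exporters.XmlItemExporter (older releases had it under scrapy.contrib.exporter). One caveat: the exporter's default output is not Django's fixture schema, so the XML would still need reshaping before loaddata could consume it.

```python
# Minimal sketch of a pipeline that dumps every crawled item to XML.
# "items.xml" is a placeholder path; the pipeline must be enabled in settings.py.
from scrapy.exporters import XmlItemExporter

class XmlExportPipeline:
    def open_spider(self, spider):
        # Exporters expect a file opened in binary mode
        self.file = open('items.xml', 'wb')
        self.exporter = XmlItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```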

Some concerns are:

  • That these files might be too large to read into memory for Django's loaddata.
  • That I am spending too much time on this when there might be a better or easier solution, such as exporting directly to the database, which is MySQL in this case.
  • No one seems to have written about this process online, which is strange considering that Scrapy, in my opinion, is an excellent framework to plug into a Django app.
  • There is no definitive guide in Django's docs for manually creating fixtures - the documentation is geared more towards dumping and reloading fixtures from the app itself (a sketch of the format follows this list).
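For what it's worth, loaddata accepts the same XML that Django's serializer emits, so a hand-written fixture would look roughly like this (the store.item model and its fields are invented for illustration):

```xml
<?xml version="1.0" encoding="utf-8"?>
<django-objects version="1.0">
  <!-- "store.item" and the field names below are hypothetical -->
  <object pk="1" model="store.item">
    <field type="CharField" name="name">Example item</field>
    <field type="DecimalField" name="price">9.99</field>
  </object>
</django-objects>
```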

The existence of the experimental DjangoItem suggests that Scrapy + Django is a popular enough choice for there to be a good solution here.

I would greatly appreciate any solutions, advice, or wisdom on this matter.

asked Jul 29 '11 by emish

1 Answer

This question is a bit old already, but I'm currently dealing with the proper integration of Django + Scrapy as well. My workflow is the following: I've set up Scrapy as a Django management command, as described in this answer. Afterwards, I simply write a Scrapy pipeline that saves each scraped item into Django's database using Django's QuerySet methods. That's all. I'm currently using SQLite for the database and it works like a charm. Maybe this is still helpful for someone.
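A minimal sketch of such a pipeline, assuming the crawl runs inside a Django management command so settings and apps are already loaded; the myapp app, the Product model, and its fields are invented for illustration:

```python
# Sketch of a pipeline that writes items straight into Django's database.
# Assumes Django is already configured (e.g. the crawl runs inside a
# Django management command).
from myapp.models import Product  # hypothetical app and model

class DjangoWriterPipeline:
    def process_item(self, item, spider):
        # update_or_create keeps repeated crawls from inserting duplicates
        Product.objects.update_or_create(
            name=item['name'],
            defaults={'price': item['price']},
        )
        return item
```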

answered Oct 11 '22 by pemistahl