 

Scrapy: Suggestion for Multiple Returns/Items to Database

To further elaborate on the heading for this question: I am scraping info off of a movie website. I currently have a MySQL database populated with movie titles, movie URLs, etc. I'm now going to take those URLs from the database and set them as my start_urls within a new spider. Each URL is a link to [insert arbitrary movie]'s webpage, where much more information is conveyed. The information I'm interested in is:

  • distributor (e.g. Fox)
  • rating (e.g. PG-13)
  • director
  • genre (e.g. comedy)
  • actors
  • producer(s)

Of these, distributor, rating, director, and genre will have one 'thing' associated with them from each movie webpage (one rating, one director, etc.). There will, of course, be multiple actors and, depending on the film, multiple producers (bigger-name films/most films). This is where I'm having an issue. I want to establish a pipeline which puts each piece of info in an appropriate table within my MySQL database. So, a table for director, a table for rating, etc. Each table will also have `movie title`. I can state the problem itself thusly:

I'm having trouble reconciling how to construct an appropriate pipeline with an appropriate spider. I'm not sure whether I can return multiple things from one spider and send them to different pipelines (creating different items to deal with 'single' attributes and a different item to deal with 'multiple' attributes), or whether to use the same pipeline and somehow specify what goes where (I'm not sure if I can only return one thing after scraping). I will show my code and hopefully the issue will become clearer. *Note: it is not yet complete; I'm just trying to fill in the blanks with how to do this.*

Spider:

    import sys; sys.path.append("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages")
    import MySQLdb

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    # SingleItem and MultiItem are the (not yet shown) item classes from items.py


    class ActorSpider(BaseSpider):
        name = 'ActorSpider'
        allowed_domains = ['movie website']
        #start_urls = #HAVE NOT FILLED THIS IN YET- WILL BE A SELECT STATEMENT, GATHERING ALL URLS

        db = MySQLdb.connect(db='testdb', user='testuser', passwd='test')
        dbc = db.cursor()

        def parse(self, response):
            hxs = HtmlXPathSelector(response)

            # Expect only singular items (i.e. one title, one rating, etc.)
            single_info = SingleItem()
            title = hxs.select('[title tags here]').extract()
            single_info['title'] = title
            single_info['distributor'] = hxs.select('[distributor tags here]').extract()
            single_info['rating'] = hxs.select('[rating tags here]').extract()
            single_info['director'] = hxs.select('[director tags here]').extract()
            single_info['genre'] = hxs.select('[genre tags here]').extract()

            single_items = []
            single_items.append(single_info)  # Note: not sure if I want to return this or single_info

            #return single_items

            # Multiple items in a field
            actors = hxs.select('[actor tags here]').extract()
            producers = hxs.select('[producer tags here]').extract()

            actor_items = []
            for actor in actors:
                multi_info = MultiItem()  # a fresh item per actor, so each append is a distinct object
                multi_info['title'] = title
                multi_info['actor'] = actor
                actor_items.append(multi_info)

            #return actor_items - can I have multiple returns in my code to specify which pipeline
            # is used, or which table this should be inserted into?

            producer_items = []
            for producer in producers:
                multi_info = MultiItem()
                multi_info['title'] = title
                multi_info['producer'] = producer
                producer_items.append(multi_info)

            #return producer_items - same issue - are multiple returns allowed? Should I try to put
            # both the 'single items' and 'multiple items' in one big 'items' list? Can Scrapy
            # figure that out, or how would I go about specifying?
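
On the start_urls comment above: one possible sketch for feeding the spider from the database is to override start_requests() and yield a Request per stored URL. This is only an assumption about the schema — the movies table and url column are placeholder names — and it needs from scrapy.http import Request at the top of the module:

    from scrapy.http import Request

    # inside ActorSpider, alongside parse():
    def start_requests(self):
        # 'movies'/'url' are placeholders for whatever table and column hold the stored URLs.
        self.dbc.execute("SELECT url FROM movies")
        for (url,) in self.dbc.fetchall():
            yield Request(url, callback=self.parse)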

I've raised a number of questions in the code comments which may be unclear; I'm not sure how to direct everything so that it ends up in the appropriate table. This may be clearer when you read the pipeline, which is:

    import MySQLdb


    class IndMoviePipeline(object):

        def __init__(self):
            'initiate the database connection'
            self.conn = MySQLdb.connect(user='testuser', passwd='test', db='testdb',
                                        host='localhost', charset='utf8', use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            try:
                if 'producer' in item:
                    self.cursor.execute("""INSERT INTO Producers (title, producer) VALUES (%s, %s)""",
                                        (item['title'], item['producer']))
                elif 'actor' in item:
                    self.cursor.execute("""INSERT INTO Actors (title, actor) VALUES (%s, %s)""",
                                        (item['title'], item['actor']))
                else:
                    # NOTE: I will likely change the 'Other_Info' table to just populate the
                    # original table from which the URLs will be pulled
                    self.cursor.execute("""INSERT INTO Other_Info (title, distributor, rating, director, genre) VALUES (%s, %s, %s, %s, %s)""",
                                        (item['title'], item['distributor'], item['rating'], item['director'], item['genre']))
                self.conn.commit()
            except MySQLdb.Error, e:
                print "Error %d: %s" % (e.args[0], e.args[1])

            return item
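
One detail to keep in mind: Scrapy only runs pipelines that are enabled in settings.py, so whatever pipeline you settle on needs an entry there. A minimal sketch, where the project package name indmovie is a placeholder (newer Scrapy versions expect a dict mapping the class path to an order number instead of a list):

    # settings.py -- 'indmovie' is a placeholder for your actual project package
    ITEM_PIPELINES = ['indmovie.pipelines.IndMoviePipeline']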

I think that will work to direct the item to the appropriate table within the database. Based on this, I think it would work to have one big list of items and append everything to it, so:

    items = []
    items.append(single_info)

    for producer in producers:
        multi_info = MultiItem()  # again, a fresh item per append
        multi_info['title'] = title
        multi_info['producer'] = producer
        items.append(multi_info)

    for actor in actors:
        multi_info = MultiItem()
        multi_info['title'] = title
        multi_info['actor'] = actor
        items.append(multi_info)

Just letting the pipeline sort this all out with those if statements. I'm not sure, though, if this is the best way to do this and would really appreciate suggestions.
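
For what it's worth, one big list works because a spider's parse may return any iterable of items, but the same logic reads more naturally as a generator. A minimal sketch of the yield version, using the same placeholder items as above:

    # Equivalent, without the list bookkeeping: yield each item as it is built.
    yield single_info
    for producer in producers:
        multi_info = MultiItem()
        multi_info['title'] = title
        multi_info['producer'] = producer
        yield multi_info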

asked Aug 21 '13 by DMML
1 Answer

Conceptually, Scrapy Items generally refer to a single "thing" being scraped (in your case, a movie) and have Fields that represent the pieces of data that make up this "thing". So consider having:

from scrapy.item import Item, Field

class MovieItem(Item):
  title = Field()
  director = Field()
  actors = Field()

Then when you scrape the items:

item = MovieItem()

title = hxs.select('//some/long/xpath').extract()  # extract() returns a list of matching strings
item['title'] = title

actors = hxs.select('//some/long/xpath').extract()
item['actors'] = actors

return item

Spider parse methods should always return or yield either scrapy.item.Item objects or scrapy.http.Request objects.
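
To make that concrete, here is a small hypothetical sketch (placeholder XPaths) showing that a single parse call can yield both items and follow-up requests:

from scrapy.http import Request

# inside a spider such as the one above:
def parse(self, response):
    hxs = HtmlXPathSelector(response)

    item = MovieItem()
    item['title'] = hxs.select('//placeholder/xpath').extract()
    yield item  # sent through the item pipelines

    for href in hxs.select('//placeholder/links/@href').extract():
        yield Request(href, callback=self.parse)  # scheduled as further crawling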

From there, how you process the MovieItems is up to you. You could have a pipeline for each property of the MovieItem, but that's not recommended. What I would recommend instead is a single MySQLPersistencePipeline object with methods for persisting each of the fields of the MovieItem. So something like:

class MySQLPersistencePipeline(object):
  ...
  def persist_producer(self, item):
    self.cursor.execute('insert into producers ...', item['producer'])

  def persist_actors(self, item):
    for actor in item['actors']:
      self.cursor.execute('insert into actors ...', actor)

  def process_item(self, item, spider):
    self.persist_producer(item)  # note the self. -- these are instance methods
    self.persist_actors(item)
    return item
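
One addition worth considering (my suggestion, not part of the original answer): commit and close the connection when the crawl finishes. Scrapy calls a pipeline's close_spider method at shutdown, so a sketch along these lines would fit into the class above:

def close_spider(self, spider):
    # Called once when the spider finishes; flush pending writes and release the connection.
    self.conn.commit()
    self.conn.close()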
answered by audiodude