To further elaborate on the heading for this question: I am scraping info off of movie website. I currently have a MySQL
database populated with movie titles
, movie urls
, etc. I'm now going to take those urls
from the database and set them as my start_urls
within a new spider
. Each url
is a link to [insert abritrary movie]'s webpage, much more information is conveyed. The information I'm interested in is:
Of these, distributor, rating, director and genre will have one 'thing' associated with them from each movie webpage (one rating, one director, etc). There will be multiple, of course, multiple actors and, depending, multiple producers (bigger name films/most films). This is where I'm having an issue. I want to establish a pipeline' which puts each piece of info in an appropriate
tablewithin my
MySQLdatabase. So, a table for director, a table for rating, etc. Each table will also have
movie title`. I can state the problem itself thusly:
I'm having trouble reconciling how to construct an appropriate pipeline
with an appropriate spider
. I'm not sure whether I can return multiple things from one spider and send them to different pipelines
(create different items to deal with single
attributes, and a different item to deal with 'multiple' attributes) or whether to use the same pipeline and somehow specify what goes where (not sure if I can only return one thing after scraping). I will show my code and hopefully the issue will become clearer. *Note: it is not yet complete- I'm just trying to fill in the blanks with how to do this
Spider:
class ActorSpider(BaseSpider):
import sys; sys.path.append("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages")
import MySQLdb
db = MySQLdb.connect(db = 'testdb', user='testuser', passwd='test')
dbc = db.cursor()
name = 'ActorSpider'
allowed_domains = ['movie website']
#start_urls = #HAVE NOT FILLED THIS IN YET- WILL BE A SELECT STATEMENT, GATHERING ALL URLS
def parse(self, response):
hxs = HtmlXPathSelector(response)
#Expect only singular items (ie. one title, one rating, etc.)
single_info = SingleItem()
title = hxs.select('[title tags here]').extract()
distributor = hxs.select('[distributor tags here]').extract()
rating = hxs.select('[rating tags here]').extract()
director = hxs.select('[director tags here]').extract()
genre = hxs.select('[genre tags here]').extract()
single_items = []
single_info['title'] = title
single_info['distributor'] = distributor
single_info['rating'] = rating
single_info['director'] = director
single_info['genre'] = genre
single_items.append(single_info) #Note: not sure if I want to return this or the single info
#return single_items
#Multiple items in a field
multi_info = MultiItem()
actors = hxs.select('[actor tags here]').extract()
producers = hxs.select('[producer tags here]').extract()
actor_items= []
for i in range(len(actors)):
multi_info['title'] = title
multi_info['actor'] = actors[i]
actor_items.append(multi_info)
#return actor_items - can I have multiple returns in my code to specify which pipeline is used, or which table this should be inserted into
producer_items = []
for i in range(len(producers)):
multi_info['title'] = title
multi_info['producer'] = producers[i]
producer_items.append(multi_info)
#return producer_items - same issue - are multiple returns allowed? Should I try to put both the 'single items' and 'multiple items' in on big 'items' list? Can scrapy figure that out or how would I go about specifying?
I've commented in a number of questions which may be unclear- I'm not sure how to direct everything so that it ends up in the appropriate table. This may be more clear when you read the pipeline, which is:
class IndMoviePipeline(object):
def __init__(self):
'initiate the database connnection'
self.conn = MySQLdb.connect(user='testuser', passwd='test', db='testdb', host='localhost', charset='utf8', use_unicode=True)
self.cursor = self.conn.cursor()
def process_item(self, item, spider):
try:
if 'producer' in item:
self.cursor.execute("""INSERT INTO Producers (title, producer) VALUES (%s, %s)""", (item['title'], item['producer']))
elif 'actor' in item:
self.cursor.execute("""INSERT INTO Actors (title, actor) VALUES (%s, %s)""", (item['title'], item['actor']))
else:
self.cursor.execute("""INSERT INTO Other_Info (title, distributor, rating, director, genre) VALUES (%s, %s, %s, %s, %s)""", (item['title'], item['distributor'], item['rating'], item['director'], item['genre'])) #NOTE: I will likely change 'Other_Info' table to just populating the original table from which the URLS will be pulled
self.conn.commit()
except MySQLdb.Error, e:
print "Error %d: %s" % (e.args[0], e.args[1])
return item
I think that will work to direct the item
to the appropriate table
within the database. Based on this, I think it would work to have one big list of items
and append everything to it, so:
items = []
items.append(single_info)
for i in range(len(producers)):
multi_info['title'] = title
multi_info['producer'] = producers[i]
items.append(multi_info)
for i in range(len(actors)):
multi_info['title'] = title
multi_info['actor'] = actors[i]
items.append(multi_info)
Just letting the pipeline
sort this all out with those if
statements. I'm not sure, though, if this is the best way to do this and would really appreciate suggestions.
Conceptually, scrapy Items generally refer to a single "thing" being scraped (in your case, a movie) and have Fields that represent the data that make up this "thing". So consider having:
class MovieItem(scrapy.item.Item):
title = Field()
director = Field()
actors = Field()
Then when you scrape the items:
item = MovieItem()
title = hxs.select('//some/long/xpath').extract()
item['title'] = title
actors = hxs.select('//some/long/xpath').extract()
item['actors'] = actors
return item
Spider parse methods should always return or yield either scrapy.item.Item objects or scrapy.http.Request objects.
From there, how you process the MovieItems is up to you. You could have a pipeline for each property of the MovieItem, but it's not recommended. What I would recommend instead is having a single MySQLPersistancePipeline object which has methods for persisting each of the fields of the MovieItem. So something like:
class MySQLPersistancePipeline(object):
...
def persist_producer(self, item):
self.cursor.execute('insert into producers ...', item['producer'])
def persist_actors(self, item):
for actor in item['actors']:
self.cursor.execute('insert into actors ...', actor)
def process_item(self, item, spider):
persist_producer(item)
persist_actors(item)
return item
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With