How to collect data from multiple pages into single data structure with scrapy

I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of fields. For example, people with names, ages, and occupations.

My problem is that this data is split across two levels of the website.
The first page is, say, a list of names and ages, with a link to each person's profile page.
The profile page lists that person's occupation.

I already have a spider written with Scrapy in Python that can collect the data from the top layer and crawl through multiple paginated pages.
But how can I collect the data from the inner pages while keeping it linked to the appropriate object?

Currently, I have the output structured as JSON:

   [{"name": "name", "age": "age", "occupation": "occupation"},
    {"name": "name", "age": "age", "occupation": "occupation"}, ...]

Can the parse function reach across pages like that?

asked Feb 14 '13 by user2071236
1 Answer

Here is the way to deal with this: yield/return the item only once, when it has all of its attributes. Pass the partially filled item along with the request for the inner page, and finish it in that request's callback:

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# inside your first-level parse callback:
yield Request(page1, callback=self.page1_data)

def page1_data(self, response):
    hxs = HtmlXPathSelector(response)
    # fill in the fields available on the listing page
    i = TestItem()
    i['name'] = 'name'
    i['age'] = 'age'
    url_profile_page = 'url to the profile page'

    # pass the partially filled item along in the request's meta
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)

def profile_page(self, response):
    hxs = HtmlXPathSelector(response)
    old_item = response.request.meta['item']
    # parse the remaining fields (e.g. occupation)
    # and assign them to old_item

    yield old_item
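
The snippet above uses the old HtmlXPathSelector API, which has since been removed from Scrapy. As a point of comparison, here is a minimal, self-contained sketch of the same pattern in more recent Scrapy, including the pagination step the question mentions. The start URL, CSS selectors, and field names are placeholders for illustration, not taken from the actual site:

import scrapy

class PeopleSpider(scrapy.Spider):
    name = "people"
    start_urls = ["https://example.com/people?page=1"]  # placeholder URL

    def parse(self, response):
        # first level: one row per person, with a link to the profile page
        for row in response.css("div.person"):  # placeholder selector
            item = {
                "name": row.css("span.name::text").get(),
                "age": row.css("span.age::text").get(),
            }
            profile_url = row.css("a::attr(href)").get()
            # carry the half-filled item into the profile request
            yield response.follow(profile_url,
                                  callback=self.parse_profile,
                                  meta={"item": item})

        # follow pagination on the listing pages
        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_profile(self, response):
        # second level: complete the item and yield it exactly once
        item = response.meta["item"]
        item["occupation"] = response.css("span.occupation::text").get()  # placeholder selector
        yield item

In current Scrapy versions you can also pass the item between callbacks with cb_kwargs instead of meta, which keeps callback arguments explicit; the idea is the same either way.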
answered Oct 23 '22 by akhter wahab