I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of fields: for example, people with names, ages, and occupations.
My problem is that this data is split across two levels in the website.
The first page is, say, a list of names and ages, with a link to each person's profile page.
Their profile page lists their occupation.
I already have a spider written with Scrapy in Python which can collect the data from the top layer and crawl through multiple paginations.
But how can I collect the data from the inner pages while keeping it linked to the appropriate object?
Currently, I have the output structured as JSON, like
[{"name": "name", "age": "age", "occupation": "occupation"},
 {"name": "name", "age": "age", "occupation": "occupation"}, ...]
Can the parse function reach across pages like that?
Here is the way to handle it: pass the partially populated item along with the request, and only yield/return the item once it has all of its attributes.
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# in your spider's parse(): request the listing page first
yield Request(page1,
              callback=self.page1_data)

def page1_data(self, response):
    hxs = HtmlXPathSelector(response)
    i = TestItem()
    i['name'] = 'name'  # extract the name from the listing page with hxs
    i['age'] = 'age'    # extract the age from the listing page with hxs
    url_profile_page = 'url to the profile page'
    # carry the partially filled item along in the request's meta
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)

def profile_page(self, response):
    hxs = HtmlXPathSelector(response)
    old_item = response.request.meta['item']
    # parse the other fields (e.g. occupation) with hxs
    # and assign them to old_item
    yield old_item
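
For completeness, the TestItem referenced above would be a regular Scrapy item; a minimal sketch, assuming the three fields from the question:

import scrapy

class TestItem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    occupation = scrapy.Field()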
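
Note that HtmlXPathSelector is from older Scrapy versions; in current Scrapy you can call response.xpath(...) directly, and since Scrapy 1.7 the documented way to pass data between callbacks is cb_kwargs rather than meta. A minimal sketch of the same hand-off with cb_kwargs (the XPath expressions are placeholders you would adapt to the actual pages):

def page1_data(self, response):
    item = TestItem()
    item['name'] = response.xpath('//td[1]/text()').get()  # placeholder XPath
    item['age'] = response.xpath('//td[2]/text()').get()   # placeholder XPath
    url_profile_page = response.xpath('//a/@href').get()   # placeholder XPath
    # cb_kwargs entries arrive as keyword arguments in the callback
    yield response.follow(url_profile_page,
                          callback=self.profile_page,
                          cb_kwargs={'item': item})

def profile_page(self, response, item):
    item['occupation'] = response.xpath('//p/text()').get()  # placeholder XPath
    yield item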