
Scrapy - Select specific link based on text

This should be easy but I'm stuck.

<div class="paginationControl">
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text 2</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=3&amp;powerunit=2">Link Text 3</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=4&amp;powerunit=2">Link Text 4</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&amp;powerunit=2">Link Text 5</a> |   

<!-- Next page link --> 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text Next ></a>
</div>

I'm trying to use Scrapy (BaseSpider) to select a link based on its link text using:

nextPage = HtmlXPathSelector(response).select("//div[@class='paginationControl']/a/@href").re("(.+)*?Next")

For example, I want to select the next-page link based on the fact that its text is "Link Text Next". Any ideas?

hoof_hearted asked Aug 27 '12 15:08


3 Answers

You can use the following XPath expression:

//div[@class='paginationControl']/a[text()="Link Text Next"]/@href

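This selects the href attribute of the link whose text is exactly "Link Text Next". Because text()= compares the whole string, any trailing characters or whitespace in the link text must be included in the expression.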

See XPath string functions if you need more control.
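
A minimal sketch of the exact-match idea that runs without Scrapy. It assumes the snippet from the question parses as XML and uses the standard library's xml.etree.ElementTree, whose limited XPath subset has no text() function; the [.='...'] predicate (Python 3.7+) stands in for it here. The <body> wrapper and the variable names are illustrative, and the full link text as it appears in the markup ("Link Text Next >") is used because the comparison is exact.

```python
import xml.etree.ElementTree as ET

# Markup from the question, trimmed to two links and wrapped in <body>
# (the wrapper is an assumption so the div is a descendant of the root).
SNIPPET = """<body><div class="paginationControl">
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&amp;powerunit=2">Link Text 5</a> |
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text Next ></a>
</div></body>"""

root = ET.fromstring(SNIPPET)

# [.='...'] matches an element whose complete text equals the given string.
next_link = root.find(
    ".//div[@class='paginationControl']/a[.='Link Text Next >']"
)
next_href = next_link.get("href")

print(next_href)
# /en/overview/0-All_manufactures/0-All_models.html?page=2&powerunit=2
```

In Scrapy itself, the XPath expression from the answer would typically be passed to response.xpath(...) rather than to ElementTree.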

icecrime answered Oct 05 '22 03:10


Use a[contains(text(),'Link Text Next')]:

nextPage = HtmlXPathSelector(response).select(
    "//div[@class='paginationControl']/a[contains(text(),'Link Text Next')]/@href")

Reference: Documentation on the XPath contains function


PS. Your link text "Link Text Next " has a space at the end. An exact match would have to include that space in the code:

text()="Link Text Next "

To avoid that, I think using contains is a bit more general while still being specific enough.
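
The difference between an exact and a substring match can be sketched in plain Python. This is only an analogue of XPath contains(), not the XPath function itself: the standard library's ElementTree does not support contains(), so the filtering is done with Python's `in` operator. The shortened href values and the trailing space in the link text are hypothetical, chosen to mirror the situation described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of the markup where the link text ends with a space.
SNIPPET = """<body><div class="paginationControl">
  <a href="?page=5">Link Text 5</a> |
  <a href="?page=2">Link Text Next </a>
</div></body>"""

root = ET.fromstring(SNIPPET)
links = root.findall(".//div[@class='paginationControl']/a")

# Exact comparison misses the link because of the trailing space.
exact = [a.get("href") for a in links if a.text == "Link Text Next"]

# Substring test (the Python analogue of contains()) still finds it.
partial = [a.get("href") for a in links if "Link Text Next" in (a.text or "")]

print(exact)    # []
print(partial)  # ['?page=2']
```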

unutbu answered Oct 05 '22 05:10


Your XPath selects the href, not the text of the a tag. Judging from your example, the href doesn't contain "Next", so a regular expression applied to it can't find it.
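
To illustrate the point, here is a small sketch using only the standard library (ElementTree and re), assuming the question's snippet parses as XML. The <body> wrapper and variable names are illustrative. It runs the same pattern first against the href values and then against the link text.

```python
import re
import xml.etree.ElementTree as ET

# Markup from the question, trimmed to two links and wrapped in <body>.
SNIPPET = """<body><div class="paginationControl">
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&amp;powerunit=2">Link Text 5</a> |
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text Next ></a>
</div></body>"""

links = ET.fromstring(SNIPPET).findall(".//div[@class='paginationControl']/a")

# Searching the href values: no URL contains "Next", so nothing matches.
from_hrefs = [a.get("href") for a in links if re.search(r"Next", a.get("href"))]

# Searching the link text instead, then reading the href of the match.
from_text = [a.get("href") for a in links if re.search(r"Next", a.text or "")]

print(from_hrefs)  # []
print(from_text)
# ['/en/overview/0-All_manufactures/0-All_models.html?page=2&powerunit=2']
```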

Andrew Cox answered Oct 05 '22 03:10