
R: XPath expression returns links outside of selected element

Tags: r, xpath

I am using R and XPath to scrape the links from the main table on the page loaded in the code below. The main table is the third table on the page, and I want only the links that point to magazine articles.

My code follows:

require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))

If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in my third line by asking object y to include only the third table.

What am I doing wrong? What is the correct/more efficient way to code this with XPath?

Note: XPath novice writing.

Answered (really quickly), thanks very much! My solution is below.

extract <- function(x) {
    message(x)
    html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
    html = xpathApply(html, "//table")[[3]]
    html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
    html = gsub("#ac_newscomment", "", html)
    html = unique(html)
}

d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)

This saves all links to news items with keyword 'Hadopi' on this website.

Asked May 18 '13 at 19:05 by Fr.


1 Answer

You need to start the expression with . if you want to restrict the search to the current node. An expression that starts with / (or //) is evaluated from the root of the document, even though y is only a subtree of that document.

xpathSApply(y, ".//a/@href" )
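To make the difference concrete, here is a minimal sketch on a made-up two-table document (the HTML, the ids, and the variable names are invented for illustration and are not taken from the Numerama page). It assumes the XML package is installed and compares //a with .//a when both are evaluated on the second table node.

```r
library(XML)

# Hypothetical page: a sidebar table followed by the main table.
html <- '<html><body>
  <table id="sidebar"><tr><td><a href="/magazine/side">side</a></td></tr></table>
  <table id="main"><tr><td><a href="/magazine/main">main</a></td></tr></table>
</body></html>'

doc  <- htmlParse(html, asText = TRUE)
main <- getNodeSet(doc, "//table")[[2]]   # the second table node

# "//a" starts at the document root, so the sidebar link is matched too.
absolute_hrefs <- xpathSApply(main, "//a", xmlGetAttr, "href")

# ".//a" starts at the context node, so only links inside `main` are matched.
relative_hrefs <- xpathSApply(main, ".//a", xmlGetAttr, "href")

print(absolute_hrefs)
print(relative_hrefs)
```

Here xmlGetAttr is used instead of /@href only so that the result is a plain character vector; the leading dot is what limits the search to the selected table.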

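Alternatively, you can select the third table directly in XPath. Note the parentheses: (//table)[3] is the third table in document order, which matches xpathApply(x, "//table")[[3]], whereas //table[3] matches any table that is the third table child of its own parent.

xpathApply(x, "(//table)[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")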
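The positional predicate is easy to get wrong, so here is a minimal sketch comparing //table[3] with (//table)[3] on a made-up document that contains a nested table. The HTML and the ids t1 to t3 are invented for illustration; the real page may be laid out differently.

```r
library(XML)

# Hypothetical page: t2 is nested inside t1, and t3 is a sibling of t1.
html <- '<html><body>
  <table id="t1"><tr><td>
    <table id="t2"><tr><td>nested</td></tr></table>
  </td></tr></table>
  <table id="t3"><tr><td>third in document order</td></tr></table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Tables that are the third <table> child of their parent: none in this document.
per_parent <- getNodeSet(doc, "//table[3]")

# Third table in document order, the same node as getNodeSet(doc, "//table")[[3]].
in_order <- getNodeSet(doc, "(//table)[3]")
third_id <- xmlGetAttr(in_order[[1]], "id")

print(length(per_parent))
print(third_id)
```

If every table on a page shares the same parent, the two forms select the same node; they diverge as soon as tables are nested or spread across different containers.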
Answered Sep 30 '22 at 19:09 by Vincent Zoonekynd