
Extract links from html table

I'm trying to extract the links from the table at http://ipt.humboldt.org.co/ for the rows that are of type "Specimen". I can get the table from the webpage using the following code:

library(XML)

# Parse the page, collect its <table> nodes, and read the first one into a data frame
sitePage <- htmlParse("http://ipt.humboldt.org.co/")
tableNodes <- getNodeSet(sitePage, "//table")
siteTable <- readHTMLTable(tableNodes[[1]])

However, the links are missing from the result of readHTMLTable.

Jorge Velasquez asked Sep 05 '12 22:09



1 Answer

It ended up being an intricate XPath expression:

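library(XML)

# Parse the page, then select the href of the link in the Name cell
# of every row whose fifth cell reads 'Specimen'.
sitePage <- htmlParse("http://ipt.humboldt.org.co/")
hyperlinksYouNeed <- getNodeSet(sitePage, "//table[@id='resourcestable']
                                        //td[5][.='Specimen']
                                        /preceding-sibling
                                        ::td[3]
                                        /a
                                        /@href")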

Here is the XPath expression explained step by step:

  • //table[@id='resourcestable'] -> selects the main table on the page, the one whose id is 'resourcestable'.

  • //td[5][.='Specimen'] -> keeps the fifth cell of a row only when its text is exactly 'Specimen', which filters the rows down to those that have Type as Specimen.

  • /preceding-sibling -> switches to the preceding-sibling axis, so from that cell we now look backwards along the same row.

  • ::td[3] -> takes the third td, counting backwards from where we are. Be careful: preceding-sibling is a reverse axis, so positions are counted backwards from the current cell. Therefore td[1] is the Type column, td[2] is the Organisation column and td[3] is the Name column we want.

  • /a -> selects the a element inside that cell.

  • /@href -> and finally selects its href attribute, which holds the link itself.
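To see the backwards counting in action without depending on the live site, here is a minimal sketch that runs the same expression against a small inline table. The HTML, the column layout (Logo, Name, Organisation, Type, Subtype), the hrefs, and the variable names links and back are all made up for illustration and are not taken from the real page; only the table id and the XPath steps come from the answer above.

```r
library(XML)

# Hypothetical stand-in for the page: same table id, invented rows and columns.
html <- '
<html><body>
<table id="resourcestable">
  <tr><td>-</td><td><a href="/resource/birds">Birds</a></td><td>Org A</td><td>Occurrence</td><td>Specimen</td></tr>
  <tr><td>-</td><td><a href="/resource/list">Checklist</a></td><td>Org B</td><td>Checklist</td><td>Other</td></tr>
  <tr><td>-</td><td><a href="/resource/plants">Plants</a></td><td>Org C</td><td>Occurrence</td><td>Specimen</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Same expression as in the answer, written on one line.
xp <- "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a/@href"

# Attribute nodes come back as named values; flatten them to a plain character vector.
links <- as.character(unlist(xpathSApply(doc, xp)))
print(links)

# Reverse-axis counting: starting from the first matching fifth cell,
# read the text of the 1st, 2nd and 3rd preceding td.
base <- "(//table[@id='resourcestable']//td[5][.='Specimen'])[1]"
back <- vapply(1:3, function(i) {
  as.character(xpathSApply(doc, sprintf("%s/preceding-sibling::td[%d]", base, i), xmlValue))
}, character(1))
print(back)
```

If the hrefs on the real page turn out to be relative, they would still need to be joined with the site's base URL before use; that step is left out here because the page's actual link format is not shown in the question.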

dimitrisli answered Nov 01 '22 11:11