Extract links from html table

I'm trying to extract the links from the following webpage, http://ipt.humboldt.org.co/, that are of type "Specimen". I can get the table from the webpage using the following code:

library(XML)
# Parse the page, collect every <table> node, and convert the first one to a data frame
sitePage <- htmlParse("http://ipt.humboldt.org.co/")
tableNodes <- getNodeSet(sitePage, "//table")
siteTable <- readHTMLTable(tableNodes[[1]])

However, the links are missing after I use the readHTMLTable command.

asked Sep 05 '12 22:09

Jorge Velasquez

1 Answer

It ended up being an intricate XPath expression:

library(XML)
sitePage <- htmlParse("http://ipt.humboldt.org.co/")
hyperlinksYouNeed <- getNodeSet(sitePage, "//table[@id='resourcestable']
                                           //td[5][.='Specimen']
                                           /preceding-sibling
                                           ::td[3]
                                           /a
                                           /@href")

Here is the XPath expression explained bit by bit:

//table[@id='resourcestable'] -> selects the main table on the page, whose id is 'resourcestable'
//td[5][.='Specimen'] -> keeps only the fifth-column cells whose text is exactly 'Specimen', i.e. the rows that have Type as Specimen
/preceding-sibling -> switches to the preceding-sibling axis, so from here on we look backwards along the row
::td[3] -> takes the third cell, counting backwards from where we are. Be careful: preceding-sibling counts backwards, so td[1] is the Type column, td[2] is the Organisation column and td[3] is the Name column we want.
/a -> selects the a node inside that cell
/@href -> and finally returns the content of its href attribute
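
To try the steps above without hitting the live site, here is a minimal sketch on a small made-up table. The HTML string, its cell contents and the resource.do?r=... URLs are assumptions invented for illustration, not the real page. It also swaps the final /@href step for xmlGetAttr via xpathSApply, which returns a plain character vector instead of a list of attribute nodes.

```r
library(XML)

# Hypothetical stand-in for the live page: the value we filter on is in the
# fifth cell of each row, and the link sits three cells to its left.
html <- '
<table id="resourcestable">
  <tr><td>logo</td><td><a href="resource.do?r=birds">Birds</a></td><td>Org A</td><td>Occurrence</td><td>Specimen</td></tr>
  <tr><td>logo</td><td><a href="resource.do?r=plants">Plants</a></td><td>Org B</td><td>Occurrence</td><td>Observation</td></tr>
  <tr><td>logo</td><td><a href="resource.do?r=fish">Fish</a></td><td>Org C</td><td>Occurrence</td><td>Specimen</td></tr>
</table>'

# asText = TRUE tells htmlParse that the argument is HTML content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# Same path as in the answer, stopping at the <a> node
xpath <- "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a"

# Read the href attribute of every matching <a> node
links <- as.character(xpathSApply(doc, xpath, xmlGetAttr, "href"))
print(links)
```

Only the first and third rows have 'Specimen' in the fifth cell, so only their links come back. If the hrefs on the real page are relative, they would still need to be joined to the site's base URL before use.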

answered Nov 01 '22 11:11

dimitrisli

Tags: html, r, xml, web-scraping
