Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract href in table with importxml in Google spreadsheet

I am trying to pull the href for each row of each table from this website:

http://www.epa.gov/region4/superfund/sites/sites.html#KY

I can pull the table information off using =IMPORTHTML(A1,"table",1) for all 7 tables, but I need the href to the site with the detailed information.

Using =IMPORTxml(A1,"//div[@class='box']") I can pull the information needed from a site like:

http://www.epa.gov/region4/superfund/sites/fedfacs/alarmyaplal.html

but I need to extract the fedfacs/alarmyaplal.html portion for each row on the original page.

I've tried using //@href, but it is not returning any results. I'm thinking it is because the data is structured in a table but I'm stuck on where to go from here.

like image 484
Slocke04 Avatar asked Mar 09 '26 23:03

Slocke04


1 Answers

I'm not sure about any of the Google Spreadsheet functionality, but here's an XPath to select all href attributes of the Kentucky sites (since your first link included the 'ky' anchor):

//body//a[@id='ky']/following-sibling::table[1]/tbody/tr/td[1]/strong/a/@href

This is very specific to the Kentucky table: following-sibling::table[1] means the first table node after, and at the same level of, a[@id='ky'].

like image 198
Zach Young Avatar answered Mar 12 '26 14:03

Zach Young