extract data from raw html in R

Question

I am trying to extract the values of all the values in all tabs from this page. http://www.imd.gov.in/section/hydro/dynamic/rfmaps/weekrain.htm

I first tried downloading as excel. But that was not possible. I am just able to download it as text file. If I try reading directly from webpage I get the raw html page. I am stuck as how to extract these values. Please find the code which I tried till now.

library(RCurl)
require(XML)
url = "http://www.imd.gov.in/section/hydro/dynamic/rfmaps/weekrain.htm"
download.file(url = url, destfile = "E:\indiaprecip")

Santyago · Accepted Answer

Just use function "htmlTreeParse" from XML

library(XML)
html <- htmlTreeParse("http://www.imd.gov.in/section/hydro/dynamic/rfmaps/weekrain.htm",
                     useInternalNodes = T)
xpathSApply(html, "//meta/@name")

But in your case you have another problem. The data which you want to access is located in html frame. Code below can help you to read these data:

library(XML)
library(RCulr)
url <- "http://www.imd.gov.in/section/hydro/dynamic/rfmaps/weekrain.htm"
html <- htmlTreeParse(url, useInternalNodes = T)
frameUrl <- paste("http://www.imd.gov.in/section/hydro/dynamic/rfmaps/",
                  xpathSApply(html, "//frame[1]/@src"),
                  sep = "")

htmlWithData = getURL(frameUrl,
                      httpheader = c("User-Agent" = "RCurl",
                                     "Referer" = url))

dataXml <- htmlTreeParse(htmlWithData, isURL = F, useInternalNodes = T)
xpathSApply(dataXml, "//body/table")

extract data from raw html in R

Tags:

r

xml

web-scraping

rcurl

Arun Raja

1 Answers

Santyago

Recent Activity

Donate For Us

extract data from raw html in R

Tags:

r

xml

web-scraping

rcurl

Arun Raja

1 Answers

Santyago

Related questions

Recent Activity

Donate For Us