Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse XML to R data frame

Tags:

r

xml

I tried to parse an XML file to an R data frame. This link helped me a lot:

How to create an R data frame from an xml file?

But still I was not able to figure out my problem. Here is my code:

data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
xmlToDataFrame(nodes=getNodeSet(data1,"//data"))[c("location","time-layout")]
step1 <- xmlToDataFrame(nodes=getNodeSet(data1,"//location/point"))[c("latitude","longitude")]
step2 <- xmlToDataFrame(nodes=getNodeSet(data1,"//time-layout/start-valid-time"))
step3 <- xmlToDataFrame(nodes=getNodeSet(data1,"//parameters/temperature"))[c("type="hourly"")]

The data frame I want to have is like this:

latitude  longitude   start-valid-time   hourly_temperature
29.803     -82.411  2013-06-19T15:00:00-04:00    91
29.803     -82.411  2013-06-19T16:00:00-04:00    90

I'm stuck at the xmlToDataFrame(), any help would be very much appreciated.

like image 529
Rosa Avatar asked Jun 19 '13 18:06

Rosa


People also ask

How do I get data from XML file in R?

An XML file can be read in R using the function xmlParse() . Then, load data is stored in a list. An XML file can also be read in the form of a data frame by using the xmlToDataFrame() method.

What is R XML?

It stands for Extensible Markup Language (XML). Similar to HTML it contains markup tags. But unlike HTML where the markup tag describes structure of the page, in xml the markup tags describe the meaning of the data contained into he file. You can read a xml file in R using the "XML" package.


4 Answers

Data in XML format are rarely organized in a way that would allow the xmlToDataFrame function to work. You're better off extracting everything in lists and then binding the lists together in a data frame:

require(XML)
data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")

xml_data <- xmlToList(data)

In the case of your example data, getting location and start time is fairly straightforward:

location <- as.list(xml_data[["data"]][["location"]][["point"]])

start_time <- unlist(xml_data[["data"]][["time-layout"]][
    names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])

Temperature data is a bit more complicated. First you need to get to the node that contains the temperature lists. Then you need extract both the lists, look within each one, and pick the one that has "hourly" as one of its values. Then you need to select only that list but only keep the values that have the "value" label:

temps <- xml_data[["data"]][["parameters"]]
temps <- temps[names(temps) == "temperature"]
temps <- temps[sapply(temps, function(x) any(unlist(x) == "hourly"))]
temps <- unlist(temps[[1]][sapply(temps, names) == "value"])

out <- data.frame(
  as.list(location),
  "start_valid_time" = start_time,
  "hourly_temperature" = temps)

head(out)
  latitude longitude          start_valid_time hourly_temperature
1    29.81    -82.42 2013-06-19T16:00:00-04:00                 91
2    29.81    -82.42 2013-06-19T17:00:00-04:00                 90
3    29.81    -82.42 2013-06-19T18:00:00-04:00                 89
4    29.81    -82.42 2013-06-19T19:00:00-04:00                 85
5    29.81    -82.42 2013-06-19T20:00:00-04:00                 83
6    29.81    -82.42 2013-06-19T21:00:00-04:00                 80
like image 74
SchaunW Avatar answered Oct 06 '22 04:10

SchaunW


Use xpath more directly for both performance and clarity.

time_path <- "//start-valid-time"
temp_path <- "//temperature[@type='hourly']/value"

df <- data.frame(
    latitude=data[["number(//point/@latitude)"]],
    longitude=data[["number(//point/@longitude)"]],
    start_valid_time=sapply(data[time_path], xmlValue),
    hourly_temperature=as.integer(sapply(data[temp_path], as, "integer"))

leading to

> head(df, 2)
  latitude longitude          start_valid_time hourly_temperature
1    29.81    -82.42 2014-02-14T18:00:00-05:00                 60
2    29.81    -82.42 2014-02-14T19:00:00-05:00                 55
like image 23
Martin Morgan Avatar answered Oct 06 '22 04:10

Martin Morgan


Here's a partial solution using xml2. Breaking the solution up into smaller pieces generally makes it easier to ensure everything is lined up:

library(xml2)
data <- read_xml("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")

# Point locations
point <- data %>% xml_find_all("//point")
point %>% xml_attr("latitude") %>% as.numeric()
point %>% xml_attr("longitude") %>% as.numeric()

# Start time
data %>% 
  xml_find_all("//start-valid-time") %>% 
  xml_text()

# Temperature
data %>% 
  xml_find_all("//temperature[@type='hourly']/value") %>% 
  xml_text() %>% 
  as.integer()
like image 25
hadley Avatar answered Oct 06 '22 06:10

hadley


You can try the code below:

# Load the packages required to read XML files.
library("XML")
library("methods")

# Convert the input xml file to a data frame.
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)
like image 44
abhishek dandona Avatar answered Oct 06 '22 06:10

abhishek dandona