I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table
So that the output of my R code will be a table with the following columns:
(and with a row for each chemical element, obviously)
I am trying to get at the values inside the page using the XML package, but I seem to be stuck at the very beginning, so I'd appreciate an example of how to do it (and/or links to relevant examples).
library(XML)
library(RCurl)  # provides getURLContent()
base_url<-"http://en.wikipedia.org/wiki/Periodic_table"
base_html<-getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))
[[1]]
attr(,"class")
[1] "XMLNodeSet"
Try this:
library(XML)
URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)
# extract attributes and value of all 'a' tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
m1 <- xpathApply(root, "//table[3]//a", f)
m2 <- suppressWarnings(do.call(rbind, m1))
# extract rows that correspond to chemical symbols
ix <- grep("^[[:upper:]][[:lower:]]{0,2}", m2[, "class"])
m3 <- m2[ix, 1:3]
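colnames(m3) <- c("URL", "Name", "Symbol")
# turn the relative links into absolute URLs
m3[,1] <- sub("^", "http://en.wikipedia.org", m3[,1])
# keep only the first word of each name (drops trailing qualifiers)
m3[,2] <- sub(" .*", "", m3[,2])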
A bit of the output:
> dim(m3)
[1] 118 3
> head(m3)
URL Name Symbol
[1,] "http://en.wikipedia.org/wiki/Hydrogen" "Hydrogen" "H"
[2,] "http://en.wikipedia.org/wiki/Helium" "Helium" "He"
[3,] "http://en.wikipedia.org/wiki/Lithium" "Lithium" "Li"
[4,] "http://en.wikipedia.org/wiki/Beryllium" "Beryllium" "Be"
[5,] "http://en.wikipedia.org/wiki/Boron" "Boron" "B"
[6,] "http://en.wikipedia.org/wiki/Carbon" "Carbon" "C"
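Since the markup of the live page can change, here is a minimal self-contained sketch of the same xpathApply plus do.call(rbind, ...) idea. The HTML string is a made-up two-element stand-in, not the real Wikipedia markup, and the path is simplified to "//table//a" for that reason:

```r
library(XML)

# Hypothetical miniature of the element table (NOT the real Wikipedia markup)
html <- '<html><body><table>
  <tr><td><a href="/wiki/Hydrogen" title="Hydrogen">H</a></td>
      <td><a href="/wiki/Helium" title="Helium">He</a></td></tr>
</table></body></html>'
doc <- htmlParse(html, asText = TRUE)

# one character vector per <a>: its attributes followed by its text
f <- function(x) c(xmlAttrs(x), xmlValue(x))
rows <- xpathApply(doc, "//table//a", f)

# xpathApply returns a list, so bind the vectors into a matrix row by row
m <- do.call(rbind, rows)
colnames(m) <- c("URL", "Name", "Symbol")
m[, "URL"] <- paste0("http://en.wikipedia.org", m[, "URL"])
```

Each row of m then holds the absolute URL, the name and the symbol for one link.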
We can make this more compact by refining the XPath expression further: start with Jeffrey's XPath expression (which very nearly selects just the elements at the top) and add a qualification so that it selects them exactly. In that case xpathSApply can be used, which eliminates the need for do.call or the plyr package. The last bit where we fix up odds and ends is the same as before. This produces a matrix rather than a data frame, which seems preferable since the content is entirely character.
library(XML)
URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)
# extract attributes and value of all a tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
M <- t(xpathSApply(root, "//table[3]/tr/td/a[.!='']", f))[1:118,]
# nicer column names, fix up URLs, fix up Mercury.
colnames(M) <- c("URL", "Name", "Symbol")
M[,1] <- sub("^", "http://en.wikipedia.org", M[,1])
M[,2] <- sub(" .*", "", M[,2])
View(M)
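To see what the [.!=''] qualification does in isolation, here is a small sketch on a made-up HTML fragment (an assumption for illustration, not the real page). The image link has no text, so the predicate drops it, and xpathSApply simplifies the result straight to a matrix:

```r
library(XML)

# Hypothetical fragment: two element links and one image-only link
html <- '<html><body><table>
  <tr><td><a href="/wiki/Hydrogen" title="Hydrogen">H</a></td>
      <td><a href="/wiki/File:x.png" title="image"><img src="x.png"/></a></td>
      <td><a href="/wiki/Helium" title="Helium">He</a></td></tr>
</table></body></html>'
doc <- htmlParse(html, asText = TRUE)

f <- function(x) c(xmlAttrs(x), xmlValue(x))

# [. != ''] keeps only <a> nodes whose text content is non-empty;
# xpathSApply returns one column per node, so transpose to one row per node
M <- t(xpathSApply(doc, "//table//a[. != '']", f))
colnames(M) <- c("URL", "Name", "Symbol")
```

Without the predicate the image link would show up as a third row with an empty Symbol.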
Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!
But alas, this is not what you want:
library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
tables = readHTMLTable(url)
# ... look through the list to find the one you want...
table = tables[3]
table
$`NULL`
Group # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 Period <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 1 1H 2He <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 2 3Li 4Be 5B 6C 7N 8O 9F 10Ne <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 3 11Na 12Mg 13Al 14Si 15P 16S 17Cl 18Ar <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 4 19K 20Ca 21Sc 22Ti 23V 24Cr 25Mn 26Fe 27Co 28Ni 29Cu 30Zn 31Ga 32Ge 33As 34Se 35Br 36Kr
6 5 37Rb 38Sr 39Y 40Zr 41Nb 42Mo 43Tc 44Ru 45Rh 46Pd 47Ag 48Cd 49In 50Sn 51Sb 52Te 53I 54Xe
7 6 55Cs 56Ba * 72Hf 73Ta 74W 75Re 76Os 77Ir 78Pt 79Au 80Hg 81Tl 82Pb 83Bi 84Po 85At 86Rn
8 7 87Fr 88Ra ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 * Lanthanoids 57La 58Ce 59Pr 60Nd 61Pm 62Sm 63Eu 64Gd 65Tb 66Dy 67Ho 68Er 69Tm 70Yb 71Lu <NA> <NA>
11 ** Actinoids 89Ac 90Th 91Pa 92U 93Np 94Pu 95Am 96Cm 97Bk 98Cf 99Es 100Fm 101Md 102No 103Lr <NA> <NA>
The names are gone and the atomic number runs into the symbol.
So back to the drawing board...
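As an aside, if the atomic number and symbol were all you needed, the fused cells could be pulled apart with a regular expression. This is only a sketch on a hand-typed sample of cell values (the cells vector below is an assumption, not output read from the page), and it does nothing to recover the missing names:

```r
# Hypothetical sample of cell values as readHTMLTable() returned them
cells <- c("1H", "2He", "113Uut", "*", NA)

# keep only cells that look like <digits><letters>; grepl() is FALSE for NA
is_element <- !is.na(cells) & grepl("^[0-9]+[A-Za-z]+$", cells)

elements <- data.frame(
  number = as.integer(sub("^([0-9]+).*$", "\\1", cells[is_element])),
  symbol = sub("^[0-9]+", "", cells[is_element]),
  stringsAsFactors = FALSE
)
```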
My DOM walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, keeps only those with a "title" attribute (that's where the element name is; the link text is the symbol), and sticks what you want in a data.frame. It picks up every other such link on the page, too, but we're lucky and the elements are the first 118 such links:
library(XML)
library(plyr)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
# don't forget to parse the HTML, doh!
doc = htmlParse(url)
# get every link in a table cell:
links = getNodeSet(doc, '//table/tr/td/a')
# make a data.frame for each node with non-blank text, link, and 'title' attribute:
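df = ldply(links, function(x) {
  text = xmlValue(x)
  if (text == '') text = NULL
  # note: the 'title' attribute holds the element name; the link text is the symbol
  symbol = xmlGetAttr(x, 'title')
  link = xmlGetAttr(x, 'href')
  # return a one-row data.frame only when all three pieces are present
  if (!is.null(text) && !is.null(symbol) && !is.null(link))
    data.frame(symbol, text, link)
} )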
# only keep the actual elements -- we're lucky they're first!
df = head(df, 118)
head(df)
symbol text link
1 Hydrogen H /wiki/Hydrogen
2 Helium He /wiki/Helium
3 Lithium Li /wiki/Lithium
4 Beryllium Be /wiki/Beryllium
5 Boron B /wiki/Boron
6 Carbon C /wiki/Carbon
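The same filtering can probably be pushed into the XPath expression itself, which avoids the NULL checks and the plyr dependency. A rough sketch on a made-up HTML fragment (the markup and the column names name/symbol/link are my own choices for illustration, not the real page):

```r
library(XML)

# Hypothetical fragment: two element links plus one link without a title
html <- '<html><body><table>
  <tr><td><a href="/wiki/Hydrogen" title="Hydrogen">H</a></td>
      <td><a href="/wiki/Foo">x</a></td>
      <td><a href="/wiki/Helium" title="Helium">He</a></td></tr>
</table></body></html>'
doc <- htmlParse(html, asText = TRUE)

# let XPath require a title, an href and non-blank text
links <- getNodeSet(doc,
  "//table//td/a[@title and @href and normalize-space(.) != '']")

df <- data.frame(
  name   = sapply(links, xmlGetAttr, "title"),
  symbol = sapply(links, xmlValue),
  link   = sapply(links, xmlGetAttr, "href"),
  stringsAsFactors = FALSE
)
```

On the real page you would still need something like head(df, 118) to drop the non-element links.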