R: parse an HTML document and use XPath to get all nodes matching either of two patterns

Tags:

html

r

xpath

I parsed the HTML from the FIFA World Cup website and want to extract all the matches:

 library(XML)
 wcup <- htmlTreeParse("http://www.fifa.com/worldcup/matches/", useInternalNodes=TRUE)

However, the class attribute for one country is 't-nText kern', while for all the other countries it is 't-nText ' (with a trailing space).

 <span class="t-nText kern">Bosnia and Herzegovina</span>

Therefore, a query on 't-nText ' alone misses 'Bosnia and Herzegovina':

xpathSApply(wcup, "//span[@class='t-nText ']", xmlValue)

So, is there any way to search for both class values, 't-nText ' and 't-nText kern', at the same time? Or is there another solution? I want to keep the matches in their original order.

XPath doesn't accept '||' as a logical OR:

xpathSApply(wcup, "//span[@class='t-nText ' || 't-nText kern']", xmlValue)
XPath error : Invalid expression
//span[@class='t-nText ' || 't-nText kern']
                          ^
XPath error : Invalid expression
//span[@class='t-nText ' || 't-nText kern']
                                          ^
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //span[@class='t-nText ' || 't-nText kern']
Vahid Mirjalili asked Jun 10 '14 23:06



1 Answer

XPath spells logical OR as the keyword 'or'. Use that, or perhaps 'starts-with()':

wcup["//span[@class='t-nText kern' or @class='t-nText ']"]
wcup["//span[starts-with(@class, 't-nText ')]"]
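
As a rough, self-contained sketch of the two expressions, the snippet below runs them against a small inline HTML string instead of the live FIFA page. The markup, the team list, and the variable names (html, doc, teams_or, teams_sw) are made up for illustration; only the class values come from the question. It assumes the XML package is installed.

```r
library(XML)

# Hypothetical stand-in for the FIFA page so the example runs offline.
# Only the class values 't-nText ' and 't-nText kern' come from the question.
html <- '<html><body>
  <span class="t-nText ">Brazil</span>
  <span class="t-nText ">Croatia</span>
  <span class="t-nText kern">Bosnia and Herzegovina</span>
  <span class="t-nText ">Argentina</span>
  <span class="other">not a team</span>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Logical OR in XPath is the keyword `or`, applied to two full comparisons.
teams_or <- xpathSApply(
  doc,
  "//span[@class='t-nText kern' or @class='t-nText ']",
  xmlValue
)

# starts-with() matches every class value that begins with 't-nText '.
teams_sw <- xpathSApply(
  doc,
  "//span[starts-with(@class, 't-nText ')]",
  xmlValue
)

print(teams_or)
print(teams_sw)
```

Both calls should return the same four names in document order, which covers the requirement to keep the matches in their original sequence. Note that starts-with() is the broader test: it would also pick up any other class value beginning with 't-nText ', so the explicit 'or' form is the safer choice if the page uses further variants.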
Martin Morgan answered Sep 22 '22 08:09
