
How to parse an HTML string using R?

Tags: regex, r

How can I extract the data item from this HTML string

a <- "<div class=\"tst-10\">100%</div>"

so that the result is 100%? The main idea is to get the text between > and <.

jrara asked Apr 11 '26 08:04



1 Answer

I would use gsub() in this case:

gsub("(<.*>)(.*)(<.*>)", "\\2", a)
[1] "100%"

Basically, this breaks the string up into three parts, each enclosed in parentheses ( and ). Each parenthesized part is a capture group, and we can refer back to what it matched with a backreference. The contents matched by the first group can be referred to as \1 (in an R string, write it with a double backslash, "\\1", so that the backslash itself is escaped), those matched by the second as \2, and so on.

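So, essentially, we're saying: parse this string, figure out what matches my conditions, and replace the whole match with only the second backreference. Because the pattern matches the entire string here, only the text captured by the second group is returned.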
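
To see the three groups individually, the same pattern can be reused with a different backreference each time. This is a minimal sketch using base R's sub(); the variable names (pattern, opening, inner, closing) are illustrative and do not come from the original answer.

```r
a <- "<div class=\"tst-10\">100%</div>"
pattern <- "(<.*>)(.*)(<.*>)"

# Replace the whole match with one capture group at a time.
opening <- sub(pattern, "\\1", a)  # first group: the opening tag
inner   <- sub(pattern, "\\2", a)  # second group: the text between the tags
closing <- sub(pattern, "\\3", a)  # third group: the closing tag

print(c(opening, inner, closing))
```

Only the second call returns the value asked for in the question; the other two show what the surrounding groups consumed.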

Piece by piece:

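  • <.*> says to look for a "<", followed by any number of any characters (".*"), followed by a ">". Because .* is greedy, it matches as much as it can while still letting the rest of the pattern match.
  • .* on its own means to match any number of characters, up to the point where the next part of the pattern has to begin.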
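
The greedy behaviour of .* matters once a string holds more than one tag pair. The sketch below uses a hypothetical input, b, that is not part of the question; with it, the pattern should return only the text of the last element, because the first group stretches across everything before it.

```r
# Hypothetical input with two tag pairs (not from the question).
b <- "<b>1</b><b>2</b>"

# The first group's greedy .* runs through to the last opening tag,
# so only the text of the final element is left for the second group.
last_only <- gsub("(<.*>)(.*)(<.*>)", "\\2", b)
print(last_only)
```

For a single element, as in the question, this makes no difference.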

Keeping this in mind, you could also use gsub("(.*>)(.*)(<.*)", "\\2", a) and get the same result for this string.
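
A quick way to check that the two patterns agree on the example string is to compare their results directly. The last line is a different technique, included only as an assumption-laden alternative and not taken from the answer: it deletes anything that looks like a tag instead of capturing the text between tags. The names res_full, res_short and res_strip are illustrative.

```r
a <- "<div class=\"tst-10\">100%</div>"

# The two patterns discussed in the answer.
res_full  <- gsub("(<.*>)(.*)(<.*>)", "\\2", a)
res_short <- gsub("(.*>)(.*)(<.*)", "\\2", a)
print(identical(res_full, res_short))

# Alternative sketch (not from the answer): remove every "<...>" run.
# [^>]* matches any characters except ">", so each tag is matched separately.
res_strip <- gsub("<[^>]*>", "", a)
print(res_strip)
```

None of these is a real HTML parser; they are regular-expression shortcuts that suit short, simple strings like the one in the question.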

A5C1D2H2I1M1N2O1R2T1 answered Apr 12 '26 21:04



