I have a series of expressions such as:
"<i>the text I need to extract</i></b></a></div>"
I need to extract the text between the <i> and </i> "symbols". That is, the result should be:
"the text I need to extract"
At the moment I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression to extract the text between <i> and </i>?
Thanks.
This can be done in base R, with no additional packages. The regexpr function finds the location of the first regular expression match in a character string, and regmatches then extracts the matched text.
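A minimal base R sketch of that approach, applied to the string from the question. The pattern (a lookbehind for <i> and a lookahead for </i>) and the variable names are my own choices for illustration, not something given in the thread:

```r
# Example string taken from the question.
x <- "<i>the text I need to extract</i></b></a></div>"

# Lookbehind (?<=<i>) and lookahead (?=</i>) require perl = TRUE.
# The lazy quantifier .*? stops at the first closing </i>.
pattern <- "(?<=<i>).*?(?=</i>)"

# regexpr returns the start position of the first match,
# with the match length stored as an attribute.
m <- regexpr(pattern, x, perl = TRUE)

# regmatches uses that match data to pull out the matched substring.
extracted <- regmatches(x, m)
print(extracted)
```

If a string can contain several <i>...</i> pairs, gregexpr can be swapped in for regexpr; regmatches then returns a list with one character vector per input string.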
One of the simplest ways is to use a regular expression. Don’t worry if the terminology feels unfamiliar. Its usage is simple: describe the pattern that matches the text, then extract the desired part from that text.
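As a rough illustration of "describe the pattern, then extract the part you want", here is a sketch using sub with a capture group. The pattern is an assumption on my part and only handles the first <i>...</i> pair in each string:

```r
# Example string taken from the question.
x <- "<i>the text I need to extract</i></b></a></div>"

# The parentheses capture whatever sits between the first <i> and the
# next </i>; "\\1" replaces the whole string with that captured group.
# perl = TRUE enables the lazy quantifier .*?
result <- sub("^.*?<i>(.*?)</i>.*$", "\\1", x, perl = TRUE)
print(result)
```

Note that sub returns the input unchanged when the pattern does not match, so strings without an <i>...</i> pair would need to be checked separately, for example with grepl.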
The stringr package has a function called str_extract_all that extracts every match of a pattern from each string. It takes two arguments: first the strings of interest, and second the pattern to be extracted.
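A short sketch of str_extract_all on this problem, assuming the stringr package is installed. The second input string and the lookaround pattern are hypothetical, added only to show what happens with more than one match:

```r
library(stringr)  # assumption: stringr is installed

# First string is from the question; the second is a made-up example
# with two <i>...</i> pairs.
x <- c("<i>the text I need to extract</i></b></a></div>",
       "<i>first</i> and <i>second</i>")

# Match the text preceded by <i> and followed by </i>, lazily.
matches <- str_extract_all(x, "(?<=<i>).*?(?=</i>)")

# The result is a list with one character vector per input string.
print(matches)
```

Because the result is a list, unlist(matches) can flatten it when the grouping by input string is not needed.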
You can see that regex uses all kinds of symbols to communicate patterns. The Stringr Cheat Sheet is a helpful guide for when you want to develop your own patterns. This website provides an easy way of testing regex patterns.
If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:
library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"
On an entire HTML document, you can use:
doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)