
Extract text between certain symbols using Regular Expression in R

Tags:

regex

r

I have a series of expressions such as:

"<i>the text I need to extract</i></b></a></div>"

I need to extract the text between the <i> and </i> "symbols". That is, the result should be:

"the text I need to extract"

At the moment I am using gsub in R to manually remove every symbol that is not part of the text. However, I would like to use a single regular expression to do the job. Does anyone know a regular expression that extracts the text between <i> and </i>?

Thanks.

Javier asked Nov 07 '14 20:11



1 Answer

If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:

library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"

On an entire HTML document, you can extract the text of every <i> element with:

doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)
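
If a regular expression is still wanted, as the question asks, the following is a minimal base R sketch rather than part of the parser approach above. It assumes the <i> elements are not nested and carry no attributes; the variable names first and all_matches are only illustrative. Both patterns use perl = TRUE so that the non-greedy quantifier and the lookarounds are available.

```r
x <- "<i>the text I need to extract</i></b></a></div>"

# Option 1: replace the whole string with the first captured group.
# The non-greedy (.*?) stops at the first closing </i>.
# Note: if the pattern does not match, sub() returns x unchanged.
first <- sub(".*?<i>(.*?)</i>.*", "\\1", x, perl = TRUE)

# Option 2: return every match in the string. The lookbehind (?<=<i>)
# and lookahead (?=</i>) keep the tags themselves out of the result.
all_matches <- regmatches(x, gregexpr("(?<=<i>).*?(?=</i>)", x, perl = TRUE))[[1]]

print(first)
print(all_matches)
```

For the sample string both options give "the text I need to extract". On real-world HTML (nested tags, attributes, line breaks inside an element) these patterns can miss or mangle matches, which is why the parser is usually the safer choice.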
Rich Scriven answered Oct 14 '22 14:10