Multiple regexpr in one string in R

Tags:

regex

r

So I have a really long string and I want to work with multiple matches. I can only seem to get the first position of the first match using regexpr. How can I get multiple positions (more matches) back within the same string?

I am looking for a specific string in HTML source code: the title of an auction, which sits between HTML tags. It proves rather difficult to find:

So far I use this:

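# Position just after each opening tag (the tag itself is 28 characters long)
locationstart <- gregexpr("<span class=\"location-name\">", URL)[[1]] + 28
# Offset of the next "<" within a 100-character window after the first tag
locationend <- regexpr("<", substring(URL, locationstart[1], locationstart[1] + 100))
substring(URL, locationstart[1], locationstart[1] + locationend - 2)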

That is, I search for the markup that precedes a title, record that position, and from there look for the next "<", which marks the end of the title. I'm open to more specific suggestions.

PascalVKooten asked May 06 '13 15:05



2 Answers

Using gregexpr allows for multiple matches.

> x <- c("only one match", "match1 and match2", "none here")
> m <- gregexpr("match[0-9]*", x)
> m
[[1]]
[1] 10
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

[[2]]
[1]  1 12
attr(,"match.length")
[1] 6 6
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

If you want to extract the matches themselves, regmatches will do that for you.

> regmatches(x, m)
[[1]]
[1] "match"

[[2]]
[1] "match1" "match2"

[[3]]
character(0)
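
Applied to the question, the same two calls can be sketched as follows. This is only an illustration under stated assumptions: the `html` string is a made-up stand-in for the real page source, and the pattern assumes each title contains no "<" character.

```r
# Hypothetical sample input standing in for the page source from the question
html <- paste0(
  '<div><span class="location-name">Amsterdam</span></div>',
  '<div><span class="location-name">Rotterdam</span></div>'
)

# Match the opening tag plus everything up to (not including) the next "<"
open_tag <- '<span class="location-name">'
m <- gregexpr(paste0(open_tag, "[^<]*"), html)

# Start positions and lengths of all matches in the first (only) string
starts <- as.vector(m[[1]])
lens <- attr(m[[1]], "match.length")

# regmatches() returns a list with one character vector per input string
hits <- regmatches(html, m)[[1]]

# Strip the opening tag so only the text between the tags remains
titles <- sub(open_tag, "", hits, fixed = TRUE)
print(titles)
```

Because gregexpr returns every start position, there is no need to search a fixed-width window after the first tag.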
Dason answered Oct 23 '22 19:10



gregexpr and regmatches, as suggested in Dason's answer, allow extracting multiple instances of a regex pattern from a string. This solution also has the advantage of relying exclusively on base R rather than requiring an additional package.

Nevertheless, I'd like to suggest an alternative based on the stringr package. In general, this package makes it easier to work with character strings: it covers most of the functionality of the various string-support functions of base R (not just the regex-related ones) with a set of intuitively named functions and a consistent API. stringr functions do not merely replace base R functions; in many cases they add features. For example, the regex-related functions of stringr are vectorized over both the string and the pattern.

For the specific question of extracting multiple matches from a long string, either str_extract_all or str_match_all can be used, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted using list/matrix subscripts, unlist, or other approaches such as lapply and sapply. The point is that the stringr functions return structures from which we can pick out just what we want.

library(stringr)

# Simulate HTML input, using bogus html tags to mark the target texts. The demo works
# the same for actual HTML patterns; the regular expression is just a bit more complex.
htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
                 "sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
                 "risus ipsum, aenean quis, sapien",
                 "in lorem, condimentum ornare viverra",
                 "suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
                 "dolor mauris tellus, dui leo purus varius")

# str_extract_all() returns the whole match, so a bit of extra work is needed
# to remove the leading tag and the trailing "<"
str_extract_all(htmlInput, "(<blah>)([^<]+)<")
# [[1]]
# [1] "<blah>MATCH_ONE<"  "<blah>MATCH2<"     "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"

str_match_all(htmlInput,  "<blah>([^<]+)<")[[1]][, 2]
# [1] "MATCH_ONE"  "MATCH2"     "MATCH Nr 3" "LAST MATCH"
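
When the input is a vector of strings rather than a single string, the list returned by str_match_all can be processed element by element. The following is a minimal sketch of that adaptation; the `pages` vector is invented for illustration and reuses the bogus `<blah>` tags from above.

```r
library(stringr)

# Hypothetical vector of inputs; the second element contains no match
pages <- c("a<blah>ONE<blah>b<blah>TWO<blah>",
           "no tags here",
           "<blah>THREE<blah>")

# One character matrix per input string:
# column 1 is the full match, column 2 is the first capture group
matches <- str_match_all(pages, "<blah>([^<]+)<")

# Keep only the captured text for each input string
per_page <- lapply(matches, function(mat) mat[, 2])

# Flatten into a single character vector when the grouping is not needed
all_hits <- unlist(per_page)
print(all_hits)
```

Keeping `per_page` as a list preserves which input each match came from; an input without matches yields an empty character vector.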
mjv answered Oct 23 '22 18:10
