So I have a really long string and I want to work with multiple matches. I can only seem to get the first position of the first match using regexpr
. How can I get multiple positions (more matches) back within the same string?
I am looking for a specific string in html source code. The titel of an auction (which is between html tags). It prooves kind of difficult to find:
So far I use this:
locationstart <- gregexpr("<span class=\"location-name\">", URL)[[1]]+28
locationend <- regexpr("<", substring(URL, locationstart[1], locationend[1] + 100))
substring(URL, locationstart[1], locationstart[1] + locationend - 2)
That is, I look for a part that comes before a title, then I capture that place, from there on look for a "<" indicating that the title ended. I'm open for more specific suggestions.
Example 2: Apply grep & grepl with Multiple Patterns We simply need to insert an |-operator between the patterns we want to search for. As you can see, both functions where searching for multiple pattern in the previous R code (i.e. “a” or “c”).
In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart, for example, \w for word character and \W for non-word character; \d for digit and \D or non-digit.
R Functions for Pattern Matchinggrep(pattern, string) returns by default a list of indices. If the regular expression, pattern, matches a particular element in the vector string, it returns the element's index. For returning the actual matching element values, set the option value to TRUE by value=TRUE .
By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.
Using gregexpr
allows for multiple matches.
> x <- c("only one match", "match1 and match2", "none here")
> m <- gregexpr("match[0-9]*", x)
> m
[[1]]
[1] 10
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1 12
attr(,"match.length")
[1] 6 6
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
and if you're looking to extract the match you can use regmatches
to do that for you.
> regmatches(x, m)
[[1]]
[1] "match"
[[2]]
[1] "match1" "match2"
[[3]]
character(0)
gregexpr
and regmatches
as suggested in Dason's answer allow extracting multiple instance of a regex pattern in a string. Furthermore this solution has the advantage of relying exclusively on the {base}
package of R rather than requiring an additional package.
Never the less, I'd like to suggest an alternative solution based on the stringr package. In general, this package makes it easier to work with character strings by providing most of the functionality of the various string-support functions of base R (not just the regex-related functions), with a set of functions intuitively named and offering a consistent API. Indeed stringr functions not merely replace base R functions, but in many cases introduce additional features; for example the regex-related functions of stringr are vectorized for both the string and the pattern.
Specifically for the question of extracting multiple patterns in a long string, either str_extract_all
and str_match_all
can be used as shown below. Depending on the fact that the input is a single string or a vector of it, the logic can be adapted, using list/matrix subscripts, unlist
or other approaches like lapply
, sapply
etc. The point is that the stringr functions return structures that can be used to access just what we want.
# simulate html input. (Using bogus html tags to mark the target texts; the demo works
# the same for actual html patterns, the regular expression is just a bit more complex.
htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
"sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
"risus ipsum, aenean quis, sapien",
"in lorem, condimentum ornare viverra",
"suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
"dolor mauris tellus, dui leo purus varius")
# str_extract() may need a bit of extra work to remove the leading and trailing parts
str_extract_all(htmlInput, "(<blah>)([^<]+)<")
# [[1]]
# [1] "<blah>MATCH_ONE<" "<blah>MATCH2<" "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"
str_match_all(htmlInput, "<blah>([^<]+)<")[[1]][, 2]
# [1] "MATCH_ONE" "MATCH2" "MATCH Nr 3" "LAST MATCH"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With