Multiple regexpr in one string in R

Q: How do I grep multiple patterns in r?

To apply grep and grepl with multiple patterns, insert an | operator between the patterns you want to search for. For example, with the pattern "a|c" both functions match every element that contains "a" or "c".
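A minimal sketch of the alternation described above; the vector x and its contents are made up for illustration:

```r
# Hypothetical input vector (not from the original question).
x <- c("apple", "banana", "cherry", "kiwi")

# "a|c" matches any element containing "a" or "c".
pattern <- "a|c"

idx  <- grep(pattern, x)    # indices of matching elements
hits <- grepl(pattern, x)   # one logical per element

print(idx)    # 1 2 3
print(hits)   # TRUE TRUE TRUE FALSE
```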

Q: What does \d mean in RegEx?

In regex, an uppercase metacharacter denotes the inverse of its lowercase counterpart: \w matches a word character and \W a non-word character; \d matches a digit and \D a non-digit. In an R string literal the backslash itself has to be escaped, so the pattern is written "\\d".
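As a small illustration of \d versus \D in base R (the sample string s is invented for this sketch):

```r
# Hypothetical sample text.
s <- "order 66 shipped in 3 boxes"

# \d+ extracts each run of digits; note the doubled backslash in the R string.
digit_runs <- regmatches(s, gregexpr("\\d+", s))[[1]]

# \D is the inverse class: deleting every non-digit leaves only the digits.
digits_only <- gsub("\\D", "", s)

print(digit_runs)    # "66" "3"
print(digits_only)   # "663"
```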

Q: How do I match a pattern in r?

R functions for pattern matching: grep(pattern, string) returns by default an integer vector of indices. For each element of the character vector string that the regular expression pattern matches, it returns that element's index. To return the matching elements themselves instead, set value = TRUE.
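The difference between the two return modes, sketched on a made-up vector:

```r
# Hypothetical input vector.
fruits <- c("apple", "banana", "cherry", "mango")

# Default: positions of the elements that match.
positions <- grep("an", fruits)

# value = TRUE: the matching elements themselves.
elements <- grep("an", fruits, value = TRUE)

print(positions)   # 2 4
print(elements)    # "banana" "mango"
```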

Q: What will the '$' regular expression match?

By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.
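A short sketch of the anchors in action; the vector words is an assumption made for the example:

```r
# Hypothetical input vector.
words <- c("cat", "concat", "catalog")

starts_with <- grepl("^cat", words)    # "cat" at the start of the string
ends_with   <- grepl("cat$", words)    # "cat" at the end of the string
exact       <- grepl("^cat$", words)   # the whole string is "cat"

print(starts_with)   # TRUE FALSE TRUE
print(ends_with)     # TRUE TRUE FALSE
print(exact)         # TRUE FALSE FALSE
```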

Tags:

regex

r

So I have a really long string and I want to work with multiple matches. I can only seem to get the first position of the first match using regexpr. How can I get multiple positions (more matches) back within the same string?

I am looking for a specific string in HTML source code: the title of an auction, which sits between HTML tags. It proves kind of difficult to find.

So far I use this:

locationstart <- gregexpr("<span class=\"location-name\">", URL)[[1]]+28
locationend <- regexpr("<", substring(URL, locationstart[1], locationstart[1] + 100))
substring(URL, locationstart[1], locationstart[1] + locationend - 2)

That is, I look for the part that comes before a title, record that position, and from there look for a "<" indicating that the title has ended. I'm open to more specific suggestions.


asked May 06 '13 15:05

PascalVKooten

2 Answers

Using gregexpr allows for multiple matches.

> x <- c("only one match", "match1 and match2", "none here")
> m <- gregexpr("match[0-9]*", x)
> m
[[1]]
[1] 10
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

[[2]]
[1]  1 12
attr(,"match.length")
[1] 6 6
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

If you're looking to extract the matches themselves, you can use regmatches to do that for you.

> regmatches(x, m)
[[1]]
[1] "match"

[[2]]
[1] "match1" "match2"

[[3]]
character(0)
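Applied to the situation in the question, the same two functions could be combined roughly as follows. This is only a sketch: the html string and the two titles in it are invented, and it assumes every title directly follows the span tag and contains no "<".

```r
# Hypothetical page source with two auction titles (not the asker's real data).
html <- paste0(
  '<div><span class="location-name">Vintage clock</span></div>',
  '<div><span class="location-name">Oak table</span></div>'
)

open_tag <- '<span class="location-name">'

# Match the opening tag plus everything up to the next "<".
m    <- gregexpr(paste0(open_tag, "[^<]*"), html)
hits <- regmatches(html, m)[[1]]

# Drop the opening tag from the front of each match.
titles <- substring(hits, nchar(open_tag) + 1)

print(titles)   # "Vintage clock" "Oak table"
```

Because gregexpr returns every match position, no manual offset arithmetic on the first match is needed.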


answered Oct 23 '22 19:10

Dason

gregexpr and regmatches, as suggested in Dason's answer, allow extracting multiple instances of a regex pattern in a string. Furthermore, this solution has the advantage of relying exclusively on the {base} package of R rather than requiring an additional package.

Nevertheless, I'd like to suggest an alternative solution based on the stringr package. In general, this package makes it easier to work with character strings: it covers most of the functionality of the various string-support functions of base R (not just the regex-related ones) with a set of intuitively named functions and a consistent API. Indeed, stringr functions do not merely replace base R functions but in many cases introduce additional features; for example, the regex-related functions of stringr are vectorized over both the string and the pattern.

Specifically for the question of extracting multiple patterns from a long string, either str_extract_all or str_match_all can be used, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted using list/matrix subscripts, unlist, or other approaches such as lapply and sapply. The point is that the stringr functions return structures that can be used to access just what we want.

library(stringr)

# Simulate HTML input. (Bogus HTML tags mark the target texts; the demo works the same
# for actual HTML patterns, the regular expression is just a bit more complex.)
htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
                 "sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
                 "risus ipsum, aenean quis, sapien",
                 "in lorem, condimentum ornare viverra",
                 "suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
                 "dolor mauris tellus, dui leo purus varius")

# str_extract_all() returns the whole match, so it may need a bit of extra work
# to remove the leading and trailing parts
str_extract_all(htmlInput, "(<blah>)([^<]+)<")
# [[1]]
# [1] "<blah>MATCH_ONE<"  "<blah>MATCH2<"     "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"

str_match_all(htmlInput,  "<blah>([^<]+)<")[[1]][, 2]
# [1] "MATCH_ONE"  "MATCH2"     "MATCH Nr 3" "LAST MATCH"
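For a vector of strings rather than a single string, one possible adaptation is to pull the capture-group column out of each matrix with lapply. This is a sketch under stated assumptions: the inputs vector is made up, and it needs the stringr package to be installed.

```r
library(stringr)

# Hypothetical vector input; the second element contains no match.
inputs <- c("a<blah>ONE<blah>b<blah>TWO<blah>",
            "nothing here",
            "<blah>THREE<blah>")

# str_match_all() returns one matrix per input string:
# column 1 is the full match, column 2 the first capture group.
per_string <- lapply(str_match_all(inputs, "<blah>([^<]+)<"),
                     function(m) m[, 2])

# Flatten to a single character vector if the grouping is not needed.
all_matches <- unlist(per_string)

print(per_string)
print(all_matches)   # "ONE" "TWO" "THREE"
```

Keeping the list form preserves which input each match came from; unlist is only convenient when that grouping does not matter.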

answered Oct 23 '22 18:10

mjv
