Suppose I have a string like "A B C (123-456-789)". What is the best way to retrieve "123-456-789" from it?
strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"
If we want to extract the digits and the - between the parentheses, one option is str_extract from stringr. If a string contains multiple matches, use str_extract_all (str1 and str2 are defined further down, after the rm_round example).
library(stringr)
str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
#[1] "123-456-789"
str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')
The code above uses regex lookarounds to extract the digits and the -. The positive lookbehind in (?<=\\()[0-9-]+ matches digits along with - ([0-9-]+) only when they are immediately preceded by (, so it matches in (123-456-789 and not in 123-456-789. Similarly, the lookahead in [0-9-]+(?=\\)) matches digits along with - only when they are immediately followed by ), so it matches in 123-456-789) and not in 123-456-789. Taken together, the pattern matches only the cases that satisfy both conditions, such as (123-456-789), and extracts what lies between the lookarounds. It does not match cases like (123-456-789 or 123-456-789).
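For readers without stringr, the same lookaround pattern can be tried in base R. This is a minimal sketch, assuming perl = TRUE is acceptable (base R needs it for lookarounds); the names pattern, first_match, partial and no_match are only illustrative.

```r
# Base R counterpart of the str_extract call; lookarounds require perl = TRUE.
str1 <- "A B C (123-456-789)"
pattern <- "(?<=\\()[0-9-]+(?=\\))"

# regexpr() locates the first match; regmatches() returns the matched text.
first_match <- regmatches(str1, regexpr(pattern, str1, perl = TRUE))
print(first_match)

# Inputs with only one of the two parentheses satisfy only one lookaround,
# so nothing is extracted from them.
partial <- c("(123-456-789", "123-456-789)")
no_match <- regmatches(partial, regexpr(pattern, partial, perl = TRUE))
print(length(no_match))
```

Note that regmatches() with regexpr() drops elements that have no match, whereas str_extract returns NA for them.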
With strsplit, you can specify the split as [()]. We keep the () inside square brackets so that they are treated as literal characters; otherwise we have to escape the parentheses ('\\(|\\)').
strsplit(str1, '[()]')[[1]][2]
#[1] "123-456-789"
If there are multiple substrings to extract from a string, we could loop with lapply and extract the numeric split parts with grep:
lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
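To make the lapply call easier to follow, here is a sketch of its two steps applied to a single element of str2. The variable names s, pieces and numeric_pieces are introduced only for illustration.

```r
# One element of str2 that contains two parenthesised groups.
s <- "ABC (123-423-389) GR (124-233-848) AK"

# Step 1: split on either parenthesis; text outside the parentheses is kept too.
pieces <- strsplit(s, "[()]")[[1]]
print(pieces)

# Step 2: keep only the pieces that contain at least one digit.
numeric_pieces <- grep("\\d", pieces, value = TRUE)
print(numeric_pieces)
```

Because grep keeps any piece containing a digit, text outside the parentheses that happens to include digits would be returned as well.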
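Or we can use stri_split_regex from stringi, which has the option to remove the empty strings as well (omit_empty=TRUE). Here the split pattern also includes capital letters and the space, so only the digits and - remain.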
library(stringi)
stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
#[1] "123-456-789"
stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)
Another option is rm_round from qdapRegex, if we are interested in extracting the contents inside the parentheses.
library(qdapRegex)
rm_round(str1, extract=TRUE)[[1]]
#[1] "123-456-789"
rm_round(str2, extract=TRUE)
The data used in the examples above:
str1 <- "A B C (123-456-789)"
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
"(123-423-498) ABCDD",
"(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")
Or with sub from base R:
sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"
Explanation:
[^(]+ : matches anything except an opening bracket
\\( : matches an opening bracket, which is just before what you want
([^)]+) : matches the pattern you want to capture (which is then retrieved in replacement="\\1"), which is anything except a closing bracket
\\).* : matches a closing bracket followed by anything, 0 or more times
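One consequence of the leading [^(]+ is that at least one character must come before the opening bracket. The sketch below shows this on an input that starts with a bracket, where sub finds no match and returns the string unchanged. The relaxed pattern using [^(]* is a suggested variation, not part of the answer above, and the variable names are illustrative.

```r
inputs <- c("A B C (123-456-789)", "(123-432-423)")

# Pattern from the answer: requires one or more non-"(" characters first.
strict <- sub("[^(]+\\(([^)]+)\\).*", "\\1", inputs)
print(strict)

# Assumed variation: "*" allows zero characters before the opening bracket.
relaxed <- sub("[^(]*\\(([^)]+)\\).*", "\\1", inputs)
print(relaxed)
```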
Another option with look-ahead and look-behind
sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"
The capture groups in sub will target your desired output:
sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"
Extra check to make sure I pass @akrun's extended example:
sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
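Note that for the last element of str2, which contains two parenthesised groups, the greedy .* makes sub return only the last one. If every group is wanted, one possible approach is gregexpr with regmatches. This is a sketch using base R with perl = TRUE, and the names s, last_only and all_groups are illustrative.

```r
s <- "ABC (123-423-389) GR (124-233-848) AK"

# The greedy ".*" consumes up to the last "(", so only the final group survives.
last_only <- sub(".*\\((.*)\\).*", "\\1", s)
print(last_only)

# gregexpr() finds every match; the lookarounds keep the parentheses out of the result.
all_groups <- regmatches(s, gregexpr("(?<=\\()[^)]+(?=\\))", s, perl = TRUE))[[1]]
print(all_groups)
```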
You may try these gsub calls (here x is the input string, "A B C (123-456-789)").
> gsub("[^\\d-]", "", x, perl=T)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"
A few more:
> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=T)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=T)
[1] "123-456-789"
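Most of these gsub patterns work by deleting every character that is not a digit or a -, so they rely on the text outside the parentheses containing neither. The sketch below illustrates the difference on a made-up input y (an assumption for illustration, not from the question) that has a digit outside the parentheses.

```r
x <- "A B C (123-456-789)"
y <- "Room 4 (123-456-789)"  # hypothetical input with a digit outside the parentheses

# Deleting all non-digit, non-hyphen characters also keeps the stray "4".
strip_x <- gsub("[^0-9-]", "", x)
strip_y <- gsub("[^0-9-]", "", y)
print(c(strip_x, strip_y))

# Removing everything up to "(" and the closing ")" is anchored on the parentheses.
paren_y <- gsub(".*\\(|\\)", "", y)
print(paren_y)
```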