strsplit by parentheses [duplicate]

Question

Suppose I have a string like "A B C (123-456-789)", I'm wondering what's the best way to retrieve "123-456-789" from it.

strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"

akrun · Accepted Answer

If we want to extract the digits and - between the parentheses, one option is str_extract. If a string may contain several matches, use str_extract_all.

 library(stringr)
 str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
 #[1] "123-456-789"
 str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')

The code above uses regex lookarounds to extract the digits and the -. The positive lookbehind in (?<=\\()[0-9-]+ matches digits and - ([0-9-]+) only when they are preceded by an opening parenthesis, so it matches in (123-456-789 but not in 123-456-789. Similarly, the lookahead in [0-9-]+(?=\\)) matches digits and - only when they are followed by a closing parenthesis, so it matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only runs that satisfy both conditions, as in (123-456-789), and extracts what lies between the lookarounds; it does not match cases like (123-456-789 or 123-456-789).
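The same lookaround idea can be tried without stringr. The following is a minimal base R sketch, not part of the original answer; the variable names `s`, `half_open`, `inside` and `no_match` are illustrative. It assumes `perl = TRUE`, since lookarounds are a PCRE feature.

```r
s <- "A B C (123-456-789)"
pat <- "(?<=\\()[0-9-]+(?=\\))"

# regexpr() locates the first match; regmatches() pulls out its text.
inside <- regmatches(s, regexpr(pat, s, perl = TRUE))
print(inside)

# A run with only the opening parenthesis fails the lookahead,
# so nothing is returned (a zero-length character vector).
half_open <- "A B C (123-456-789"
no_match <- regmatches(half_open, regexpr(pat, half_open, perl = TRUE))
print(length(no_match))
```

Because the lookarounds consume no characters, the parentheses themselves never appear in the result.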

With strsplit, you can specify the split as [()]. We put the parentheses inside square brackets so that they are treated as literal characters; otherwise we have to escape them ('\\(|\\)').

 strsplit(str1, '[()]')[[1]][2]
 #[1] "123-456-789"
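As a small illustration of the two spellings, the sketch below (base R only, with illustrative variable names) splits the same string with the character class and with the escaped alternation. It also shows what happens when the string starts with a parenthesis: strsplit keeps a leading empty string, so the second element is still the one of interest.

```r
s <- "A B C (123-456-789)"

by_class  <- strsplit(s, "[()]")[[1]]      # parentheses inside a character class
by_escape <- strsplit(s, "\\(|\\)")[[1]]   # escaped parentheses with alternation
print(identical(by_class, by_escape))

# A match at the very start yields "" as the first piece.
leading <- strsplit("(123-432-423)", "[()]")[[1]]
print(leading)
```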

If there are multiple substrings to extract from a string, we could loop with lapply and extract the numeric split parts with grep

 lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
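To make the split-then-filter step concrete, here is a hedged sketch on a two-element vector `v` (an illustrative name, with values taken from the data section). Note the assumption it relies on: the text outside the parentheses contains no digits, because grep keeps every piece that has one.

```r
v <- c("ABC (123-423-389) GR (124-233-848) AK", "(123-432-423)")

# Split on either parenthesis, then keep only the pieces containing a digit.
parts <- lapply(strsplit(v, "[()]"), function(x) grep("\\d", x, value = TRUE))
print(parts)
```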

Or we can use stri_split_regex from stringi, which can also drop the empty strings (omit_empty=TRUE). Here the split pattern also includes capital letters and spaces, so only the numeric part remains.

 library(stringi)
 stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
 #[1] "123-456-789"

 stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)

Another option is rm_round from qdapRegex if we are interested in extracting the contents inside the brackets.

 library(qdapRegex)
 rm_round(str1, extract=TRUE)[[1]]
 #[1] "123-456-789"
 rm_round(str2, extract=TRUE)

data

 str1 <- "A B C (123-456-789)"
 str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
           "(123-423-498) ABCDD",
           "(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")

Cath · Answer

Or with sub from base R:

sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"

Explanation:

[^(]+ : matches one or more characters that are not an opening bracket
\\( : matches an opening bracket, which is just before what you want
([^)]+) : matches the pattern you want to capture (which is then retrieved in replacement="\\1"), which is anything except a closing bracket
\\).* : matches a closing bracket followed by any character, 0 or more times
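One consequence of the leading `[^(]+` is worth checking: it needs at least one character before the opening bracket. The sketch below compares it with a hypothetical variant using `*` instead (`pat_star` is not from the original answer). When sub finds no match, it returns the input unchanged.

```r
pat_plus <- "[^(]+\\(([^)]+)\\).*"   # requires text before the "("
pat_star <- "[^(]*\\(([^)]+)\\).*"   # hypothetical variant: text before "(" is optional

with_prefix <- "A B C (123-456-789)"
no_prefix   <- "(123-432-423)"

res_prefix <- sub(pat_plus, "\\1", with_prefix)
res_plus   <- sub(pat_plus, "\\1", no_prefix)   # no match, input comes back as is
res_star   <- sub(pat_star, "\\1", no_prefix)

print(c(res_prefix, res_plus, res_star))
```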

Another option with look-ahead and look-behind

sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"

Pierre L · Answer

The capture groups in sub will target your desired output:

sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"

Extra check to make sure I pass @akrun's extended example:

sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
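Note that for the last element only the second group comes back, because the leading `.*` is greedy. The following sketch, with illustrative names and not part of the original answer, contrasts the greedy pattern with a lazy one and with a gregexpr call that collects every group. The lazy and lookaround versions assume `perl = TRUE`.

```r
s <- "ABC (123-423-389) GR (124-233-848) AK"

# Greedy: the first .* swallows as much as it can, so the last group wins.
last_group <- sub(".*\\((.*)\\).*", "\\1", s)

# Lazy quantifiers stop at the first pair of parentheses instead.
first_group <- sub(".*?\\((.*?)\\).*", "\\1", s, perl = TRUE)

# gregexpr() finds every match; regmatches() extracts them all.
all_groups <- regmatches(s, gregexpr("(?<=\\()[^()]*(?=\\))", s, perl = TRUE))[[1]]

print(last_group)
print(first_group)
print(all_groups)
```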

Avinash Raj · Answer

You may try these gsub calls, where x is the input string.

> gsub("[^\\d-]", "", x, perl=T)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"

Few more...

> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=T)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=T)
[1] "123-456-789"
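The patterns that delete every character other than digits and - ignore the parentheses entirely, so they assume that nothing else in the string is a digit or a hyphen and that there is only one group. A short sketch of where that assumption breaks (`strip` is a hypothetical helper name, and the third input is made up for illustration):

```r
# Hypothetical helper wrapping one of the calls above.
strip <- function(x) gsub("[^0-9-]", "", x)

single  <- strip("A B C (123-456-789)")
two     <- strip("ABC (123-423-389) GR (124-233-848) AK")  # groups run together
outside <- strip("Room 7 (123-456-789)")                   # stray digit is kept

print(c(single, two, outside))
```

When either case can occur, a pattern anchored on the parentheses is the safer choice.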

Tags: regex, r

David Z
