
strsplit by parentheses [duplicate]

Tags: regex, r

Suppose I have a string like "A B C (123-456-789)". What is the best way to retrieve "123-456-789" from it? Splitting on the opening parenthesis leaves the closing one attached:

strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C " "123-456-789)"
David Z asked Jul 08 '15 12:07


4 Answers

If we want to extract the digits and hyphens between the parentheses, one option is str_extract from stringr. If a string contains multiple matches, use str_extract_all.

 library(stringr)
 str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
 #[1] "123-456-789"
 str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')

In the code above, regex lookarounds restrict the match to the digits and hyphens. The positive lookbehind (?<=\\() requires the match to be immediately preceded by an opening parenthesis, so [0-9-]+ matches the 123-456-789 in (123-456-789 but not a bare 123-456-789. Similarly, the positive lookahead (?=\\)) requires the match to be immediately followed by a closing parenthesis, as in 123-456-789). Taken together, the pattern matches only text enclosed on both sides, as in (123-456-789), and returns just the part between the lookarounds; it does not match (123-456-789 or 123-456-789).
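
If stringr is not available, the same lookaround pattern can be sketched in base R with regmatches and gregexpr. Note that perl=TRUE is needed, since lookarounds are only supported by the PCRE engine in base R. The names s, pattern and hits are illustrative and not part of the original answer.

```r
# Base R sketch of the lookaround extraction (no packages needed).
pattern <- "(?<=\\()[0-9-]+(?=\\))"
s <- c("A B C (123-456-789)",
       "ABC (123-423-389) GR (124-233-848) AK",
       "(123-456-789")                       # unbalanced: should not match

# gregexpr() locates every match per string; regmatches() extracts them.
hits <- regmatches(s, gregexpr(pattern, s, perl = TRUE))
hits
```

The result is a list with one character vector per input string; the unbalanced third string yields an empty vector.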

With strsplit, you can specify the split pattern as [()]. Placing the parentheses inside a character class [] makes them literal characters; otherwise we would have to escape them ('\\(|\\)').

 strsplit(str1, '[()]')[[1]][2]
 #[1] "123-456-789"
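
As a quick check of the point about escaping, here is a small sketch comparing the character-class split with the escaped alternation. The variable names are made up for illustration.

```r
s <- "A B C (123-456-789)"

parts_class <- strsplit(s, "[()]")[[1]]      # parentheses inside a character class
parts_alt   <- strsplit(s, "\\(|\\)")[[1]]   # escaped parentheses with alternation

# Both patterns split at the same places, so the pieces are identical.
same <- identical(parts_class, parts_alt)
same
parts_class[2]
```

The first piece keeps its trailing space ("A B C "), which is why the answer indexes the second element.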

If there are multiple substrings to extract from a string, we can loop over the split results with lapply and keep the parts that contain digits with grep.

 lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))

Or we can use stri_split_regex from stringi, which also has an option to drop the empty strings (omit_empty=TRUE).

 library(stringi)
 stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
 #[1] "123-456-789"

 stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)

Another option is rm_round from qdapRegex, if we are interested in extracting the contents inside the parentheses.

 library(qdapRegex)
 rm_round(str1, extract=TRUE)[[1]]
 #[1] "123-456-789"
 rm_round(str2, extract=TRUE)

data

 str1 <- "A B C (123-456-789)"
 str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
           "(123-423-498) ABCDD", "(123-432-423)",
           "ABC (123-423-389) GR (124-233-848) AK")
akrun answered Nov 15 '22 15:11


Or with sub from base R:

sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"

Explanation:

[^(]+ : matches one or more characters that are not an opening parenthesis
\\( : matches the opening parenthesis, which sits just before the part you want
([^)]+) : captures the part you want (retrieved with "\\1" in the replacement), which is one or more characters that are not a closing parenthesis
\\).* : matches the closing parenthesis followed by any characters, zero or more times
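
One consequence of the pieces listed above: because [^(]+ needs at least one character before the opening parenthesis, a string that starts with "(" does not match, and sub returns it unchanged. A hedged sketch of the difference, using a [^(]* variant that is my own tweak and not part of the answer:

```r
x <- c("A B C (123-456-789)", "(123-432-423)", "no parentheses")

# Original pattern: requires one or more characters before "(".
plus <- sub("[^(]+\\(([^)]+)\\).*", "\\1", x)

# Variant with * (zero or more), so a leading "(" also matches.
star <- sub("[^(]*\\(([^)]+)\\).*", "\\1", x)

plus
star
```

In both cases a string without any parentheses comes back untouched, since sub only rewrites inputs that match.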

Another option, with look-behind and look-ahead:

sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"
Cath answered Nov 15 '22 14:11


The capture groups in sub will target your desired output:

sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"

Extra check against @akrun's extended example (for the last string, which contains two parenthesized groups, the greedy .* returns only the final one):

sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
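
To illustrate how greediness decides which group comes back when a string holds more than one, here is a small sketch. The lazy variant is my own addition for comparison, not something from the answer.

```r
x <- "ABC (123-423-389) GR (124-233-848) AK"

# Greedy .* runs to the last "(" in the string, so the last group is captured.
last_group <- sub(".*\\((.*)\\).*", "\\1", x)

# Lazy .*? stops at the first "(" and the first ")" after it.
first_group <- sub(".*?\\((.*?)\\).*", "\\1", x, perl = TRUE)

last_group
first_group
```

sub can only return one piece per string; to collect every group, something like gregexpr with regmatches or str_extract_all is a better fit.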
Pierre L answered Nov 15 '22 14:11


You may try these gsub calls, assuming x holds the string from the question.

> x <- "A B C (123-456-789)"
> gsub("[^\\d-]", "", x, perl=TRUE)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"

A few more:

> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=TRUE)
[1] "123-456-789"
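
One caveat worth checking before relying on the "strip everything that is not a digit or hyphen" patterns: they keep digits that appear outside the parentheses too. A minimal sketch with a made-up input string:

```r
x2 <- "Room 12 (123-456-789)"   # hypothetical input with digits outside the parentheses

# Deleting every non-digit, non-hyphen character also keeps the "12".
stripped <- gsub("[^0-9-]", "", x2)

# Anchoring on the parentheses keeps only what is between them.
inside <- sub(".*\\(([^)]*)\\).*", "\\1", x2)

stripped
inside
```

For strings shaped like the one in the question, both approaches give the same answer.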
Avinash Raj answered Nov 15 '22 16:11