Split & extract part of string (between a "." and digit) in R

Question

I have a character variable (companies) with observations that look like this:

"612. Grt. Am. Mgt. & Inv. 7.33"
"77. Wickes 4.61"
"265. Wang Labs 8.75"
"9. CrossLand Savings 6.32"
"228. JPS Textile Group 2.00"

I'm trying to split these strings into 3 parts:

all the digits before the first ".",
everything between the first "." and the next number (consistently formatted #.##), and
that last number itself (format #.##).

Using the first obs as an example, I'd like: "612", "Grt. Am. Mgt. & Inv.", "7.33"

I've tried defining the pattern in rebus and using str_match, but the code below only works on cases like obs #2 and #3. It doesn't reflect all the variation in the middle part of the string to capture the other obs.

pattern2 <- capture(one_or_more(DGT)) %R% DOT %R% SPC %R% 
            capture(or(one_or_more(WRD), one_or_more(WRD) %R% SPC 
            %R% one_or_more(WRD))) %R% SPC %R% capture(DGT %R% DOT 
            %R% one_or_more(DGT))

str_match(companies, pattern = pattern2)

Is there a better way to split the strings into these 3 parts?

I'm not familiar with regex, but I've seen that suggested here a lot (I'm brand new to R and Stack Overflow)

Ulises Rosas-Puchuri · Accepted Answer

You can use a regex to insert a delimiter between the three parts of each string, and then split on that delimiter to get your result:

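# Backslashes must be doubled inside R strings: \\. is a literal dot, \\1 is the first capture group
delimitedString = gsub( "^([0-9]+)\\. (.*) ([0-9.]+)$", "\\1,\\2,\\3", companies  )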

do.call( 'rbind', strsplit(split = ",", x = delimitedString) )
#      [,1]  [,2]                   [,3]  
#[1,] "612" "Grt. Am. Mgt. & Inv." "7.33"
#[2,] "77"  "Wickes"               "4.61"
#[3,] "265" "Wang Labs"            "8.75"
#[4,] "9"   "CrossLand Savings"    "6.32"
#[5,] "228" "JPS Textile Group"    "2.00"

Regex explanation:

^[0-9]+ : one or more digits (0 to 9) at the beginning (i.e. ^) of the string
.* : greedy match of any characters; here it ends up capturing everything between the space after the leading number and the last space before the final number
[0-9.]+$ : one or more digits or dots at the end (i.e. $) of the string
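To see what each part of the pattern captures, here is a minimal sketch on the first observation from the question. It assumes a pattern in which the dot after the leading number is escaped as `\\.` so that it matches only a literal dot; the variable names (`x`, `id`, `name`, `value`, `loose`, `strict`) are purely illustrative.

```r
# Sample input copied from the question.
x <- "612. Grt. Am. Mgt. & Inv. 7.33"

# Three capture groups: leading digits, middle text, trailing number.
pattern <- "^([0-9]+)\\. (.*) ([0-9.]+)$"

# Each sub() call replaces the whole string with one capture group.
id    <- sub(pattern, "\\1", x)
name  <- sub(pattern, "\\2", x)
value <- sub(pattern, "\\3", x)
print(c(id, name, value))

# An unescaped "." matches any character, so it also accepts "612x ".
loose  <- grepl("^[0-9]+. ",   "612x rest")
strict <- grepl("^[0-9]+\\. ", "612x rest")
print(c(loose, strict))
```

Because `.*` is greedy, it first takes the rest of the string and then gives characters back until the final ` ([0-9.]+)$` can match, which is why the dots inside the company name stay in the middle group.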

Parentheses mark the parts of the string that the regex should capture. The captured substrings are then written back joined by commas, using the backreferences in the replacement. Finally, strsplit splits each delimited string on the commas, and do.call with rbind binds the pieces into the rows of a matrix.
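The comma-delimited approach would misalign the columns if a company name itself contained a comma. As an alternative sketch (not part of the original answer), base R's regexec and regmatches can return the capture groups directly, with no intermediate delimiter. The column names `id`, `name` and `value` are assumptions chosen for illustration, and the last group is tightened to digits, a dot, then digits to reflect the #.## format described in the question.

```r
# Sample data copied from the question.
companies <- c(
  "612. Grt. Am. Mgt. & Inv. 7.33",
  "77. Wickes 4.61",
  "265. Wang Labs 8.75",
  "9. CrossLand Savings 6.32",
  "228. JPS Textile Group 2.00"
)

pattern <- "^([0-9]+)\\. (.*) ([0-9]+\\.[0-9]+)$"

# regexec() locates the full match and every capture group;
# regmatches() extracts them as one character vector per input string.
parts <- do.call(rbind, regmatches(companies, regexec(pattern, companies)))

# Column 1 holds the full match; columns 2 to 4 hold the capture groups.
result <- data.frame(
  id    = as.integer(parts[, 2]),
  name  = parts[, 3],
  value = as.numeric(parts[, 4]),
  stringsAsFactors = FALSE
)
print(result)
```

This sketch assumes every string matches the pattern. A non-matching string yields an empty vector from regmatches, so it would need to be filtered out or handled before the rbind step.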

Tags:

regex

r

stringr

Nina
