I have a character variable (companies) with observations that look like this:
I'm trying to split these strings into 3 parts: the digits before the first ".", everything between the first "." and the next number (consistently formatted #.##), and the #.## number itself. Using the first obs as an example, I'd like: "612", "Grt. Am. Mgt & Inv", "5.01"
I've tried defining the pattern in rebus and using str_match, but the code below only works on cases like obs #2 and #3. It doesn't cover all the variation in the middle part of the string, so the other obs aren't captured.
library(rebus)    # capture(), one_or_more(), or(), DGT, DOT, SPC, WRD, %R%
library(stringr)  # str_match()
pattern2 <- capture(one_or_more(DGT)) %R% DOT %R% SPC %R%
  capture(or(one_or_more(WRD), one_or_more(WRD) %R% SPC %R% one_or_more(WRD))) %R%
  SPC %R% capture(DGT %R% DOT %R% one_or_more(DGT))
str_match(companies, pattern = pattern2)
Is there a better way to split the strings into these 3 parts? I'm not familiar with regex, but I've seen that suggested here a lot (I'm brand new to R and Stack Overflow).
You can use a regex to insert a delimiter between the three parts of each string, and then split the strings on that delimiter to get your result:
# \\. matches a literal period; a bare . would match any character
delimitedString = gsub( "^([0-9]+)\\. (.*) ([0-9.]+)$", "\\1,\\2,\\3", companies )
do.call( 'rbind', strsplit(split = ",", x = delimitedString) )
#     [,1]  [,2]                   [,3]
#[1,] "612" "Grt. Am. Mgt. & Inv." "7.33"
#[2,] "77"  "Wickes"               "4.61"
#[3,] "265" "Wang Labs"            "8.75"
#[4,] "9"   "CrossLand Savings"    "6.32"
#[5,] "228" "JPS Textile Group"    "2.00"
Regex explanation:
^[0-9]+ : one or more digits from 0 to 9 at the beginning (i.e. ^) of your string
.* : greedy match, basically anything between the two spaces in the case above
[0-9.]+$ : again digits, plus a period, at the end (i.e. $) of your string
Parentheses indicate which parts of the string matched by the regex I want to capture. Once captured, those substrings are pasted back together, delimited by commas. Finally, we split the whole string with the strsplit function and bind the rows with the do.call function.
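As a variation on the approach above, the capture groups can also be pulled out directly, without going through a comma-delimited string (which would split in the wrong place if a company name ever contained a comma). This is only a sketch in base R: the sample companies vector is an assumption, reconstructed from the output printed above, and the last group is tightened to digits, a period, digits to mirror the #.## format.

```r
# Sample input is an assumption, reconstructed from the output shown above.
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33",
               "77. Wickes 4.61",
               "265. Wang Labs 8.75")

# Three capture groups: leading digits, the name, and the trailing #.## number.
pattern <- "^([0-9]+)\\. (.*) ([0-9]+\\.[0-9]+)$"

# regexec() locates the full match and each group;
# regmatches() extracts them as one character vector per string.
matches <- regmatches(companies, regexec(pattern, companies))

# Bind the vectors into a matrix and drop column 1 (the full match).
parts <- do.call(rbind, matches)[, -1, drop = FALSE]
parts
```

Since the question already uses str_match, stringr::str_match(companies, pattern) with the same plain regex should return the same groups (again with the full match in the first column), so rebus is not strictly needed here.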