Split & extract part of string (between a "." and digit) in R

Tags: regex, r, stringr

I have a character variable (companies) with observations that look like this:

  1. "612. Grt. Am. Mgt. & Inv. 7.33"
  2. "77. Wickes 4.61"
  3. "265. Wang Labs 8.75"
  4. "9. CrossLand Savings 6.32"
  5. "228. JPS Textile Group 2.00"

I'm trying to split these strings into 3 parts:

  1. all the digits before the first ".",
  2. everything between the first "." and the final number (consistently formatted #.##), and
  3. that final number itself (format #.##).

Using the first observation as an example, I'd like: "612", "Grt. Am. Mgt. & Inv.", "7.33"

I've tried defining the pattern with rebus and using str_match, but the code below only works on cases like observations #2 and #3. It doesn't cover all the variation in the middle part of the string, so it misses the other observations.

pattern2 <- capture(one_or_more(DGT)) %R% DOT %R% SPC %R% 
            capture(or(one_or_more(WRD), one_or_more(WRD) %R% SPC 
            %R% one_or_more(WRD))) %R% SPC %R% capture(DGT %R% DOT 
            %R% one_or_more(DGT))

str_match(companies, pattern = pattern2)

Is there a better way to split the strings into these 3 parts?

I'm not familiar with regex, but I've seen that suggested here a lot (I'm brand new to R and Stack Overflow)

Nina asked Feb 19 '19 04:02

1 Answer

You can use a regex to rewrite each string with a delimiter between the three parts, and then split the delimited strings to get your result:

# "\\." matches a literal period; an unescaped "." would match any character
delimitedString <- gsub("^([0-9]+)\\. (.*) ([0-9.]+)$", "\\1,\\2,\\3", companies)

do.call( 'rbind', strsplit(split = ",", x = delimitedString) )
#      [,1]  [,2]                   [,3]  
#[1,] "612" "Grt. Am. Mgt. & Inv." "7.33"
#[2,] "77"  "Wickes"               "4.61"
#[3,] "265" "Wang Labs"            "8.75"
#[4,] "9"   "CrossLand Savings"    "6.32"
#[5,] "228" "JPS Textile Group"    "2.00" 

Regex explanation:

  • ^[0-9]+ : one or more digits (0 to 9) at the beginning (i.e. ^) of the string
  • .* : a greedy match of any characters; in the case above, everything between the space after the first period and the space before the final number
  • [0-9.]+$ : one or more digits or periods at the end (i.e. $) of the string
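To see what each piece matches on its own, here is a minimal base R sketch run against the first observation. The variable names (x, leading, trailing, middle) are made up for illustration, and the period after the leading digits is escaped so that it only matches a literal ".".

```r
x <- "612. Grt. Am. Mgt. & Inv. 7.33"

# ^[0-9]+ : the run of digits anchored at the start of the string
leading <- regmatches(x, regexpr("^[0-9]+", x))

# [0-9.]+$ : the run of digits and periods anchored at the end of the string
trailing <- regmatches(x, regexpr("[0-9.]+$", x))

# (.*) : greedy, but the surrounding anchors force it to stop
# at the last space before the final number
middle <- sub("^[0-9]+\\. (.*) [0-9.]+$", "\\1", x)

print(c(leading, middle, trailing))
```

Because the whole pattern is anchored with ^ and $, the greedy .* cannot swallow the final number, even though the company name itself contains periods.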

Parentheses mark the parts of the string that I want to capture. Once captured, those substrings are joined back together, delimited by commas. Finally, we split each delimited string with the strsplit function and bind the resulting rows with the do.call function.
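One caveat with the comma delimiter is that a company name that itself contains a comma would be split into extra pieces. As a possible alternative (a sketch, not part of the approach above), base R's regexec and regmatches can return the capture groups directly, with no intermediate delimiter. The stricter final group [0-9]+\\.[0-9]{2} assumes the last number always follows the #.## format described in the question, and the column names are made up for illustration.

```r
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33",
               "77. Wickes 4.61",
               "265. Wang Labs 8.75",
               "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Three capture groups: leading digits, the name, and the final #.## number
pattern <- "^([0-9]+)\\. (.*) ([0-9]+\\.[0-9]{2})$"

# regexec() records the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings
matches <- regmatches(companies, regexec(pattern, companies))

# Bind one row per string and drop the first column (the full match)
parts <- do.call(rbind, matches)[, -1, drop = FALSE]
colnames(parts) <- c("rank", "name", "value")  # illustrative names

print(parts)
```

Note that a string that does not match the pattern yields an empty result and is silently left out by rbind, so it is worth comparing nrow(parts) with length(companies) on real data.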

Ulises Rosas-Puchuri answered Sep 28 '22 07:09