
Truncating the end of a string in R after a character that can be present zero or more times

Tags: string, r, truncate

I have the following data:

temp <- c("AIR BAGS:FRONTAL", "SERVICE BRAKES HYDRAULIC:ANTILOCK",
    "PARKING BRAKE:CONVENTIONAL",
    "SEATS:FRONT ASSEMBLY:POWER ADJUST",
    "POWER TRAIN:AUTOMATIC TRANSMISSION",
    "SUSPENSION",
    "ENGINE AND ENGINE COOLING:ENGINE",
    "SERVICE BRAKES HYDRAULIC:ANTILOCK",
    "SUSPENSION:FRONT",
    "ENGINE AND ENGINE COOLING:ENGINE",
    "VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")

I would like to create a new vector that retains only the text before the first ":" where a ":" is present, and the whole string where no ":" is present.

I have tried to use:

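library(stringr)
temp <- data.frame(matrix(unlist(str_split(temp, pattern = ":", n = 2)),
                          ncol = 2, byrow = TRUE))

but it does not work in the cases where there is no ":", because those strings yield one piece instead of two and the matrix rows no longer line up.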

I know this question is very similar to: truncate string from a certain character in R, which used:

sub("^[^.]*", "", x)

But I am not very familiar with regular expressions and have struggled to reverse that example to retain only the beginning of the string.

Tony M. asked Jun 04 '12 15:06

1 Answer

You can solve this with a simple regex, where x is the character vector from the question:

sub("(.*?):.*", "\\1", x)
 [1] "AIR BAGS"                  "SERVICE BRAKES HYDRAULIC"  "PARKING BRAKE"             "SEATS"                    
 [5] "POWER TRAIN"               "SUSPENSION"                "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC" 
 [9] "SUSPENSION"                "ENGINE AND ENGINE COOLING" "VISIBILITY"     
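
For a self-contained run, here is a minimal sketch that assumes x holds a shortened copy of the question's vector (the name x and the three sample strings are chosen only for illustration):

```r
# x is an assumed name for a shortened copy of the question's vector
x <- c("AIR BAGS:FRONTAL",
       "SEATS:FRONT ASSEMBLY:POWER ADJUST",
       "SUSPENSION")

# Keep only the text before the first ":". A string without ":" does not
# match the pattern, so sub() returns it unchanged.
result <- sub("(.*?):.*", "\\1", x)
print(result)
# [1] "AIR BAGS"   "SEATS"      "SUSPENSION"
```

Note that "SUSPENSION" passes through untouched, which is the case the str_split attempt in the question could not handle.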

How the regex works:

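  • "(.*?):.*" matches any run of characters (.*), made non-greedy by the trailing ?, followed by a colon and then the rest of the string (the second .*). The parentheses capture the part before the colon.
  • "\\1" substitutes the entire match with the text captured inside the parentheses.
  • If a string contains no colon, the pattern does not match and sub() returns the string unchanged.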

The bit to understand is that regex quantifiers are greedy by default: they match as much as possible. Making the first .* non-greedy means it matches as little as possible, so the captured group stops at the first colon and cannot include it. The .* after the colon is back to the default, i.e. greedy, and consumes the rest of the string.
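
To make the greedy versus non-greedy difference concrete, here is a small sketch on a single string with two colons. The variable names are hypothetical, and the negated-character-class variant is an alternative added for comparison (it mirrors the "^[^.]*" idea quoted in the question), not part of the answer above:

```r
s <- "SEATS:FRONT ASSEMBLY:POWER ADJUST"

# Greedy group: (.*) grabs as much as it can, so it runs to the LAST colon
greedy <- sub("(.*):.*", "\\1", s)
print(greedy)
# [1] "SEATS:FRONT ASSEMBLY"

# Non-greedy group: (.*?) grabs as little as it can, so it stops at the FIRST colon
lazy <- sub("(.*?):.*", "\\1", s)
print(lazy)
# [1] "SEATS"

# Alternative: a negated character class that can never cross a colon
negated <- sub("^([^:]*).*", "\\1", s)
print(negated)
# [1] "SEATS"
```

The negated class reaches the same result without relying on a lazy quantifier, which may be easier to read for anyone less familiar with greedy matching.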

Andrie answered Oct 23 '22 02:10
