
Split strings at the first colon

Tags: string, regex, r, gsub

I am reading data files in text format using readLines. The first 'column' is complicated text that I do not need. The next columns contain data that I do need. The first 'column' and the data are separated by a colon (:). I wish to split each row at the first colon and delete the resulting text string, keeping only the data.

Below is an example data file. One potential complication is that one line of data contains multiple colons. That line may at some point become my header. So, I probably should not split at every colon, just at the first colon.

my.data <- "first string of text..:  aa : bb : cc 
            next string ........  :   2    0    2
            third string......1990:   7    6    5
            last string           :   4    2    3"

my.data2 <- readLines(textConnection(my.data))
my.data2

I have tried code presented here:

Split on first comma in string

and here:

R: removing the last three dots from a string

The code at the first link above seems to split only at the first colon of the first row. The code at the second link will probably do what I want, but so far it has been too complex for me to modify successfully.

Here are the data I hope to obtain, at which point I can simply replace the remaining colons in the first row with empty spaces using a very simple gsub statement:

   aa : bb : cc 
    2    0    2
    7    6    5
    4    2    3

Sorry if this is a duplicate of a post I have not located and thank you for any advice or assistance.

Mark Miller asked Sep 02 '12 05:09



1 Answer

The following starts at the beginning of the string, matches everything up to and including the first colon plus any whitespace that follows, and replaces it with nothing (effectively removing it):

gsub("^[^:]+:\\s*", "", my.data2)

If you want to keep the spaces after the colon, you could do:

gsub("^[^:]+:", "", my.data2)
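To make the difference between the two calls concrete, here is a small sketch on a two-element vector. The vector `x` is a hypothetical stand-in for `my.data2`, trimmed so the results are easy to read.

```r
# Hypothetical sample lines standing in for my.data2
x <- c("first string of text..:  aa : bb : cc",
       "next string ........  :   2    0    2")

# Remove the prefix, the first colon, and the whitespace after it
trimmed <- gsub("^[^:]+:\\s*", "", x)
trimmed
# [1] "aa : bb : cc" "2    0    2"

# Remove only the prefix and the first colon, keeping the spaces
untrimmed <- gsub("^[^:]+:", "", x)
untrimmed
# [1] "  aa : bb : cc" "   2    0    2"
```

Note that the later colons in the first element are left alone in both cases, because the pattern is anchored to the start of the string.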

To clarify what the original regular expression is doing, starting at the beginning:

^ this says to only find matches at the start of the string

[^:] this represents any single character that is not a colon

+ this says to match the preceding item one or more times (so match as many non-colon characters as possible)

: this is what actually matches the colon

\\s this matches any whitespace character (the backslash is doubled because it has to be escaped inside an R string)

* this says to match the preceding item zero or more times (so we remove any whitespace after the colon)
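One way to check the breakdown above is to extract the matched portion instead of deleting it. This is a minimal sketch using base R's `regexpr()` and `regmatches()`; the string `x` is one sample line copied from the question.

```r
# One sample line from the question's data
x <- "third string......1990:   7    6    5"

# regexpr() locates the first match; regmatches() extracts it
matched <- regmatches(x, regexpr("^[^:]+:\\s*", x))
matched
# [1] "third string......1990:   "
```

The extracted text is exactly what the `gsub` call would remove: the prefix, the first colon, and the three spaces after it.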

So, putting it all together: we start at the beginning of the string, match as many non-colon characters as possible, then match the first colon and any whitespace after it, and replace all of that with nothing (essentially removing all of the junk we don't want).
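For completeness, here is a sketch of how the whole task from the question might look end to end. It assumes the first line is meant to become the header, as the question suggests; the names `con`, `cleaned`, and `df`, and the use of `read.table(text = ...)`, are illustrative choices rather than part of the original answer.

```r
# Sample data from the question (leading indentation dropped for brevity)
my.data <- "first string of text..:  aa : bb : cc
next string ........  :   2    0    2
third string......1990:   7    6    5
last string           :   4    2    3"

# Read the lines and close the connection afterwards
con <- textConnection(my.data)
my.data2 <- readLines(con)
close(con)

# Step 1: drop everything up to and including the first colon
cleaned <- gsub("^[^:]+:\\s*", "", my.data2)

# Step 2: replace the remaining colons (header row only) with spaces
cleaned <- gsub(":", " ", cleaned, fixed = TRUE)

# Step 3 (assumption): treat the first line as a header and parse the rest
df <- read.table(text = cleaned, header = TRUE)
df
```

If the first line should not become a header, stopping after step 1 gives the output requested in the question.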

Dason answered Oct 12 '22 23:10
