
split string at regular intervals

Tags: string, r

I would like to split a string at regular intervals. My question is virtually identical to "How to split a string into substrings of a given length?", except that I have a column of strings in a data set instead of a single string.

Here is an example data set:

df = read.table(text = "
my.id   X1    
010101   1
010102   1
010103   1
010104   1
020101   1
020112   1
021701   0
021802   0
133301   0
133302   0  
241114   0
241215   0
", header = TRUE, colClasses=c('character', 'numeric'), stringsAsFactors = FALSE)

Here is the desired result. I would prefer to remove the leading zeroes, as shown:

desired.result = read.table(text = "
A1 A2 A3   X1
 1  1  1   1
 1  1  2   1
 1  1  3   1
 1  1  4   1
 2  1  1   1
 2  1 12   1
 2 17  1   0
 2 18  2   0
13 33  1   0
13 33  2   0
24 11 14   0
24 12 15   0
", header = TRUE, colClasses=c('numeric', 'numeric', 'numeric', 'numeric'), stringsAsFactors = FALSE)

Here is a loop that comes close, and I could probably use it. However, I suspect there is a more efficient way.

for(i in 1:nrow(df)) {
     print(substring(df$my.id[i], seq(1, 5, 2), seq(2, 6, 2)))
}

This apply statement does not work:

apply(df$my.id, 1,  function(x) substring(df$my.id[x], seq(1, 5, 2), seq(2, 6, 2))   )
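
The apply() call fails because apply() works over the margins of a matrix or array, while df$my.id is a plain character vector with no dim attribute (R reports that dim(X) must have a positive length). Since substr() is already vectorised over strings, one possible base R sketch loops over the start positions instead of the rows. It assumes every ID is exactly six characters; the names ids, starts and parts are illustrative only.

```r
# Illustrative IDs, a subset of the example data
ids <- c("010101", "020112", "133302", "241215")

# Start position of each two-character chunk
starts <- seq(1, 5, by = 2)

# substr() handles all strings at once, so iterate over positions, not rows;
# as.integer() drops the leading zeroes
parts <- sapply(starts, function(s) as.integer(substr(ids, s, s + 1)))
colnames(parts) <- paste0("A", seq_along(starts))
parts
```

The result is an integer matrix with one row per ID, which can be attached to the remaining columns with cbind() or data.frame().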

Thank you for any suggestions. I prefer a solution in base R.

Asked by Mark Miller, Feb 19 '13 00:02


1 Answer

I find that read.fwf applied to a textConnection is the most efficient and easy-to-understand of the various ways one could approach this. It has the advantage of the automatic class detection that is built into the read.* functions, so the two-character pieces come back as numbers and the leading zeroes are dropped.

cbind( read.fwf(file=textConnection(df$my.id), 
              widths=c(2,2,2), col.names=paste0("A", 1:3)), 
     X1=df$X1)
#-----------
   A1 A2 A3 X1
1   1  1  1  1
2   1  1  2  1
3   1  1  3  1
4   1  1  4  1
5   2  1  1  1
6   2  1 12  1
7   2 17  1  0
8   2 18  2  0
9  13 33  1  0
10 13 33  2  0
11 24 11 14  0
12 24 12 15  0

(I believe I learned this from Gabor Grothendieck on R-help about 6 years ago.)
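
To make the class-detection point concrete, here is a small sketch rather than a definitive recipe. It relies on read.fwf passing extra arguments on to read.table, so colClasses can switch the conversion off when the leading zeroes should be kept as text. The object names ids, as_numbers and as_text are illustrative, and the connection is closed explicitly because read.fwf does not close a connection that was already open.

```r
ids <- c("010101", "020112", "133302")

# Default behaviour: each chunk is type-converted, so "01" becomes 1
con <- textConnection(ids)
as_numbers <- read.fwf(con, widths = c(2, 2, 2),
                       col.names = paste0("A", 1:3))
close(con)

# colClasses is handed to read.table and keeps the chunks as character
con <- textConnection(ids)
as_text <- read.fwf(con, widths = c(2, 2, 2),
                    col.names = paste0("A", 1:3),
                    colClasses = "character")
close(con)

sapply(as_numbers, class)  # column classes after automatic detection
as_text$A1                 # chunks with their leading zeroes intact
```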

If you prefer a regex strategy, this one inserts a tab after every two characters and passes the result to read.table. Very compact:

read.table(text=gsub('(.{2})','\\1\t',df$my.id) )
#---------
   V1 V2 V3
1   1  1  1
2   1  1  2
3   1  1  3
4   1  1  4
5   2  1  1
6   2  1 12
7   2 17  1
8   2 18  2
9  13 33  1
10 13 33  2
11 24 11 14
12 24 12 15
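
The regex version returns default column names (V1 to V3) and leaves out X1. A minimal sketch of how it might be completed to match the desired result, using a three-row stand-in for the question's df and the illustrative names tabbed and result:

```r
# Small stand-in for the question's data frame
df <- data.frame(my.id = c("010101", "020112", "241215"),
                 X1 = c(1, 1, 0),
                 stringsAsFactors = FALSE)

# Insert a tab after every two characters, then parse as whitespace-separated text
tabbed <- gsub("(.{2})", "\\1\t", df$my.id)
result <- cbind(read.table(text = tabbed, col.names = paste0("A", 1:3)),
                X1 = df$X1)
result
```

Passing col.names to read.table names the columns while parsing, so no separate renaming step is needed.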
Answered by IRTFM, Nov 09 '22 18:11