Chopping a string into a vector of fixed width character elements

Tags: r, strsplit

I have an object containing a text string:

x <- "xxyyxyxy"

and I want to split that into a vector with each element containing two letters:

[1] "xx" "yy" "xy" "xy"

It seems like strsplit should be my ticket, but since I have no regular expression fu, I can't figure out how to make this function chop the string into chunks the way I want. How should I do this?

JD Long asked Feb 11 '10 19:02


6 Answers

Using substring is the best approach:

substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))

But here's a solution with plyr:

library("plyr")
laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))

Shane answered Nov 05 '22 00:11


Here is a fast solution that splits the string into single characters, then pastes each odd-positioned character onto the even-positioned character that follows it.

x <- "xxyyxyxy"
sst <- strsplit(x, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
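
To see why the logical indexing works, it can help to look at the intermediate vectors. The sketch below is only an illustration (the names chars, odd, and even are not from the answer), and it assumes the string has an even number of characters. With an odd length, paste0 would recycle the shorter vector and the last element would be wrong.

```r
x <- "xxyyxyxy"
chars <- strsplit(x, "")[[1]]   # one element per character

# c(TRUE, FALSE) is recycled along the vector, keeping positions 1, 3, 5, 7
odd <- chars[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6, 8
even <- chars[c(FALSE, TRUE)]

# Element-wise paste joins position 1 with 2, 3 with 4, and so on
pairs <- paste0(odd, even)
print(pairs)
```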

Benchmark Setup:

library(microbenchmark)

GSee <- function(x) {
  sst <- strsplit(x, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

Shane1 <- function(x) {
  substring(x, seq(1,nchar(x),2), seq(2,nchar(x),2))
}

library("plyr")
Shane2 <- function(x) {
  laply(seq(1,nchar(x),2), function(i) substr(x, i, i+1))
}

seth <- function(x) {
  strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]
}

geoffjentry <- function(x) {
  idx <- 1:nchar(x)  
  odds <- idx[(idx %% 2) == 1]  
  evens <- idx[(idx %% 2) == 0]  
  substring(x, odds, evens)  
}

drewconway <- function(x) {
  # 'chars' instead of 'c' so the base function c() is not shadowed
  chars <- strsplit(x, "")[[1]]
  sapply(seq(2, nchar(x), by = 2), function(y) paste(chars[y - 1], chars[y], sep = ""))
}

KenWilliams <- function(x) {
  n <- 2
  sapply(seq(1,nchar(x),by=n), function(xx) substr(x, xx, xx+n-1))
}

RichardScriven <- function(x) {
  regmatches(x, gregexpr("(.{2})", x))[[1]]
}

Benchmark 1:

x <- "xxyyxyxy"

microbenchmark(
  GSee(x),
  Shane1(x),
  Shane2(x),
  seth(x),
  geoffjentry(x),
  drewconway(x),
  KenWilliams(x),
  RichardScriven(x)
)

# Unit: microseconds
#               expr      min        lq    median        uq      max neval
#            GSee(x)    8.032   12.7460   13.4800   14.1430   17.600   100
#          Shane1(x)   74.520   80.0025   84.8210   88.1385  102.246   100
#          Shane2(x) 1271.156 1288.7185 1316.6205 1358.5220 3839.300   100
#            seth(x)   36.318   43.3710   45.3270   47.5960   67.536   100
#     geoffjentry(x)    9.150   13.5500   15.3655   16.3080   41.066   100
#      drewconway(x)   92.329   98.1255  102.2115  105.6335  115.027   100
#     KenWilliams(x)   77.802   83.0395   87.4400   92.1540  163.705   100
#  RichardScriven(x)   55.034   63.1360   65.7545   68.4785  108.043   100

Benchmark 2:

Now, with bigger data.

x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace=TRUE), collapse="")

microbenchmark(
  GSee(x),
  Shane1(x),
  Shane2(x),
  seth(x),
  geoffjentry(x),
  drewconway(x),
  KenWilliams(x),
  RichardScriven(x),
  times=3
)

# Unit: milliseconds
#               expr          min            lq       median            uq          max neval
#            GSee(x)    29.029226    31.3162690    33.603312    35.7046155    37.805919     3
#          Shane1(x) 11754.522290 11866.0042600 11977.486230 12065.3277955 12153.169361     3
#          Shane2(x) 13246.723591 13279.2927180 13311.861845 13371.2202695 13430.578694     3
#            seth(x)    86.668439    89.6322615    92.596084    92.8162885    93.036493     3
#     geoffjentry(x) 11670.845728 11681.3830375 11691.920347 11965.3890110 12238.857675     3
#      drewconway(x)   384.863713   438.7293075   492.594902   515.5538020   538.512702     3
#     KenWilliams(x) 12213.514508 12277.5285215 12341.542535 12403.2315015 12464.920468     3
#  RichardScriven(x) 11549.934241 11730.5723030 11911.210365 11989.4930080 12067.775651     3

GSee answered Nov 05 '22 01:11


How about

strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]

Basically, insert a separator (here " ") after every two characters, and then use strsplit to split on that separator.
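
To make the two steps visible, here is the same idea with the intermediate string kept in a variable (the names marked and pieces are only for illustration). This sketch assumes the input contains only alphanumeric characters and no spaces, since a space already present in the string would be treated as a separator too.

```r
x <- "xxyyxyxy"

# Step 1: append a space after every run of two alphanumeric characters
marked <- gsub("([[:alnum:]]{2})", "\\1 ", x)
print(marked)

# Step 2: split on the inserted spaces
pieces <- strsplit(marked, " ")[[1]]
print(pieces)
```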

seth answered Nov 05 '22 01:11


strsplit on its own is going to be problematic. Look at a regexp like this:

z <- 'xxyyxyxy'
strsplit(z, '[[:alnum:]]{2}')

It splits at the right points, but the matched pairs are consumed as delimiters, so nothing useful is left.

You could use substring & friends instead:

idx <- 1:nchar(z)  
odds <- idx[(idx %% 2) == 1]  
evens <- idx[(idx %% 2) == 0]  
substring(z, odds, evens)  
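
If you still want a regular expression, another way around the problem is to extract the matches rather than split on them. A minimal sketch using base R's gregexpr and regmatches, which is the technique the RichardScriven function in the benchmark above relies on (the variable names here are illustrative):

```r
z <- "xxyyxyxy"

# strsplit() treats each matched pair as a delimiter and discards it,
# so only empty strings remain
split_result <- strsplit(z, "[[:alnum:]]{2}")[[1]]
print(split_result)

# gregexpr() locates every match and regmatches() returns the matched text
match_result <- regmatches(z, gregexpr("[[:alnum:]]{2}", z))[[1]]
print(match_result)
```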

geoffjentry answered Nov 05 '22 01:11


Here's one way that doesn't use regexes:

a <- "xxyyxyxy"
n <- 2
sapply(seq(1,nchar(a),by=n), function(x) substr(a, x, x+n-1))
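
The same loop can be wrapped in a small reusable function. This is a sketch only: the name chop and its arguments are made up for illustration, vapply is swapped in for sapply so the result is always a character vector, and the input is assumed to be a single non-empty string. Because substr stops at the end of the string, a trailing chunk shorter than n is kept as is.

```r
# Hypothetical helper: split the string `a` into chunks of width `n`
chop <- function(a, n = 2) {
  starts <- seq(1, nchar(a), by = n)   # start position of each chunk
  vapply(starts, function(i) substr(a, i, i + n - 1), character(1))
}

print(chop("xxyyxyxy"))
print(chop("xxyyxyx", n = 3))
```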

Ken Williams answered Nov 05 '22 02:11


Careful with substring: if the string length is not a multiple of the chunk width n, the sequence of end positions comes out one element short and the final partial chunk is lost. Extend the second sequence by n - 1 to keep it:

substring(x,seq(1,nchar(x),n),seq(n,nchar(x)+n-1,n)) 
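
A short check of the difference, using a five-character string and n = 2 (the names naive and fixed are only for illustration, and the string is assumed to be at least n characters long):

```r
x <- "xxyyx"
n <- 2

# End positions stop at 4, so they are recycled and the last chunk is empty
naive <- substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n))
print(naive)

# Extending the end sequence by n - 1 gives one end position per start
fixed <- substring(x, seq(1, nchar(x), n), seq(n, nchar(x) + n - 1, n))
print(fixed)
```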

Mario S answered Nov 05 '22 00:11