I have a string such as "aabbccccdd". I want to break this string into a vector of substrings of length 2: "aa" "bb" "cc" "cc" "dd".


How to split a string into substrings of a given length? [duplicate]

5 Answers

Here is one way

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

or more generally

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"

Edit: This is much, much faster

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

It first splits the string into individual characters. Then it pastes each odd-positioned character (1st, 3rd, ...) together with the even-positioned character that follows it.
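To make the indexing step concrete, here is a minimal sketch of how the logical vectors are recycled along the character vector. The input "abcdef" and the names chars, odd, even and pairs are illustrative and not part of the original answer; the sketch assumes an even-length string.

```r
# Illustrative input (hypothetical, even length)
x <- "abcdef"
chars <- strsplit(x, "")[[1]]    # "a" "b" "c" "d" "e" "f"

# A short logical index is recycled along the whole vector
odd  <- chars[c(TRUE, FALSE)]    # positions 1, 3, 5: "a" "c" "e"
even <- chars[c(FALSE, TRUE)]    # positions 2, 4, 6: "b" "d" "f"

# paste0 is vectorized, so the pairs are joined element by element
pairs <- paste0(odd, even)       # "ab" "cd" "ef"
print(pairs)
```

With an odd-length input, odd and even have different lengths and paste0 recycles the shorter one, so the last element would be wrong.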

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0


answered Oct 01 '22 17:10

GSee

There are two easy possibilities:

s <- "aabbccccdd"

gregexpr and regmatches:

regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"

strsplit:

strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
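One caveat worth sketching: the pattern ".{2}" matches exactly two characters, so with gregexpr and regmatches a trailing single character is dropped when the string has odd length. Assuming you want to keep that remainder, the pattern ".{1,2}" is one possible variant. The input "abcde" and the variable names below are illustrative only.

```r
# Illustrative odd-length input (hypothetical)
s_odd <- "abcde"

# Exactly two characters per match: the trailing "e" is lost
exact <- regmatches(s_odd, gregexpr(".{2}", s_odd))[[1]]
print(exact)     # "ab" "cd"

# One or two characters per match (greedy): the remainder is kept
greedy <- regmatches(s_odd, gregexpr(".{1,2}", s_odd))[[1]]
print(greedy)    # "ab" "cd" "e"
```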

answered Oct 01 '22 19:10

Sven Hohenstein

string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)

# the indices where each substr will start
starts <- seq(1,num.chars, by=2)

# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})

Which gives

[1] "aa" "bb" "cc" "cc" "dd"
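Since substring is vectorized over its start and stop arguments, the same idea can be sketched without sapply and with a configurable chunk width. The helper name split_every and its parameter n are hypothetical, and the sketch assumes x is a single non-empty string.

```r
# Hypothetical helper: split x into chunks of n characters.
# Assumes x is a single, non-empty string.
split_every <- function(x, n) {
  starts <- seq(1, nchar(x), by = n)
  # A stop position past the end of the string is truncated by substring,
  # so the last chunk is simply shorter when nchar(x) is not a multiple of n
  substring(x, starts, starts + n - 1)
}

print(split_every("aabbccccdd", 2))   # "aa" "bb" "cc" "cc" "dd"
print(split_every("hello world", 3))  # "hel" "lo " "wor" "ld"
```

For very long strings, the timings in GSee's answer suggest that substring-based approaches can be much slower than splitting into characters first.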

answered Oct 01 '22 17:10

mindless.panda

One can use a matrix to group the characters:

s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}

s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
##   data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc')
## [1] "ab"

g2('abc')
## [1] "ab" "cb"

Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:

s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}

s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d" 
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"

(It is indeed slower than g2, but faster than g1 by about a factor of 7)

answered Oct 01 '22 18:10

Matthew Lundberg

Ugly but works

sequenceString <- "ATGAATAAAG"

J <- 3  # maximum sequence length in file
# full-length chunks of J characters
sequenceSmallVecStart <-
    substring(sequenceString, seq(1, nchar(sequenceString)-J+1, J),
              seq(J, nchar(sequenceString), J))
# leftover characters after the last full chunk
sequenceSmallVecEnd <-
    substring(sequenceString, max(seq(J, nchar(sequenceString), J))+1)
# drop the empty remainder when the length is a multiple of J
sequenceSmallVec <-
    Filter(nzchar, c(sequenceSmallVecStart, sequenceSmallVecEnd))
cat(sequenceSmallVec, sep = "\n")

This prints ATG, AAT, AAA and G, one per line.

answered Oct 01 '22 17:10

den2042


Tags: string, split, r

MadSeb
