
Split a string every n characters into new columns

Suppose I have a data frame like this, with a string vector var2:

var1  var2
1     abcdefghi 
2     abcdefghijklmnop
3     abc 
4     abcdefghijklmnopqrst

What is the most efficient way to split var2 every n characters into new columns, until the end of each string?

For example, splitting every 4 characters, the output would look like this:

var1  var2                  new_var1  new_var2 new_var3  new_var4  new_var5
1     abcdefghi             abcd      efgh     i 
2     abcdefghijklmnop      abcd      efgh     ijkl      mnop 
3     abc                   abc
4     abcdefghijklmnopqrst  abcd      efgh     ijkl      mnop      qrst 

Could this be done with the stringr package, using str_split_fixed?

Or with regular expressions:

gsub("(.{4})", "\\1 ", "abcdefghi")

The solution needs to create as many new columns (new_var1 up to new_var_n) as the length of var2 requires; var2 could be 10000 characters long, for example.
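
For reference, the gsub call above inserts a space after each complete group of 4 characters, so the result can be split on that space. A minimal base R sketch of that idea follows; it assumes var2 contains no spaces of its own, and the names x, spaced, pieces and mat are only illustrative:

```r
x <- c("abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst")

# Insert a space after every complete group of 4 characters
spaced <- gsub("(.{4})", "\\1 ", x)

# Split on the inserted spaces; strsplit drops a trailing empty piece
pieces <- strsplit(spaced, " ", fixed = TRUE)

# Pad every row with NA up to the longest row and bind into a matrix
width <- max(lengths(pieces))
mat <- t(sapply(pieces, function(p) c(p, rep(NA_character_, width - length(p)))))
colnames(mat) <- paste0("new_var", seq_len(width))
mat
```

The matrix can then be attached to the data frame with cbind. For very long strings the intermediate gsub copy is an extra cost, so this is meant as an illustration rather than the most efficient route.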

Mikey asked Aug 05 '18 09:08


4 Answers

Alternatively, you can try read.fwf in base R (dtf below is the data frame from the question). No special package is needed:

tmp <- read.fwf(
    textConnection(dtf$var2),
    widths = rep(4, ceiling(max(nchar(dtf$var2) / 4))),
    stringsAsFactors = FALSE)

cbind(dtf, tmp)

#   var1                 var2   V1   V2   V3   V4   V5
# 1    1            abcdefghi abcd efgh    i <NA> <NA>
# 2    2     abcdefghijklmnop abcd efgh ijkl mnop <NA>
# 3    3                  abc  abc <NA> <NA> <NA> <NA>
# 4    4 abcdefghijklmnopqrst abcd efgh ijkl mnop qrst
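
The same call can be written for an arbitrary chunk size, with the columns renamed to the new_var1, new_var2, ... pattern from the question. This is only a sketch: n, n_cols and res are illustrative names, and dtf is rebuilt here so the snippet runs on its own.

```r
dtf <- data.frame(
    var1 = 1:4,
    var2 = c("abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"),
    stringsAsFactors = FALSE)

n <- 4                                          # characters per chunk
n_cols <- ceiling(max(nchar(dtf$var2)) / n)     # columns needed for the longest string

tmp <- read.fwf(
    textConnection(dtf$var2),
    widths = rep(n, n_cols),
    stringsAsFactors = FALSE)

# Rename V1, V2, ... to the names used in the question
names(tmp) <- paste0("new_var", seq_len(n_cols))
res <- cbind(dtf, tmp)
res
```

Note that read.fwf hands the fields to read.table, so chunks that look like numbers or logical values may be type-converted unless column classes are specified.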
mt1022 answered Oct 14 '22 07:10


Here is one option with data.table and a helper function fixed_split that I took from this answer and slightly modified (it uses tstrsplit instead of strsplit).

library(data.table)
fixed_split <- function(text, n) {
  data.table::tstrsplit(text, paste0("(?<=.{",n,"})"), perl=TRUE)
}

First define n, the number of characters per chunk, and new_vars, the number of columns to add:

n <- 4
new_vars <- ceiling(max(nchar(df$var2)) / n)

setDT(df)[, paste0("new_var", seq_len(new_vars)) := fixed_split(var2, n = n)][]
#   var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
#1:    1            abcdefghi     abcd     efgh        i     <NA>     <NA>
#2:    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop     <NA>
#3:    3                  abc      abc     <NA>     <NA>     <NA>     <NA>
#4:    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst
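
To see what the lookbehind pattern does without data.table, the same regular expression can be passed to base strsplit, which tstrsplit wraps before transposing the result into columns. A small sketch (fixed_split_base and chunks are hypothetical names used only here):

```r
# Same pattern as fixed_split, but returning one character vector per string
fixed_split_base <- function(text, n) {
  strsplit(text, paste0("(?<=.{", n, "})"), perl = TRUE)
}

chunks <- fixed_split_base(c("abcdefghi", "abc"), n = 4)
chunks[[1]]      # pieces of the 9-character string
chunks[[2]]      # a string shorter than n is returned unsplit
lengths(chunks)  # number of pieces per string
```

Because the pieces have different lengths per row, tstrsplit fills the shorter rows with NA, which is where the <NA> entries in the output above come from.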
markus answered Oct 14 '22 08:10


Here is an alternative using strsplit and matrix coercion:

str_split_n <- function(x, n = 4) {
    sapply(x, function(ss) {
        nc <- nchar(as.character(ss))
        apply(matrix(replace(
            rep("", n * ceiling(nc / n)), 1:nc, unlist(strsplit(as.character(ss), ""))),
            nrow = n),
            2,
            paste0, collapse = "")
    })
}

library(dplyr)
library(tidyr)
df %>%
    mutate(tmp = str_split_n(var2)) %>%
    unnest(tmp) %>%  # name the list-column; a bare unnest() is deprecated in tidyr >= 1.0
    group_by(var1) %>%
    mutate(n = paste0("new_var", 1:n())) %>%
    spread(n, tmp)
## A tibble: 4 x 7
## Groups:   var1 [4]
#   var1 var2                 new_var1 new_var2 new_var3 new_var4 new_var5
#  <int> <fct>                <chr>    <chr>    <chr>    <chr>    <chr>
#1     1 abcdefghi            abcd     efgh     i        NA       NA
#2     2 abcdefghijklmnop     abcd     efgh     ijkl     mnop     NA
#3     3 abc                  abc      NA       NA       NA       NA
#4     4 abcdefghijklmnopqrst abcd     efgh     ijkl     mnop     qrst
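
The matrix step inside str_split_n may be easier to follow for a single string. The sketch below repeats it outside the function, with illustrative names (chars, padded, m): the characters are padded with empty strings to a multiple of n, laid out column by column in an n-row matrix, and each column is pasted back together.

```r
n <- 4
chars <- strsplit("abcdefghi", "")[[1]]          # one element per character

# Pad with "" so the length is a multiple of n (here 12)
padded <- replace(rep("", n * ceiling(length(chars) / n)), seq_along(chars), chars)

# Matrices fill column-wise, so each column holds one chunk of n characters
m <- matrix(padded, nrow = n)

# Collapse every column back into a single string
apply(m, 2, paste0, collapse = "")
# [1] "abcd" "efgh" "i"
```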
Maurits Evers answered Oct 14 '22 08:10


You can use tidyr::separate:

library(tidyr)
n <- ((max(nchar(df$var2)) - 1) %/% 4) + 1
df %>% separate(var2, into=paste0("new_var", seq(n)), sep=seq(n-1)*4, remove = FALSE)
#   var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
# 1    1            abcdefghi     abcd     efgh        i                  
# 2    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop         
# 3    3                  abc      abc                                    
# 4    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst

We first count how many groups we'll have using integer division, then we define new names on the fly and split at relevant positions using numeric values in the sep argument.
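
The arithmetic behind those two steps can be checked in base R without tidyr. In this sketch chunk, cuts and into are illustrative names, and the substring call only mimics the positions that separate cuts at; it is not what separate runs internally.

```r
var2 <- c("abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst")
chunk <- 4

# Number of groups needed for the longest string (20 characters -> 5 groups)
n <- ((max(nchar(var2)) - 1) %/% chunk) + 1

# Split positions passed to sep: after characters 4, 8, 12 and 16
cuts <- seq(n - 1) * chunk

# Column names generated on the fly
into <- paste0("new_var", seq(n))

# The same positions applied with substring to the first string
substring(var2[1], c(1, cuts + 1), c(cuts, max(nchar(var2))))
# [1] "abcd" "efgh" "i"    ""     ""
```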

data

df <- read.table(text="var1  var2
1     abcdefghi 
2     abcdefghijklmnop
3     abc 
4     abcdefghijklmnopqrst", stringsAsFactors = FALSE, header = TRUE)
Moody_Mudskipper answered Oct 14 '22 06:10