Suppose I have a data frame like this, with a string vector var2:
var1 var2
1    abcdefghi
2    abcdefghijklmnop
3    abc
4    abcdefghijklmnopqrst
What is the most efficient way to split var2 every n characters into new columns, until the end of each string?
For example, splitting every 4 characters, the output would look like this:
var1 var2                 new_var1 new_var2 new_var3 new_var4 new_var5
1    abcdefghi            abcd     efgh     i
2    abcdefghijklmnop     abcd     efgh     ijkl     mnop
3    abc                  abc
4    abcdefghijklmnopqrst abcd     efgh     ijkl     mnop     qrst
Could the stringr package do this, for example with str_split_fixed? Or regular expressions, such as:
gsub("(.{4})", "\\1 ", "abcdefghi")
The solution needs to create as many new columns (new_var1 up to new_var_n) as the length of var2 requires, and var2 could be 10000 characters long, for example.
Alternatively, you can try read.fwf in base R. No special package is needed:
# Read var2 as fixed-width fields of 4 characters each
tmp <- read.fwf(
  textConnection(df$var2),
  widths = rep(4, ceiling(max(nchar(df$var2)) / 4)),
  stringsAsFactors = FALSE)
cbind(df, tmp)
#   var1                 var2   V1   V2   V3   V4   V5
# 1    1            abcdefghi abcd efgh    i <NA> <NA>
# 2    2     abcdefghijklmnop abcd efgh ijkl mnop <NA>
# 3    3                  abc  abc <NA> <NA> <NA> <NA>
# 4    4 abcdefghijklmnopqrst abcd efgh ijkl mnop qrst
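A related base R idea, sketched here as an assumption rather than as part of the answer above, is to take fixed-width slices with substr. The helper name split_every is hypothetical. It returns a character matrix with one column per chunk and pads short strings with "" instead of NA:

```r
# Hypothetical helper (not from the original answers): slice each string
# into chunks of n characters using substr, which is vectorised over x.
split_every <- function(x, n) {
  x <- as.character(x)
  # Number of columns needed for the longest string (at least one)
  n_cols <- max(1, ceiling(max(nchar(x)) / n))
  # Start position of each chunk: 1, n + 1, 2n + 1, ...
  starts <- seq(1, by = n, length.out = n_cols)
  out <- sapply(starts, function(s) substr(x, s, s + n - 1))
  # sapply drops to a plain vector for a single string, so force a matrix
  out <- matrix(out, nrow = length(x))
  colnames(out) <- paste0("new_var", seq_len(n_cols))
  out
}

res <- split_every(c("abcdefghi", "abc"), 4)
print(res)
```

The result could then be attached with cbind(df, split_every(df$var2, 4)), in the same way as the read.fwf output.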
Here is one option with data.table and a helper function fixed_split that I took from this answer and slightly modified (it uses tstrsplit instead of strsplit).
library(data.table)
fixed_split <- function(text, n) {
  # Split after every n characters using a zero-width lookbehind
  data.table::tstrsplit(text, paste0("(?<=.{", n, "})"), perl = TRUE)
}
First define n, the number of characters per chunk, and new_vars, the number of columns to add:
n <- 4
new_vars <- ceiling(max(nchar(df$var2)) / n)
setDT(df)[, paste0("new_var", seq_len(new_vars)) := fixed_split(var2, n = n)][]
#    var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
# 1:    1            abcdefghi     abcd     efgh        i     <NA>     <NA>
# 2:    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop     <NA>
# 3:    3                  abc      abc     <NA>     <NA>     <NA>     <NA>
# 4:    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst
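To see what the lookbehind pattern does on its own, here is a minimal sketch on a single string. It uses base strsplit rather than tstrsplit (an assumption made only to keep the example free of package dependencies). The pieces should match the first row above: abcd, efgh, i.

```r
# Split one string after every 4 characters.
# "(?<=.{4})" is a zero-width match preceded by exactly 4 characters;
# strsplit restarts the search on the remainder after each split.
x <- "abcdefghi"
chunks <- strsplit(x, "(?<=.{4})", perl = TRUE)[[1]]
print(chunks)
```

tstrsplit applies the same split to every element and transposes the result, so that each chunk position becomes one column.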
Here is an alternative using strsplit and matrix coercion:
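str_split_n <- function(x, n = 4) {
  sapply(x, function(ss) {
    nc <- nchar(as.character(ss))
    # Pad the characters to a multiple of n, fill an n-row matrix
    # column by column, then paste each column back into one chunk
    apply(matrix(replace(
      rep("", n * ceiling(nc / n)), 1:nc, unlist(strsplit(as.character(ss), ""))),
      nrow = n),
      2,
      paste0, collapse = "")
  })
}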
library(dplyr)
library(tidyr)
df %>%
  mutate(tmp = str_split_n(var2)) %>%
  unnest(tmp) %>%
  group_by(var1) %>%
  mutate(n = paste0("new_var", 1:n())) %>%
  spread(n, tmp)
# # A tibble: 4 x 7
# # Groups:   var1 [4]
#    var1 var2                 new_var1 new_var2 new_var3 new_var4 new_var5
#   <int> <fct>                <chr>    <chr>    <chr>    <chr>    <chr>
# 1     1 abcdefghi            abcd     efgh     i        NA       NA
# 2     2 abcdefghijklmnop     abcd     efgh     ijkl     mnop     NA
# 3     3 abc                  abc      NA       NA       NA       NA
# 4     4 abcdefghijklmnopqrst abcd     efgh     ijkl     mnop     qrst
You can use tidyr::separate:
library(tidyr)
n <- ((max(nchar(df$var2)) - 1) %/% 4) + 1
df %>% separate(var2, into=paste0("new_var", seq(n)), sep=seq(n-1)*4, remove = FALSE)
#   var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
# 1    1            abcdefghi     abcd     efgh        i
# 2    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop
# 3    3                  abc      abc
# 4    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst
We first count how many groups we'll have using integer division, then define the new column names on the fly and split at the relevant positions by passing numeric values to the sep argument.
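The arithmetic can be checked by hand for the example data. This is a small sketch in which the string lengths are hard-coded from the four rows of var2, and the names lens, n_cols and cuts are introduced only for illustration:

```r
# Lengths of the four example strings in var2
lens <- c(9, 16, 3, 20)

# Number of groups: integer division, with the -1 / +1 so that a length
# that is an exact multiple of 4 does not produce an extra empty group
n_cols <- ((max(lens) - 1) %/% 4) + 1

# Positions after which separate() should cut
cuts <- seq(n_cols - 1) * 4

print(n_cols)
print(cuts)
```

One caveat: if every string fits into a single group, n_cols - 1 is 0 and seq(0) returns c(1, 0) rather than an empty vector, so seq_len(n_cols - 1) would be the safer choice for that edge case.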
data
df <- read.table(text="var1 var2
1 abcdefghi
2 abcdefghijklmnop
3 abc
4 abcdefghijklmnopqrst", stringsAsFactors = FALSE, header = TRUE)