I have a data table containing 20,000+ rows and one column. The string in each row has a different number of words. I want to split the words and put each of them in a new column. I know how to do it word by word:
Data [ , Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split=" "), "[", 1))]
(Data is my data table and complaint is the name of the column.)
Obviously, this is not efficient because each row has a different number of words.
Could you please tell me about a more efficient way to do this?
To split a column into multiple columns in R, you can use the separate() function from the tidyr package. The separate() function splits a character column into multiple columns using a regular expression or numeric positions.
Use the split() function in R to split a vector or data frame into groups. Use the unsplit() function to reassemble the original vector or data frame from those groups.
The split() function in R can be used to split data into groups based on factor levels. This function uses the following basic syntax: split(x, f, …)
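As a small illustration of the syntax above, here is a sketch with made-up toy data (the vector x and the factor f are assumptions for this example only). Note that split() groups elements by a factor; breaking a string into words is the job of strsplit().

```r
# Toy data: values and a grouping factor (both invented for this sketch)
x <- c(10, 20, 30, 40, 50)
f <- factor(c("a", "b", "a", "b", "a"))

# split() returns a named list with one vector per factor level
groups <- split(x, f)
groups$a   # 10 30 50
groups$b   # 20 40

# unsplit() puts the pieces back in their original order
restored <- unsplit(groups, f)
restored   # 10 20 30 40 50
```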
Two functions, transpose() and tstrsplit(), are available since version 1.9.6 on CRAN.
With this we can do:
require(data.table)
setDT(tstrsplit(as.character(df$x), " ", fixed=TRUE))[]
# V1 V2 V3 V4
# 1: This is interesting NA
# 2: This actually is not
tstrsplit is a wrapper for transpose(strsplit(...)).
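To make the idea of transposing the strsplit() result concrete, here is a rough base-R sketch. It only illustrates the concept and is not how data.table implements tstrsplit(); the names parts, n_max, cols and res are made up for this example.

```r
x <- c("This is interesting", "This actually is not")

# One character vector of words per input string
parts <- strsplit(x, " ", fixed = TRUE)

# The longest row decides how many columns are needed
n_max <- max(lengths(parts))

# "Transpose" the list: element j collects the j-th word of every row.
# Indexing past the end of a vector yields NA, which pads the short rows.
cols <- lapply(seq_len(n_max), function(j) {
  vapply(parts, function(p) p[j], character(1))
})
names(cols) <- paste0("V", seq_len(n_max))

res <- as.data.frame(cols, stringsAsFactors = FALSE)
res
```

Each element of cols is one output column, which is why the result can be turned into a data.frame (or passed to setDT()) directly.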
Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).
Assuming KFB's sample data is at least slightly representative of your actual data, you can try:
library(splitstackshape)
cSplit(df, "x", " ")
# x_1 x_2 x_3 x_4
# 1: This is interesting NA
# 2: This actually is not
Another (blazingly fast) option is to use stri_split_fixed with simplify = TRUE from "stringi" (which will likely make its way into the "splitstackshape" code soon):
library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "This" "is" "interesting" NA
# [2,] "This" "actually" "is" "not"
Here is a solution based on rbind.fill.matrix(...) in the plyr package. On a dataset with 20,000 rows it runs in about 3.6 sec.
# create a sample dataset - you have this already
library(data.table)
words <- LETTERS[1:10] # "words" are just letters in this example
set.seed(1) # for reproducible example
w <- sapply(1:2e4,function(i)paste(words[sample(1:10,sample(1:10,1))],collapse=" "))
dt <- data.table(complaint=w) # column named "complaint", as in the question
head(dt)
# complaint
# 1: D F H
# 2: I J F
# 3: A B I E C D H
# 4: J D G H B I A E
# 5: A D G C
# 6: F E B J I
# you start here...
library(plyr)
result <- rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1))
result <- as.data.table(result)
head(result)
# 1 2 3 4 5 6 7 8 9 10
# 1: D F H NA NA NA NA NA NA NA
# 2: I J F NA NA NA NA NA NA NA
# 3: A B I E C D H NA NA NA
# 4: J D G H B I A E NA NA
# 5: A D G C NA NA NA NA NA NA
# 6: F E B J I NA NA NA NA NA
EDIT: Added some benchmarking based on @Ananda's comment below.
f.rfm <- function() as.data.table(rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1)))
library(splitstackshape)
f.csplit <- function() cSplit(dt, "complaint", " ",type.convert=FALSE)
library(stringi)
f.sl2m <- function() as.data.table(stri_list2matrix(strsplit(dt$complaint, split=" "), byrow = TRUE))
f.ssf <- function() as.data.table(stri_split_fixed(dt$complaint, " ", simplify = TRUE))
all.equal(f.rfm(),f.csplit(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.sl2m(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.ssf(),check.names=FALSE)
# [1] TRUE
library(microbenchmark)
microbenchmark(f.rfm(),f.csplit(),f.sl2m(),f.ssf(),times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# f.rfm() 3566.17724 3589.31203 3606.93303 3665.4087 3719.32299 10
# f.csplit() 98.05709 102.46456 104.51046 107.9588 117.26945 10
# f.sl2m() 55.45527 55.58852 56.75406 58.9347 67.44523 10
# f.ssf() 17.77499 17.98879 18.30831 18.4537 21.62161 10
So it looks like stri_split_fixed(...) is the winner.
This works for both data.table and data.frame:
# toy data
df <- structure(list(x = structure(c(2L, 1L), .Label = c("This actually is not",
"This is interesting"), class = "factor")), .Names = "x", row.names = c(NA,
-2L), class = "data.frame")
# x
# 1 This is interesting
# 2 This actually is not
# the code
split_result <- strsplit(as.character(df$x), " ")
length_n <- sapply(split_result, length)
length_max <- seq_len(max(length_n))
as.data.frame(t(sapply(split_result, "[", i = length_max))) # Or as.data.table(...)
# V1 V2 V3 V4
# 1 This is interesting <NA>
# 2 This actually is not
Example data would be nice, but if I understand what you want, it is not possible to do properly in a data frame. Given that there are different numbers of words in each row, you will need a list. Even so, it is very simple to split the words in the whole object.
If you run strsplit(as.character(Data[[1]]), " ") you will get a list with each element corresponding to a row in your data frame. From that, there are several different alternatives to rearrange this object, but the best approach will depend on your objective.
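As a rough sketch of what such rearrangements might look like in base R (the toy Data below and the names words, long, and wide are assumptions made for this example, not part of the question):

```r
# Toy stand-in for the real data (invented for this sketch)
Data <- data.frame(complaint = c("This is interesting", "This actually is not"),
                   stringsAsFactors = FALSE)

# List with one character vector of words per row
words <- strsplit(as.character(Data[[1]]), " ")

# Alternative 1: keep the list and query it directly
lengths(words)   # number of words in each row: 3 4

# Alternative 2: long format, one row per word, tagged with its source row
long <- data.frame(row  = rep(seq_along(words), lengths(words)),
                   word = unlist(words),
                   stringsAsFactors = FALSE)

# Alternative 3: wide matrix, short rows padded with NA
n_max <- max(lengths(words))
wide <- do.call(rbind, lapply(words, `length<-`, n_max))
```

The long format tends to suit counting or filtering words, while the wide matrix matches the one-column-per-word layout asked for in the question.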