Split text string into data.table columns

Tags: r, data.table

I have a script that reads in data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:

    library("data.table")
    df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"),
                    VALUE  = 1:6)
    dt = as.data.table(df)

    # split PREFIX into new columns
    dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
    dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))

    dt
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D

In the example above the column PREFIX is split into two new columns PX and PY on the "_" character.

Even though this works just fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have 10M+ rows, so time and memory efficiency become really important.

UPDATE:

Following @Frank's suggestion, I created a larger test case and ran the suggested commands, but stringr::str_split_fixed takes a lot longer than the original method.

    library("data.table")
    library("stringr")

    system.time ({
        df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                        VALUE  = rep(1:6, 1000000))
        dt = data.table(df)
    })
    #   user  system elapsed
    #  0.682   0.075   0.758

    system.time({ dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))] })
    #    user  system elapsed
    # 738.283   3.103 741.674

    rm(dt)
    system.time ( {
        df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                        VALUE  = rep(1:6, 1000000) )
        dt = as.data.table(df)
    })
    #    user  system elapsed
    #   0.123   0.000   0.123

    # split PREFIX into new columns
    system.time ({
        dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
        dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
    })
    #    user  system elapsed
    #  33.185   0.000  33.191

So the str_split_fixed method takes about 20 times longer (roughly 742 s versus 33 s elapsed).

316

asked Aug 09 '13 19:08

Derric Lewis

2 Answers

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

    require(data.table) ## v1.9.6+
    dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D

tstrsplit() is essentially a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. See ?tstrsplit and ?transpose for examples.
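To make the wrapper relationship concrete, here is a minimal sketch that runs the two steps by hand and then calls tstrsplit() directly. The vector `x` and the variable names are made up for illustration; the last element deliberately has no "_" to show how short rows are padded.

```r
library(data.table)  # assumes v1.9.6 or later

# Hypothetical sample input; the last element has no "_" on purpose
x <- c("A_B", "A_C", "B_D", "C")

# Step 1: strsplit() returns one character vector per input string
parts <- strsplit(x, "_", fixed = TRUE)

# Step 2: transpose() turns that list of rows into a list of columns,
# padding short rows with NA by default
cols_two_step <- transpose(parts)

# tstrsplit() performs both steps in a single call
cols_direct <- tstrsplit(x, "_", fixed = TRUE)

print(cols_two_step)
print(cols_direct)
```

Both results should be a list with one character vector per output column, which is the shape that `:=` expects on the right-hand side when several columns are assigned at once.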

See history for old answers.

127

answered Sep 20 '22 11:09

Arun

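I'm adding an answer for anyone who does not use data.table v1.9.5 or later and still wants a one-line solution. The separator must match the data, which is "_" in the question's example.

    dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '_', fixed = TRUE)))]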
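For readers wondering what the do.call(Map, ...) idiom does, the following base R sketch runs it on a small vector, using "_" as the separator to match the question's data. The names `x`, `parts` and `cols` are made up for illustration.

```r
# Hypothetical sample input using the "_" separator from the question
x <- c("A_B", "A_C", "B_D")

# One character vector of pieces per input string
parts <- strsplit(x, "_", fixed = TRUE)

# c(f = c, parts) builds the argument list for Map: the function c()
# followed by every row. Map then combines the first pieces of all rows,
# then the second pieces, which transposes rows into columns.
cols <- do.call(Map, c(f = c, parts))

# Map names its result after the first row's pieces, so drop the names
cols <- unname(cols)

print(cols)
```

Note that this passes one argument per row to Map, so it is likely to be slow on tables with millions of rows. It also assumes every string splits into the same number of pieces.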

answered Sep 18 '22 11:09

Minh Ha Pham
