
Split text string in a data.table columns

Tags: r, data.table

I have a script that reads in data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:

    library("data.table")
    df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"),
                    VALUE  = 1:6)
    dt = as.data.table(df)

    # split PREFIX into new columns
    dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
    dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))

    dt
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D

In the example above the column PREFIX is split into two new columns PX and PY on the "_" character.

Even though this works just fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have 10M+ rows, so time and memory efficiency become really important.


UPDATE:

Following @Frank's suggestion, I created a larger test case and tried the suggested command, but stringr::str_split_fixed takes a lot longer than the original method.

    library("data.table")
    library("stringr")

    system.time({
        df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                        VALUE  = rep(1:6, 1000000))
        dt = data.table(df)
    })
    #   user  system elapsed
    #  0.682   0.075   0.758

    system.time({
        dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))]
    })
    #    user  system elapsed
    # 738.283   3.103 741.674

    rm(dt)
    system.time({
        df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                        VALUE  = rep(1:6, 1000000))
        dt = as.data.table(df)
    })
    #    user  system elapsed
    #   0.123   0.000   0.123

    # split PREFIX into new columns
    system.time({
        dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
        dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
    })
    #    user  system elapsed
    #  33.185   0.000  33.191

So the str_split_fixed method takes about 20 times longer (roughly 742 s versus 33 s elapsed).

Derric Lewis asked Aug 09 '13 19:08



2 Answers

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

    require(data.table) ## v1.9.6+
    dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D

tstrsplit() is essentially a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. See ?tstrsplit and ?transpose for examples.
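To make the wrapper relationship concrete, here is a minimal sketch, assuming data.table v1.9.6 or later is installed. The vector x and the variable names are made up for illustration and are not part of the original answer. It compares the result of tstrsplit() with calling transpose() on the output of strsplit():

```r
library(data.table)  # assumes v1.9.6+, which provides tstrsplit() and transpose()

# Illustrative input, not from the original post
x <- c("A_B", "A_C", "B_D")

# strsplit() returns one character vector per input string
parts <- strsplit(x, "_", fixed = TRUE)

# transpose() regroups the pieces into one vector per split position
by_position <- transpose(parts)

# tstrsplit() performs both steps in a single call
direct <- tstrsplit(x, "_", fixed = TRUE)

# Compare the two results
print(identical(by_position, direct))
```

With default arguments the comparison is expected to print TRUE. Because the result is a list with one element per split position, it can be assigned directly to several columns with :=.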

See history for old answers.

Arun answered Sep 20 '22 11:09


I'm adding this answer for anyone who is not using data.table v1.9.5 and still wants a one-line solution.

    # split on "_" (the delimiter used in the question's PREFIX column)
    dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '_')))]
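As a rough illustration of why the one-liner works, the sketch below uses only base R on a made-up vector x (not from the original answer). do.call(Map, c(f = c, parts)) calls Map() with c as the function and every split result as a separate argument, so c() is applied first across all the first pieces and then across all the second pieces:

```r
# Illustrative input, not from the original post
x <- c("A_B", "A_C", "B_D")

# One character vector of length 2 per input string
parts <- strsplit(x, "_", fixed = TRUE)

# Equivalent to Map(c, parts[[1]], parts[[2]], parts[[3]]):
# element 1 collects the first pieces, element 2 the second pieces
cols <- do.call(Map, c(f = c, parts))

first  <- unname(cols[[1]])
second <- unname(cols[[2]])

print(first)
print(second)
```

This sketch assumes every string splits into the same number of pieces; ragged input would need separate handling.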
Minh Ha Pham answered Sep 18 '22 11:09