Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient replacement for for-loop when splitting strings in R

Tags:

performance

r

I have a large dataframe (20 columns, >100k rows) and need to split a column of character strings into multiple new columns.

The first 3 observations of the column in question are something like this:

scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour", 
                 "L-S Spine,L-S Spine"))

which looks like this:

                          scan
1 CT Cervical Sp,CT Head Plain
2                  II < 1 Hour
3          L-S Spine,L-S Spine

I need to split this into 5 columns (there are a maximum of 5 substrings in each observation), and for observations with fewer substrings I want the remaining columns filled with NAs. I am currently using this code:

scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))

for(i in 1:nrow(scans)){
  scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
  scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
  scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
  scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
  scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}

which works and outputs my desired solution:

                          scan          scan1         scan2 scan3 scan4 scan5
1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain    NA    NA    NA
2                  II < 1 Hour    II < 1 Hour            NA    NA    NA    NA
3          L-S Spine,L-S Spine      L-S Spine     L-S Spine    NA    NA    NA

... but it is really slow. Looping over tens or hundreds of thousands of observations is time consuming.

Many thanks for any advice.

like image 805
Luke_radio Avatar asked Apr 01 '26 04:04

Luke_radio


1 Answers

Another way is to use tstrsplit in the devel version of data.table

library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
#                V1            V2
# 1: CT Cervical Sp CT Head Plain
# 2:    II < 1 Hour            NA
# 3:      L-S Spine     L-S Spine 

If you sure you will have 5 splits at least once, you could easily create these columns by reference

setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]

Alternatively, the tidyr package offers a similar functuanality

library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
#                           scan          scan1         scan2
# 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
# 2                  II < 1 Hour    II < 1 Hour          <NA>
# 3          L-S Spine,L-S Spine      L-S Spine     L-S Spine

Or another option using only base R

 cbind(scans, read.table(text= as.character(scans$scan),sep=",", fill=TRUE, na.strings=''))
like image 88
David Arenburg Avatar answered Apr 02 '26 21:04

David Arenburg