I have a large dataframe (20 columns, >100k rows) and need to split a column of character strings into multiple new columns.
The first 3 observations of the column in question are something like this:
scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))
which looks like this:
scan
1 CT Cervical Sp,CT Head Plain
2 II < 1 Hour
3 L-S Spine,L-S Spine
I need to split this into 5 columns (there are a maximum of 5 substrings in each observation), and for observations with fewer substrings I want the remaining columns filled with NAs. I am currently using this code:
scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))
for(i in 1:nrow(scans)){
scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}
which works and outputs my desired solution:
scan scan1 scan2 scan3 scan4 scan5
1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain NA NA NA
2 II < 1 Hour II < 1 Hour NA NA NA NA
3 L-S Spine,L-S Spine L-S Spine L-S Spine NA NA NA
... but it is really slow. Looping over tens or hundreds of thousands of observations is time consuming.
Many thanks for any advice.
Another way is to use tstrsplit in the devel version of data.table
library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
# V1 V2
# 1: CT Cervical Sp CT Head Plain
# 2: II < 1 Hour NA
# 3: L-S Spine L-S Spine
If you sure you will have 5 splits at least once, you could easily create these columns by reference
setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]
Alternatively, the tidyr package offers a similar functuanality
library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
# scan scan1 scan2
# 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
# 2 II < 1 Hour II < 1 Hour <NA>
# 3 L-S Spine,L-S Spine L-S Spine L-S Spine
Or another option using only base R
cbind(scans, read.table(text= as.character(scans$scan),sep=",", fill=TRUE, na.strings=''))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With