I have a dataset that has widths at every pixel position along a central skeleton. The width is output as a single string that is comma delimited.
cukeDatatest <- read.delim("https://gist.githubusercontent.com/bhive01/e7508f552db0415fec1749d0a390c8e5/raw/a12386d43c936c2f73d550dfdaecb8e453d19cfe/widthtest.tsv")
str(cukeDatatest) # or dplyr::glimpse(cukeDatatest)
I need to keep the File and FruitNum identifiers with the widths.
The output I want has three columns (File, FruitNum, ObjectWidth), with File and FruitNum repeated once per width value for that fruit. Position is important, so sorting these vectors would be really bad. Also, every fruit is a different length (if that matters for your method).
I've used str_split() before to dissect a few elements from a string, but never something this large, nor so many of them (I have 8000 of them). Processing time is a concern, but I would wait for a correct result.
I'm more used to dplyr than data.table, but I see that there are some efforts from Arun in this: R split text string in a data.table columns
strsplit() is a base R function that splits each element of a character vector into substrings. Its main arguments are x, the character vector to split, and split, the delimiter (a string or regular expression) to split on.
Using the splitstackshape package:
library(splitstackshape)
res <- cSplit(cukeDatatest, splitCols = "ObjectWidth", sep = ",", direction = "long")
# result
head(res)
# File FruitNum ObjectWidth
# 1: IMG_7888.JPGcolcorrected.jpg 1 4
# 2: IMG_7888.JPGcolcorrected.jpg 1 10
# 3: IMG_7888.JPGcolcorrected.jpg 1 14
# 4: IMG_7888.JPGcolcorrected.jpg 1 15
# 5: IMG_7888.JPGcolcorrected.jpg 1 22
# 6: IMG_7888.JPGcolcorrected.jpg 1 26
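Since the question mentions being more at home with dplyr, a tidyverse version of the same long reshape might look like the sketch below. It assumes tidyr is installed and uses a small made-up data frame (toy) in place of cukeDatatest; I have not timed it on 8000 rows.

```r
library(tidyr)  # assumption: tidyr is available

# Toy stand-in for cukeDatatest (hypothetical values, not the real data)
toy <- data.frame(
  File        = c("a.jpg", "a.jpg", "b.jpg"),
  FruitNum    = c(1L, 2L, 1L),
  ObjectWidth = c("4,10,14", "7,3", "9,8,2,5"),
  stringsAsFactors = FALSE
)

# One output row per comma-separated value; File and FruitNum are repeated.
# convert = TRUE turns the split strings into integers.
long <- separate_rows(toy, ObjectWidth, sep = ",", convert = TRUE)
long
```

Rows come out in the order the values appear in each string, so nothing is sorted.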
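I'd normally start with a simple strsplit (here dt is the data as a data.table, e.g. dt <- as.data.table(cukeDatatest), with ObjectWidth stored as character rather than factor):
dt[, strsplit(ObjectWidth, ",", fixed = TRUE)[[1]], by = .(File, FruitNum)]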
If that's too slow, I'd run strsplit on the entire column and then rearrange the data to my liking:
l <- strsplit(dt$ObjectWidth, ",", fixed = TRUE)
# repeat each row index once per width, keep the identifiers, then attach the widths
dt[rep(seq_along(l), lengths(l)), .(File, FruitNum)][, ObjectWidth := unlist(l)][]
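To check that the "split once, then repeat the identifiers" idea keeps every width in its original position, here is a minimal base R sketch. The data frame toy and the extra Position column are my own illustrative additions, not part of the original data or the answers above.

```r
# Toy stand-in for cukeDatatest (hypothetical values, not the real data)
toy <- data.frame(
  File        = c("a.jpg", "a.jpg", "b.jpg"),
  FruitNum    = c(1L, 2L, 1L),
  ObjectWidth = c("4,10,14", "7,3", "9,8,2,5"),
  stringsAsFactors = FALSE
)

# Split every string once; each list element keeps its left-to-right order
parts <- strsplit(toy$ObjectWidth, ",", fixed = TRUE)

# Repeat each row index once per width found in that row
idx <- rep(seq_along(parts), lengths(parts))

long <- data.frame(
  File        = toy$File[idx],
  FruitNum    = toy$FruitNum[idx],
  Position    = sequence(lengths(parts)),  # 1, 2, 3, ... within each fruit
  ObjectWidth = as.integer(unlist(parts)),
  stringsAsFactors = FALSE
)
long
```

Carrying an explicit Position column is optional, but it makes the pixel order survive any later sort or join.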