
Split string and transpose result

I have a dataset that has widths at every pixel position along a central skeleton. The width is output as a single string that is comma delimited.

cukeDatatest <- read.delim("https://gist.githubusercontent.com/bhive01/e7508f552db0415fec1749d0a390c8e5/raw/a12386d43c936c2f73d550dfdaecb8e453d19cfe/widthtest.tsv")
str(cukeDatatest) # or dplyr::glimpse(cukeDatatest)

I need to keep the File and FruitNum identifiers with the widths.

The output I want has three columns: File, FruitNum, and ObjectWidth, with File and FruitNum repeated once for every width value of that fruit. Position is important, so anything that sorts or reorders these vectors would be really bad. Also, every fruit is a different length (if that matters for your method).

I've used str_split() before to dissect a few elements from a string, but never something this large, nor so many of them (I have 8000 of them). Processing time is a concern, but I would wait for a correct result.

I'm more used to dplyr than data.table, but I see that there are some efforts from Arun in this question: "R split text string in a data.table columns".

asked May 16 '16 at 19:05 by bhive01


2 Answers

Using the splitstackshape package:

library(splitstackshape)
res <- cSplit(cukeDatatest, splitCols = "ObjectWidth", sep = ",", direction = "long")

# result
head(res)
#                            File FruitNum ObjectWidth
# 1: IMG_7888.JPGcolcorrected.jpg        1           4
# 2: IMG_7888.JPGcolcorrected.jpg        1          10
# 3: IMG_7888.JPGcolcorrected.jpg        1          14
# 4: IMG_7888.JPGcolcorrected.jpg        1          15
# 5: IMG_7888.JPGcolcorrected.jpg        1          22
# 6: IMG_7888.JPGcolcorrected.jpg        1          26
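
Since position matters here, a quick sanity check on a small made-up table may be worthwhile before running all 8000 rows. The sketch below assumes splitstackshape is installed; `toy`, its file names, and its width values are hypothetical stand-ins, not the real data. It compares the long result against the values obtained by splitting each string directly.

```r
library(splitstackshape)

# Toy stand-in for cukeDatatest (hypothetical values, not the real data)
toy <- data.frame(
  File        = c("a.jpg", "a.jpg"),
  FruitNum    = c(1L, 2L),
  ObjectWidth = c("4,10,14", "7,3"),
  stringsAsFactors = FALSE
)

# Same call as above, applied to the toy table
toy_long <- cSplit(toy, splitCols = "ObjectWidth", sep = ",", direction = "long")

# Widths in their original left-to-right, row-by-row order
expected <- as.integer(unlist(strsplit(toy$ObjectWidth, ",", fixed = TRUE)))

# TRUE when the long result kept every width in its original position
order_kept <- identical(as.integer(toy_long$ObjectWidth), expected)
print(order_kept)
```

If `order_kept` comes back FALSE on your own data, the result should not be trusted for position-dependent analysis.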

answered Oct 04 '22 at 17:10 by zx8754


I'd normally start with simple strsplit:

# dt is assumed to be as.data.table(cukeDatatest), with ObjectWidth read in as character
dt[, strsplit(ObjectWidth, ",", fixed = TRUE)[[1]], by = .(File, FruitNum)]

If that's too slow, I'd run strsplit on the entire column and then rearrange the data to my liking:

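# split every row's string in one vectorised call
l = strsplit(dt$ObjectWidth, ",", fixed = TRUE)

# repeat row i of dt lengths(l)[i] times, then attach the split values in their original order
dt[inverse.rle(list(lengths = lengths(l), values = seq_along(l))),
   .(File, FruitNum)][, col := unlist(l)][]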
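
For readers without data.table, the same "split once, then repeat the identifiers" idea can be sketched in base R. This is an illustration rather than the code above: `toy` and its values are made up, and `rep(seq_along(parts), lengths(parts))` is used in place of the `inverse.rle()` call, which expands to the same index vector.

```r
# Toy stand-in for cukeDatatest (hypothetical values, not the real data)
toy <- data.frame(
  File        = c("a.jpg", "a.jpg", "b.jpg"),
  FruitNum    = c(1L, 2L, 1L),
  ObjectWidth = c("4,10,14", "7,3", "9"),
  stringsAsFactors = FALSE
)

# Split every string once; each list element keeps its left-to-right order
parts <- strsplit(toy$ObjectWidth, ",", fixed = TRUE)

# Source row index repeated once per width that row produced
idx <- rep(seq_along(parts), lengths(parts))

# Identifiers are repeated via idx; widths are flattened in the same order
long <- data.frame(
  File        = toy$File[idx],
  FruitNum    = toy$FruitNum[idx],
  ObjectWidth = as.integer(unlist(parts)),
  stringsAsFactors = FALSE
)

print(long)
```

Nothing here sorts the data, so each fruit's widths stay in their original positions, and fruits of different lengths need no padding.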

answered Oct 04 '22 at 18:10 by eddi