
Extracting nth element from a nested list following strsplit - R

Tags: r, sapply, strsplit

I've been trying to understand how to deal with the output of strsplit a bit better. I often have data such as this that I wish to split:

mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

#[1] "144/4/5" "154/2"   "146/3/5" "142"     "143/4"   "DNB"     "90"     

After splitting, the results are as follows:

strsplit(mydata, "/")

#[[1]]
#[1] "144" "4"   "5"  

#[[2]]
#[1] "154" "2"  

#[[3]]
#[1] "146" "3"   "5"  

#[[4]]
#[1] "142"

#[[5]]
#[1] "143" "4"  

#[[6]]
#[1] "DNB"

#[[7]]
#[1] "90"

I know from the strsplit help page that final empty strings are not produced. Therefore, each of my results will have 1, 2 or 3 elements, depending on how many "/" separators the string contains.

Getting the first element is trivial:

sapply(strsplit(mydata, "/"), "[[", 1)

#[1] "144" "154" "146" "142" "143" "DNB" "90" 

But I am not sure how to get the 2nd, 3rd, etc. when each result has a different number of elements.

sapply(strsplit(mydata, "/"), "[[", 2)

# Error in FUN(X[[4L]], ...) : subscript out of bounds

A working solution would ideally return the following:

#[1] "4" "2" "3" NA  "4" NA  NA

This is a relatively small example. I could easily write a for loop for these data, but for real data, with thousands of observations to run strsplit on and dozens of elements produced from each, I was hoping to find a more generalizable solution.

asked Sep 01 '14 15:09 by jalapic


1 Answer

At least for 1D vectors, [ seems to return NA when i > length(x), whereas [[ raises an error.

x = runif(5)
x[6]
#[1] NA
x[[6]]
#Error in x[[6]] : subscript out of bounds
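Applied to the data in the question, this suggests passing "[" rather than "[[" to sapply. The following is a minimal sketch, assuming each piece is a plain character vector, as strsplit returns; the variable names parts and second are only illustrative:

```r
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

# Split once and reuse the resulting list
parts <- strsplit(mydata, "/")

# "[" yields NA for an index past the end of a vector,
# so short pieces no longer trigger an error
second <- sapply(parts, "[", 2)
second
#[1] "4" "2" "3" NA  "4" NA  NA

# vapply is a stricter variant: it checks that each result is one string
third <- vapply(parts, `[`, character(1), 3)
third
#[1] "5" NA  "5" NA  NA  NA  NA
```

Note that the missing values are real NA values (NA_character_), not the string "NA".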

Digging a bit, do_subset_dflt (i.e. [) calls ExtractSubset, where we notice that NA is returned when a wanted index ("ii") is "> length(x)" (a bit modified to be clean):

if(0 <= ii && ii < nx && ii != NA_INTEGER)
    result[i] = x[ii];
else
    result[i] = NA_INTEGER;

On the other hand do_subset2_dflt (i.e. [[) returns an error if the wanted index ("offset") is "> length(x)" (modified a bit to be clean):

if(offset < 0 || offset >= xlength(x)) {
    if(offset < 0 && (isNewList(x)) ...
    else errorcall(call, R_MSG_subs_o_b);
}

where #define R_MSG_subs_o_b _("subscript out of bounds")

(I'm not sure about the above code snippets, but they do seem relevant given the behaviour they produce.)
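For the more general case raised in the question (dozens of elements per string), the same NA-padding behaviour of [ can be used to pull every position at once. This is only a sketch under the assumption that a character matrix is an acceptable result; parts, n and mat are illustrative names:

```r
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
parts <- strsplit(mydata, "/")

# Longest piece determines how many columns are needed
n <- max(lengths(parts))

# Index every piece with 1..n; "[" pads the short ones with NA.
# rbind stacks the padded vectors into one row per input string.
mat <- do.call(rbind, lapply(parts, `[`, seq_len(n)))

dim(mat)
#[1] 7 3

mat[, 2]
#[1] "4" "2" "3" NA  "4" NA  NA
```

Column k of mat then holds the k-th element of every split string, so no per-position call is needed. lengths() requires R 3.2.0 or later; on older versions sapply(parts, length) does the same job.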

answered Nov 09 '22 01:11 by alexis_laz