I was browsing some answers concerning strsplit in R. Example text:
fileName <- c("hello.w-rp-al",
"how.nez-r",
"do.qs-sdz",
"you.d-aerd",
"do.dse-e")
I wanted to get the first element of the created list and thought I could use something such as
fileNameSplit <- strsplit(fileName, "[.]")
node_1 <- fileNameSplit[0]
node_2 <- fileNameSplit[1]
But that didn't work.
Then I found this answer that suggests using sapply with `[`. This does work.
d <- data.frame(fileName)
fileNameSplit <- strsplit(d$fileName, "[.]")
d$node_1 <- sapply(fileNameSplit, "[", 1)
d$node_2 <- sapply(fileNameSplit, "[", 2)
However, I'm trying to figure out why. What exactly is happening, and what does `[` have to do with anything? It's semantically confusing in my opinion.
sapply operates on lists, which are vectors where each element can take any form. In the special case of your fileNameSplit list, we know that each element of the list is a character vector with two elements.
> fileNameSplit
[[1]]
[1] "hello" "w-rp-al"
[[2]]
[1] "how" "nez-r"
[[3]]
[1] "do" "qs-sdz"
[[4]]
[1] "you" "d-aerd"
[[5]]
[1] "do" "dse-e"
To extract the first element from each of these character vectors, we have to iterate over the list, which is what
sapply(fileNameSplit, `[`, 1)
does. It may be clearer when written as
sapply(fileNameSplit, function(x) x[1])
The documentation at ?`[` and ?sapply explains why the shorter version works.
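We use 1 because that is where indexing starts in R (unlike many other languages, which start at 0). That is also why fileNameSplit[0] in the question returned nothing useful: index 0 selects no elements, so the result is an empty list.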
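The different bracket forms can be checked directly at the console. This is a minimal sketch that reuses the fileName vector from the question; the variable names zero, first_as_list and first_element are only illustrative.

```r
fileName <- c("hello.w-rp-al", "how.nez-r", "do.qs-sdz", "you.d-aerd", "do.dse-e")
fileNameSplit <- strsplit(fileName, "[.]")

# Index 0 selects nothing, so the result is an empty list
zero <- fileNameSplit[0]
length(zero)            # 0

# Single brackets on a list return a sub-list, not the element itself
first_as_list <- fileNameSplit[1]
class(first_as_list)    # "list"

# Double brackets return the element: the character vector for the first file name
first_element <- fileNameSplit[[1]]
first_element           # "hello" "w-rp-al"

# Indexing that character vector gives the first piece of the first file name
first_element[1]        # "hello"
```

So fileNameSplit[1] only narrows the list to its first entry. Getting the first piece of every entry needs something that visits each entry in turn, which is the job sapply does.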
R is very Lisp-like. The symbol `[` is actually a function. When you write mylist[1], what actually happens "under the hood" is that mylist becomes the first argument of the `[` function, and the numbered or named items inside the square brackets (only one in this instance) are passed as the remaining arguments, so the call becomes:
`[`(mylist, 1) # that will also succeed if you type it at the command line
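To see that the two spellings really are the same call, they can be compared with identical(). This is a small sketch; mylist here is a hypothetical two-element list made up for illustration, not an object from the question.

```r
mylist <- list(a = "hello", b = "w-rp-al")   # hypothetical example list

# Bracket syntax and the explicit function call give the same result
same_single <- identical(mylist[1], `[`(mylist, 1))          # TRUE

# `[[` is a function too, and so are other operators such as `+`
same_double <- identical(mylist[["a"]], `[[`(mylist, "a"))   # TRUE
three <- `+`(1, 2)                                           # 3

# Because `[` is a function object, it can be handed to other functions
bracket_is_function <- is.function(`[`)                      # TRUE
```

The backticks are only there so that R reads the symbol as a name instead of trying to parse it as the start of an indexing expression.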
Both sapply and lapply have a trailing triple-dots (...) argument, and whatever is supplied there is handed on to the function in every call. So the items passed to `[` as its first argument are just the elements of fileNameSplit, one at a time, and the 1 is passed along unchanged as the second argument each time. You therefore get the first item of each of those sublists. The sapply function creates a series of calls like:
`[`(mylist[[1]], 1) # as the first one with 2,3, ... in the [[.]] for succeeding calls
It then returns the results as a vector, a matrix, or a list (depending on whether they all have the same length and on the setting of the simplify argument).
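The series of calls can be written out by hand to make the mechanism concrete. The loop below is only a rough sketch of the idea, not how sapply is actually implemented, and it uses a shortened version of the question's data.

```r
fileNameSplit <- strsplit(c("hello.w-rp-al", "how.nez-r", "do.qs-sdz"), "[.]")

# Roughly what sapply(fileNameSplit, `[`, 1) does, written as an explicit loop
results <- vector("list", length(fileNameSplit))
for (i in seq_along(fileNameSplit)) {
  # each list element is the first argument; the extra argument 1 is the same every time
  results[[i]] <- `[`(fileNameSplit[[i]], 1)
}
by_hand <- unlist(results)

# The `[` form and the anonymous-function form agree with the loop
with_bracket <- sapply(fileNameSplit, `[`, 1)
with_lambda  <- sapply(fileNameSplit, function(x) x[1])

by_hand        # "hello" "how" "do"
```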
Because you used sapply with no simplify argument, the default TRUE is used, so the result is passed to simplify2array and comes back to you as a vector instead of the list that would have been returned had you just used lapply.
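The difference in return types is easy to inspect. The sketch below again uses a shortened version of the question's data. The vapply call at the end is an optional aside that the answer above does not mention: it is the base R variant that asks you to state the expected type and length of each result.

```r
fileNameSplit <- strsplit(c("hello.w-rp-al", "how.nez-r", "do.qs-sdz"), "[.]")

as_list      <- lapply(fileNameSplit, `[`, 1)                    # always a list
as_vector    <- sapply(fileNameSplit, `[`, 1)                    # simplified to a character vector
unsimplified <- sapply(fileNameSplit, `[`, 1, simplify = FALSE)  # a list again

class(as_list)     # "list"
class(as_vector)   # "character"

# Aside: vapply declares up front that each result is a single string
checked <- vapply(fileNameSplit, `[`, character(1), 1)
```

With simplify = FALSE, sapply returns the same list that lapply does, which shows that the simplification step is the only difference in this case.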