I'm sure this is kind of basic, but I'd just like to really understand the logic of R data structures here.
If I subset a matrix by index out of bounds, I get exactly that error:
m <- matrix(data = c("foo", "bar"), nrow = 1)
m[2,]
# Error in m[2, ] : subscript out of bounds
If I do the same to a data frame, however, I get a row of all NA:
df <- data.frame(foo = "foo", bar = "bar")
df[2,]
# foo bar
# NA <NA> <NA>
If I subset a non-existent data frame column, I get the familiar error:
df[, 3]
# Error in `[.data.frame`(df, , 3) : undefined columns selected
I know (roughly) that data frame rows are weird and to be treated carefully, but I don't quite see the connection to the above behavior.
Can someone explain why R behaves in this way for non-existent df rows?
Update
To be sure, giving NA on out-of-bounds subsets is normal R behavior for 1D vectors:
vec <- c("foo", "bar")
vec[3]
# [1] NA
So in a way, the odd one out here is matrix subsetting, not data frame subsetting, depending on where you're starting from.
Still, the different 2D subsetting behavior (m[2, ] vs df[2, ]) might strike a dense user (as I am right now) as inconsistent.
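For comparison, here are the cases above side by side. This is only a sketch: try_subset is a hypothetical helper name (not part of base R) that returns either the result or the error message, so all four calls can run in one script.

```r
vec <- c("foo", "bar")
m   <- matrix(data = c("foo", "bar"), nrow = 1)
df  <- data.frame(foo = "foo", bar = "bar")

# Hypothetical helper: return the value, or the error message if subsetting fails.
try_subset <- function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

try_subset(vec[3])   # NA
try_subset(m[2, ])   # "subscript out of bounds"
try_subset(df[2, ])  # a one-row data frame whose cells are all NA
try_subset(df[, 3])  # "undefined columns selected"
```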
Both matrices and data frames are 'rectangular' data types, meaning that they store tabular data in rows and columns. The main difference is that a matrix can contain only a single class of data, while a data frame can contain a different class in each column.
The data.frame() function works much like cbind(); the main difference is that in data.frame() you assign a name to each column as you define it. Again, unlike matrices, data frames can contain both string vectors and numeric vectors within the same object.
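A small sketch of that difference, using made-up columns id and name: cbind() has to coerce mixed input to one type, while data.frame() keeps a type per column.

```r
# A matrix holds a single type: mixing integers and strings coerces everything to character.
mat <- cbind(id = 1:2, name = c("a", "b"))
typeof(mat)             # "character"

# A data frame keeps one type per column.
dat <- data.frame(id = 1:2, name = c("a", "b"))
typeof(dat$id)          # "integer"
is.numeric(dat$name)    # FALSE (character in R >= 4.0.0, factor under older defaults)
```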
Can someone explain why R behaves in this way[?]
Short answer: No, probably not.
Longer answer:
Once upon a time I was thinking about something similar and read this thread on R-devel: "Definition of [[". Basically it boils down to:
The semantics of [ and [[ don't seem to be fully specified in the Reference manual. [...] I assume that these are features, not bugs, but I can't find documentation for them
Duncan Murdoch, a former member of the R core team, gives a very nice reply:
There is more documentation in the man page for Extract, but I think it is incomplete. The most complete documentation is of course the source code*, but it may not answer the question of what's intentional and what's accidental
As mentioned in the R-devel thread, the only description in the manual is 3.4.1 Indexing by vectors:
If i is positive and exceeds length(x) then the corresponding selection is NA
But this applies only to "indexing of simple vectors". Similar out-of-bounds indexing for "non-simple" vectors does not seem to be described. Duncan Murdoch again:
So what is a simple vector? That is not explicitly defined, and it probably should be.
Thus, it may be that no one knows the answer to your "why" question.
See also "8.2.13 nonexistent value in subscript" in the excellent R Inferno by Patrick Burns, and the section "Missing/out of bounds indices" in Hadley's book.
*Source code for the [ subset operator. A search for R_MSG_subs_o_b (which corresponds to the error message "subscript out of bounds") provides no obvious clue as to why OOB [ indexing of matrices and OOB indexing with [[ give an error, whereas OOB [ indexing of "simple vectors" results in NA.
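To make the contrast concrete, here is a sketch of [ versus [[ on an atomic vector and on a list. msg_of is a hypothetical helper name used only to capture the error text; it is not part of base R.

```r
vec <- c("foo", "bar")     # atomic ("simple") vector
lst <- list("foo", "bar")  # generic vector (list)

# Hypothetical helper: return the value, or the error message if indexing fails.
msg_of <- function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

vec[3]            # NA
lst[3]            # a length-1 list holding NULL
msg_of(vec[[3]])  # "subscript out of bounds"
msg_of(lst[[3]])  # "subscript out of bounds"
```

So even for a 1D vector, only [ is lenient; [[ fails in the same way as out-of-bounds matrix indexing.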