
Why does 'out of bounds' indexing differ between a matrix and a data.frame?

I'm sure this is kind of basic, but I'd just like to really understand the logic of R data structures here.

If I subset a matrix by index out of bounds, I get exactly that error:

m <- matrix(data = c("foo", "bar"), nrow = 1)
m[2,]
# Error in m[2, ] : subscript out of bounds

If I do the same to a data frame, however, I get a row of NAs:

df <- data.frame(foo = "foo", bar = "bar")
df[2,]
#    foo  bar
# NA <NA> <NA>

If I subset a non-existent data frame column, I get the familiar error:

df[, 3]
# Error in `[.data.frame`(df, , 3) : undefined columns selected

I know (roughly) that data frame rows are weird and to be treated carefully, but I don't quite see the connection to the above behavior.

Can someone explain why R behaves in this way for non-existent df rows?

Update

To be sure, returning NA for out-of-bounds subsets is normal R behavior for 1D vectors:

vec <- c("foo", "bar")
vec[3]
# [1] NA

So in a way, the odd one out here is matrix subsetting, not data frame subsetting, depending on where you start from. Still, the different 2D subsetting behavior (m[2, ] vs df[2, ]) might strike a dense user (as I am right now) as inconsistent.
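To put the cases from the question side by side, here is a minimal sketch. The helper `raises_error` is a hypothetical name introduced only for this illustration; it reports whether evaluating an expression signals an error.

```r
vec <- c("foo", "bar")
m   <- matrix(data = c("foo", "bar"), nrow = 1)
df  <- data.frame(foo = "foo", bar = "bar")

# Hypothetical helper: TRUE if evaluating `expr` signals an error
raises_error <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

vec_oob    <- raises_error(vec[3])   # FALSE: 1D vector gives NA
mat_oob    <- raises_error(m[2, ])   # TRUE:  subscript out of bounds
df_row_oob <- raises_error(df[2, ])  # FALSE: data frame gives a row of NAs
df_col_oob <- raises_error(df[, 3])  # TRUE:  undefined columns selected

c(vec = vec_oob, matrix = mat_oob, df_row = df_row_oob, df_col = df_col_oob)
```

So out-of-bounds rows of a data frame follow the vector behavior, while out-of-bounds columns of a data frame and out-of-bounds rows of a matrix both raise errors.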

asked Nov 23 '18 14:11 by maxheld



1 Answer

Can someone explain why R behaves in this way[?]

Short answer: No, probably not.


Longer answer: Once upon a time I was thinking about something similar and read this thread on R-devel: Definition of [[. Basically it boils down to:

The semantics of [ and [[ don't seem to be fully specified in the Reference manual. [...] I assume that these are features, not bugs, but I can't find documentation for them

Duncan Murdoch, a former member of the R core team, gives a very nice reply:

There is more documentation in the man page for Extract, but I think it is incomplete. The most complete documentation is of course the source code*, but it may not answer the question of what's intentional and what's accidental

As mentioned in the R-devel thread, the only description in the manual is 3.4.1 Indexing by vectors:

If i is positive and exceeds length(x) then the corresponding selection is NA

But this applies only to "indexing of simple vectors". Similar out-of-bounds indexing for "non-simple" vectors does not seem to be described. Duncan Murdoch again:

So what is a simple vector? That is not explicitly defined, and it probably should be.

Thus, it may seem like no one knows the answer to your why question.
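That said, one observation is at least consistent with the data frame output, without claiming anything about intent: a data frame is a list of equal-length column vectors, and indexing each column past its end falls under the "simple vector" rule quoted above. A small sketch, using the `df` from the question (the names `per_column` and `row2` are made up for this example):

```r
df <- data.frame(foo = "foo", bar = "bar")

# A data frame is a list of column vectors
is.list(df)

# Indexing each column at a non-existent position yields NA
per_column <- lapply(df, function(col) col[2])
per_column

# The row returned by df[2, ] holds the same all-NA values
row2 <- df[2, ]
all(is.na(row2))
```

This shows that the result matches column-wise vector indexing; it does not show whether that match was designed or accidental.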


See also "8.2.13 nonexistent value in subscript" in the excellent R Inferno by Patrick Burns, and the section "Missing/out of bounds indices" in Hadley Wickham's book Advanced R.


*Source code for the [ subset operator. A search for R_MSG_subs_o_b (which corresponds to the error message "subscript out of bounds") gives no obvious clue as to why out-of-bounds [ indexing of matrices, and out-of-bounds indexing with [[, gives an error, whereas out-of-bounds [ indexing of "simple vectors" results in NA.
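The contrast described in this footnote can be checked directly. The sketch below uses a hypothetical helper, `raises_error`, defined only for this illustration; the objects are small stand-ins in the spirit of the question's examples.

```r
vec <- c("foo", "bar")
lst <- list("foo", "bar")
m   <- matrix(data = c("foo", "bar"), nrow = 1)

# Hypothetical helper: TRUE if evaluating `expr` signals an error
raises_error <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

res <- c(
  vec_single  = raises_error(vec[3]),    # `[` on an atomic vector: NA, no error
  lst_single  = raises_error(lst[3]),    # `[` on a list: list holding NULL, no error
  vec_double  = raises_error(vec[[3]]),  # `[[` on an atomic vector: error
  lst_double  = raises_error(lst[[3]]),  # `[[` on a list: error
  mat_row_col = raises_error(m[2, ]),    # matrix-style `[`: error
  mat_single  = raises_error(m[3])       # vector-style `[` on a matrix: NA, no error
)
res
```

Note the last case: indexing the same matrix with a single subscript treats it as a plain vector, so it returns NA instead of an error.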

answered Oct 21 '22 19:10 by Henrik