I got a weird result today.
To replicate it, consider the following data frames:
x <- data.frame(x=1:3, y=11:13)
y <- x[1:3, 1:2]
They are supposed to be and actually are identical:
identical(x,y)
# [1] TRUE
Applying t() to identical objects should produce the same result, but:
identical(t(x),t(y))
# [1] FALSE
The difference is in the column names:
colnames(t(x))
# NULL
colnames(t(y))
# [1] "1" "2" "3"
Given this, if you want to stack y by columns, you get what you'd expect:
stack(as.data.frame(t(y)))
# values ind
# 1 1 1
# 2 11 1
# 3 2 2
# 4 12 2
# 5 3 3
# 6 13 3
while:
stack(as.data.frame(t(x)))
# values ind
# 1 1 V1
# 2 11 V1
# 3 2 V2
# 4 12 V2
# 5 3 V3
# 6 13 V3
In the latter case, as.data.frame() does not find the original column names and generates them automatically.
The culprit is as.matrix(), which is called by t():
rownames(as.matrix(x))
# NULL
rownames(as.matrix(y))
# [1] "1" "2" "3"
A workaround is to set rownames.force = TRUE:
rownames(as.matrix(x, rownames.force=TRUE))
# [1] "1" "2" "3"
rownames(as.matrix(y, rownames.force=TRUE))
# [1] "1" "2" "3"
identical(t(as.matrix(x, rownames.force=TRUE)),
t(as.matrix(y, rownames.force=TRUE)))
# [1] TRUE
(and rewrite the stack(...) call accordingly.)
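For completeness, the rewritten stack() call mentioned above might look like the following sketch. It only uses the data frame x from the question; the intermediate names m and res are introduced here purely for illustration.

```r
x <- data.frame(x = 1:3, y = 11:13)

# Force as.matrix() to keep the row names, then transpose the matrix
m <- t(as.matrix(x, rownames.force = TRUE))
colnames(m)
# [1] "1" "2" "3"

# The stacked columns now carry the original row names instead of V1, V2, V3
res <- stack(as.data.frame(m))
levels(res$ind)
# [1] "1" "2" "3"
```

Calling as.matrix() explicitly avoids depending on whether the row names of the input happen to be automatic.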
My questions are:
Why does as.matrix() treat x and y differently, and how can you tell the difference between them?
Note that other info functions do not reveal any difference between x and y:
identical(attributes(x), attributes(y))
# [1] TRUE
identical(str(x), str(y))
# ...
#[1] TRUE
Konrad Rudolph gives a concise but effective explanation of the behaviour outlined above (see also mt1022 for more details).
In short, Konrad shows that:
a) x and y are internally different;
b) "identical is simply too lax by default" to catch this internal difference.
Now, if you take a subset T of a set S, and T has all the elements of S, then S and T are exactly the same object. So, if you take a data frame y that has all the rows and columns of x, then x and y should be exactly the same object. Unfortunately, x ≠ y!
This behaviour is not only counterintuitive but also obscured: the difference is not self-evident, it is purely internal, and even the default identical function can't see it.
Another natural principle is that transposing two identical (matrix-like) objects produces identical objects. Again, this is broken: before transposing, identical is "too lax" to see the difference; after transposing, the default identical is enough to see it.
IMHO this behaviour (even if it is not a bug) is a misbehaviour for a scientific language like R.
Hopefully this post will draw some attention and the R team will consider revising it.
identical is simply too lax by default, but you can change that:
> identical(x, y, attrib.as.set = FALSE)
[1] FALSE
The reason can be found by inspecting the objects in more detail:
> dput(x)
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA,
-3L), class = "data.frame")
> dput(y)
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA,
3L), class = "data.frame")
Note the distinct row.names attributes:
> .row_names_info(x)
[1] -3
> .row_names_info(y)
[1] 3
From the documentation we can glean that a negative number implies automatic row names (for x), whereas y’s row names aren’t automatic. And as.matrix treats them differently.
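If the explicit row names of y are not needed, one way to make it behave like x again is to reset them. The sketch below assumes the documented behaviour that assigning NULL to the row names of a data frame restores automatic row names.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# Assigning NULL restores automatic row names
rownames(y) <- NULL

# A negative value marks the row names as automatic
.row_names_info(y)
# [1] -3

# as.matrix() now drops the row names of both, so the transposes agree
identical(t(x), t(y))
# [1] TRUE
```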
As noted in the comments, x and y are not strictly the same. When we call t on a data.frame, t.data.frame is executed:
function (x)
{
    x <- as.matrix(x)
    NextMethod("t")
}
As we can see, it calls as.matrix, i.e. as.matrix.data.frame:
function (x, rownames.force = NA, ...)
{
    dm <- dim(x)
    rn <- if (rownames.force %in% FALSE)
        NULL
    else if (rownames.force %in% TRUE)
        row.names(x)
    else if (.row_names_info(x) <= 0L)
        NULL
    else row.names(x)
    ...
As commented by @oropendola, .row_names_info returns different values for x and y, and the function above is where that difference takes effect.
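To make the branching easier to follow, the row-name decision in the excerpt can be restated as a small standalone function. This is only a sketch: matrix_rownames is a hypothetical helper, not part of base R, and it covers only the lines quoted above (it uses isTRUE/isFALSE, so it assumes R 3.5.0 or later).

```r
# Hypothetical helper: returns the row names that as.matrix() would
# attach to the result, following the quoted excerpt
matrix_rownames <- function(df, rownames.force = NA) {
  if (isFALSE(rownames.force)) {
    NULL                 # row names are never kept
  } else if (isTRUE(rownames.force)) {
    row.names(df)        # row names are always kept
  } else if (.row_names_info(df) <= 0L) {
    NULL                 # automatic row names are dropped
  } else {
    row.names(df)        # explicit row names are kept
  }
}

x <- data.frame(x = 1:3, y = 11:13)
matrix_rownames(x)
# NULL
matrix_rownames(x, rownames.force = TRUE)
# [1] "1" "2" "3"
```

With the default rownames.force = NA, the result depends entirely on the sign returned by .row_names_info.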
Then why does y have different row names? Let's look at [.data.frame; I have added comments at the key lines:
{
    ... # many lines of code
    xx <- x  #!! this is where xx is defined
    cols <- names(xx)
    x <- vector("list", length(x))
    x <- .Internal(copyDFattr(xx, x))  # This is where I am not sure about
    oldClass(x) <- attr(x, "row.names") <- NULL
    if (has.j) {
        nm <- names(x)
        if (is.null(nm))
            nm <- character()
        if (!is.character(j) && anyNA(nm))
            names(nm) <- names(x) <- seq_along(x)
        x <- x[j]
        cols <- names(x)
        if (drop && length(x) == 1L) {
            if (is.character(i)) {
                rows <- attr(xx, "row.names")
                i <- pmatch(i, rows, duplicates.ok = TRUE)
            }
            xj <- .subset2(.subset(xx, j), 1L)
            return(if (length(dim(xj)) != 2L) xj[i] else xj[i,
                , drop = FALSE])
        }
        if (anyNA(cols))
            stop("undefined columns selected")
        if (!is.null(names(nm)))
            cols <- names(x) <- nm[cols]
        nxx <- structure(seq_along(xx), names = names(xx))
        sxx <- match(nxx[j], seq_along(xx))
    }
    else sxx <- seq_along(x)
    rows <- NULL  ## this is where rows is defined; as we give a numeric i,
                  ## the following if block will not be executed
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    for (j in seq_along(x)) {
        xj <- xx[[sxx[j]]]
        x[[j]] <- if (length(dim(xj)) != 2L)
            xj[i]
        else xj[i, , drop = FALSE]
    }
    if (drop) {
        n <- length(x)
        if (n == 1L)
            return(x[[1L]])
        if (n > 1L) {
            xj <- x[[1L]]
            nrow <- if (length(dim(xj)) == 2L)
                dim(xj)[1L]
            else length(xj)
            drop <- !mdrop && nrow == 1L
        }
        else drop <- FALSE
    }
    if (!drop) {  ## drop is FALSE in our case
        if (is.null(rows))
            rows <- attr(xx, "row.names")  ## rows changes from NULL to 1,2,3 here
        rows <- rows[i]
        if ((ina <- anyNA(rows)) | (dup <- anyDuplicated(rows))) {
            if (!dup && is.character(rows))
                dup <- "NA" %in% rows
            if (ina)
                rows[is.na(rows)] <- "NA"
            if (dup)
                rows <- make.unique(as.character(rows))
        }
        if (has.j && anyDuplicated(nm <- names(x)))
            names(x) <- make.unique(nm)
        if (is.null(rows))
            rows <- attr(xx, "row.names")[i]
        attr(x, "row.names") <- rows  ## this is where the row names of x change
        oldClass(x) <- oldClass(xx)
    }
    x
}
We can see that y gets its row names from something like attr(x, 'row.names'):
> attr(x, 'row.names')
[1] 1 2 3
So when we create y with [.data.frame, it receives a row.names attribute that differs from that of x, whose row.names are automatic and are indicated by a negative sign in the dput output.
Actually, this is stated in the manual page of row.names:
Note
row.names is similar to rownames for arrays, and it has a method that calls rownames for an array argument.
Row names of the form 1:n for n > 2 are stored internally in a compact form, which might be seen from C code or by deparsing but never via row.names or attr(x, "row.names"). Additionally, some names of this sort are marked as ‘automatic’ and handled differently by as.matrix and data.matrix (and potentially other functions).
So attr doesn't discriminate between automatic row.names (like those of x) and explicit integer row.names (like those of y), whereas as.matrix does, through the internal representation exposed by .row_names_info.
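A short sketch of this contrast, using the same x and y as in the question: the usual accessors agree, while .row_names_info with type = 0L returns the row names as they are stored internally. The value for y is left without an expected output here; per the dput output shown earlier it should be the compact form with a positive count.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# attr() expands the compact form, so both data frames look the same
identical(attr(x, "row.names"), attr(y, "row.names"))
# [1] TRUE

# type = 0L returns the internal representation instead
.row_names_info(x, type = 0L)
# [1] NA -3
.row_names_info(y, type = 0L)
```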