
Transposing identical objects

Tags: r, matrix, transpose

I got a weird result today.

To replicate it, consider the following data frames:

x <- data.frame(x=1:3, y=11:13)
y <- x[1:3, 1:2]

They are supposed to be identical, and identical() confirms that they are:

identical(x,y)
# [1] TRUE

Applying t() to identical objects should produce the same result, but:

identical(t(x),t(y))
# [1] FALSE

The difference is in the column names:

colnames(t(x))
# NULL
colnames(t(y))
# [1] "1" "2" "3"

Given this, if you want to stack y by columns, you get what you'd expect:

stack(as.data.frame(t(y)))
#   values ind
# 1      1   1
# 2     11   1
# 3      2   2
# 4     12   2
# 5      3   3
# 6     13   3

while:

stack(as.data.frame(t(x)))
#     values ind
# 1      1  V1
# 2     11  V1
# 3      2  V2
# 4     12  V2
# 5      3  V3
# 6     13  V3

In the latter case, as.data.frame() does not find the original column names and automatically generates them.

The culprit is in as.matrix(), called by t():

rownames(as.matrix(x))
# NULL
rownames(as.matrix(y))
# [1] "1" "2" "3"

A workaround is to set rownames.force:

rownames(as.matrix(x, rownames.force=TRUE))
# [1] "1" "2" "3"
rownames(as.matrix(y, rownames.force=TRUE))
# [1] "1" "2" "3"
identical(t(as.matrix(x, rownames.force=TRUE)), 
          t(as.matrix(y, rownames.force=TRUE)))
# [1] TRUE

(and rewrite stack(...) call accordingly.)
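
A minimal sketch of that rewrite, assuming the same x as above (the names m and stacked are only for illustration):

```r
x <- data.frame(x = 1:3, y = 11:13)

# Force row names during the conversion, so the transposed matrix
# gets column names instead of NULL
m <- t(as.matrix(x, rownames.force = TRUE))
stacked <- stack(as.data.frame(m))

# ind is built from the column names of the transposed data
levels(stacked$ind)
```

With the row names forced, ind should carry "1", "2", "3" rather than the generated V1, V2, V3.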

My questions are:

1. Why does as.matrix() treat x and y differently, and
2. how can you tell the difference between them?

Note that other info functions do not reveal differences between x, y:

identical(attributes(x), attributes(y))
# [1] TRUE
identical(str(x), str(y))
# ...
#[1] TRUE
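
One caveat about the second check, as far as I can tell: str() prints its summary and returns NULL invisibly, so identical(str(x), str(y)) compares NULL with NULL and is TRUE for any two objects. A hedged sketch that compares the printed summaries instead, using capture.output() (the names sx, sy and same_str are only illustrative):

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# str() returns NULL invisibly, so capture the printed text and compare that
sx <- capture.output(str(x))
sy <- capture.output(str(y))
same_str <- identical(sx, sy)
same_str
```

The printed summaries still match for x and y, so the point stands: str() does not expose the difference either.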

Comments to solutions

Konrad Rudolph gives a concise but effective explanation of the behaviour outlined above (see also mt1022 for more details).

In short, Konrad shows that:

a) x and y are internally different;
b) "identical is simply too lax by default" to catch this internal difference.

Now, if you take a subset T of a set S, and T has all the elements of S, then S and T are exactly the same object. So, if you take a data frame y that has all the rows and columns of x, then x and y should be exactly the same object. Unfortunately, x ≠ y!
This behaviour is not only counterintuitive but also obfuscated: the difference is not self-evident, only internal, and even the default identical function can't see it.

Another natural principle is that transposing two identical (matrix-like) objects produces identical objects. Again, this is broken by the fact that, before transposing, identical is "too lax"; after transposing, the default identical is enough to see the difference.

IMHO this behaviour (even if it is not a bug) is a misbehaviour for a scientific language like R.
Hopefully this post will draw some attention and the R team will consider revising it.


asked Apr 04 '17 14:04

antonio

2 Answers

identical is simply too lax by default, but you can change that:

> identical(x, y, attrib.as.set = FALSE)
[1] FALSE

The reason can be found by inspecting the objects in more detail:

> dput(x)
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA,
-3L), class = "data.frame")
> dput(y)
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA,
3L), class = "data.frame")

Note the distinct row.names attributes:

> .row_names_info(x)
[1] -3
> .row_names_info(y)
[1] 3

From the documentation we can glean that a negative number implies automatic rownames (for x), whereas y’s row names aren’t automatic. And as.matrix treats them differently.
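
As a hedged sketch of that check, the sign of .row_names_info() can be wrapped in a small helper. The name has_automatic_row_names is made up for this example and is not part of base R:

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# Hypothetical helper: TRUE when the row names are flagged as automatic,
# which .row_names_info() (default type = 1L) signals with a negative count
has_automatic_row_names <- function(df) .row_names_info(df) < 0L

has_automatic_row_names(x)
has_automatic_row_names(y)
```

Given the values shown above (-3 for x, 3 for y), the helper separates the two data frames even though identical(x, y) does not.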


answered Sep 20 '22 08:09

Konrad Rudolph

As noted in the comments, x and y are not strictly the same. When we call t on a data.frame, t.data.frame is executed:

function (x) 
{
    x <- as.matrix(x)
    NextMethod("t")
}

As we can see, it calls as.matrix, i.e. as.matrix.data.frame:

function (x, rownames.force = NA, ...) 
{
    dm <- dim(x)
    rn <- if (rownames.force %in% FALSE) 
        NULL
    else if (rownames.force %in% TRUE) 
        row.names(x)
    else if (.row_names_info(x) <= 0L) 
        NULL
    else row.names(x)
...

As commented by @oropendola, .row_names_info returns different values for x and y, and the function above is where that difference takes effect.
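
To look at that branch in isolation, here is a minimal sketch that copies only the row-name decision into a stand-alone function. The name matrix_row_names is hypothetical; the body mirrors the rn assignment quoted above and nothing else from as.matrix.data.frame:

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# Hypothetical helper reproducing only the `rn` logic quoted above
matrix_row_names <- function(df, rownames.force = NA) {
  if (rownames.force %in% FALSE) {
    NULL                        # caller explicitly asked for no row names
  } else if (rownames.force %in% TRUE) {
    row.names(df)               # caller explicitly asked for row names
  } else if (.row_names_info(df) <= 0L) {
    NULL                        # automatic row names are dropped
  } else {
    row.names(df)               # non-automatic row names are kept
  }
}

matrix_row_names(x)
matrix_row_names(y)
matrix_row_names(x, rownames.force = TRUE)
```

With the default rownames.force = NA, only the third branch distinguishes x from y, which matches the rownames(as.matrix(...)) results in the question.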

Then why does y have different row names? Let's look at [.data.frame; I have added comments at the key lines:

{
    ... # many lines of code
    xx <- x  #!! this is where xx is defined
    cols <- names(xx)
    x <- vector("list", length(x))
    x <- .Internal(copyDFattr(xx, x))  # This is where I am not sure about
    oldClass(x) <- attr(x, "row.names") <- NULL
    if (has.j) {
        nm <- names(x)
        if (is.null(nm)) 
            nm <- character()
        if (!is.character(j) && anyNA(nm)) 
            names(nm) <- names(x) <- seq_along(x)
        x <- x[j]
        cols <- names(x)
        if (drop && length(x) == 1L) {
            if (is.character(i)) {
                rows <- attr(xx, "row.names")
                i <- pmatch(i, rows, duplicates.ok = TRUE)
            }
            xj <- .subset2(.subset(xx, j), 1L)
            return(if (length(dim(xj)) != 2L) xj[i] else xj[i, , drop = FALSE])
        }
        if (anyNA(cols)) 
            stop("undefined columns selected")
        if (!is.null(names(nm))) 
            cols <- names(x) <- nm[cols]
        nxx <- structure(seq_along(xx), names = names(xx))
        sxx <- match(nxx[j], seq_along(xx))
    }
    else sxx <- seq_along(x)
    rows <- NULL ## this is where rows is defined, as we give numeric i, the following
    ## if block will not be executed
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    for (j in seq_along(x)) {
        xj <- xx[[sxx[j]]]
        x[[j]] <- if (length(dim(xj)) != 2L) 
            xj[i]
        else xj[i, , drop = FALSE]
    }
    if (drop) {
        n <- length(x)
        if (n == 1L) 
            return(x[[1L]])
        if (n > 1L) {
            xj <- x[[1L]]
            nrow <- if (length(dim(xj)) == 2L) 
                dim(xj)[1L]
            else length(xj)
            drop <- !mdrop && nrow == 1L
        }
        else drop <- FALSE
    }
    if (!drop) { ## drop is False for our case
        if (is.null(rows)) 
            rows <- attr(xx, "row.names")  ## rows changed from NULL to 1,2,3 here
        rows <- rows[i]
        if ((ina <- anyNA(rows)) | (dup <- anyDuplicated(rows))) {
            if (!dup && is.character(rows)) 
                dup <- "NA" %in% rows
            if (ina) 
                rows[is.na(rows)] <- "NA"
            if (dup) 
                rows <- make.unique(as.character(rows))
        }
        if (has.j && anyDuplicated(nm <- names(x))) 
            names(x) <- make.unique(nm)
        if (is.null(rows)) 
            rows <- attr(xx, "row.names")[i]
        attr(x, "row.names") <- rows  ## this is where the rownames of x changed
        oldClass(x) <- oldClass(xx)
    }
    x
}

We can see that y gets its row names from something like attr(x, 'row.names'):

> attr(x, 'row.names')
[1] 1 2 3

So when we create y with [.data.frame, it receives a row.names attribute that differs from that of x, whose row.names are automatic and are indicated by a negative sign in the dput output.
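
That last step can be reproduced by hand. A sketch, assuming the attribute assignment behaves as in the walk-through above (z and rows are just illustrative names):

```r
x <- data.frame(x = 1:3, y = 11:13)

z <- x
# Read the row names back as a plain integer vector and subset them,
# as [.data.frame does with rows <- rows[i]
rows <- attr(x, "row.names")[1:3]
# Writing them back stores them as explicit (non-automatic) row names
attr(z, "row.names") <- rows

.row_names_info(x)  # negative count: automatic
.row_names_info(z)  # positive count: not automatic
```

So the "automatic" flag is lost on the round trip through attr(), even though the visible row names are unchanged.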

edit

Actually, this is stated in the manual page for row.names:

Note

row.names is similar to rownames for arrays, and it has a method that calls rownames for an array argument.

Row names of the form 1:n for n > 2 are stored internally in a compact form, which might be seen from C code or by deparsing but never via row.names or attr(x, "row.names"). Additionally, some names of this sort are marked as ‘automatic’ and handled differently by as.matrix and data.matrix (and potentially other functions).

So attr doesn't discriminate between automatic row.names (like those of x) and explicit integer row.names (like those of y), whereas as.matrix does, through the internal representation reported by .row_names_info.
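
If the goal is just to make y behave like x again, one option (a sketch, not something proposed in the answers above) is to assign NULL to the row names, which the row.names help page describes as resetting them to automatic ones. The names y2 and same_t are only illustrative:

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

y2 <- y
row.names(y2) <- NULL   # request automatic row names again

.row_names_info(y2)
same_t <- identical(t(x), t(y2))
same_t
```

After the reset, as.matrix() drops the row names of y2 just as it does for x, so the transposes should agree.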

answered Sep 19 '22 08:09

mt1022
