"subset" and "[" on dataframe give slightly different results, why?

Tags: dataframe, r, matrix, subset, rowname

Could someone explain why I get different results from the last two apply() calls below, which both use identical()? The two data frames seem to be identical, but when I use them in an apply() call, I get different results:

df <- data.frame(a = 1:5, b = 6:2, c = rep(7,5))
df_ab <- df[,c(1,2)]
df_AB <- subset(df, select = c(1,2))
identical(df_ab,df_AB)
[1] TRUE

apply(df_ab,2,function(x) identical(1:5,x))
    a     b 
TRUE FALSE

apply(df_AB,2,function(x) identical(1:5,x))
    a     b 
FALSE FALSE

asked Oct 20 '14 14:10

mbh86

4 Answers

The apply() function coerces its first argument to a matrix before calling the function on each column. So your data frames are coerced to matrix objects. A consequence of that conversion is that as.matrix(df_AB) has non-null rownames, while as.matrix(df_ab) does not:

> str(as.matrix(df_ab))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df_AB))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"

So when apply() extracts a column from the matrix built from df_AB, you get a named vector, which is not identical to an unnamed vector.

apply(df_AB, 2, str)
 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL
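
As a minimal sketch of that last point (the object names here are illustrative and not from the original post): identical() compares attributes as well as values, so a names attribute alone is enough to make the comparison fail, and dropping the names with unname() makes it succeed.

```r
x <- 1:5                                      # plain integer vector
named_x <- setNames(1:5, as.character(1:5))   # same values plus names "1".."5"

same_with_names    <- identical(x, named_x)          # names attribute differs
same_without_names <- identical(x, unname(named_x))  # names dropped before comparing

print(c(with_names = same_with_names, without_names = same_without_names))
```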

The difference comes from the subset() function, which selects rows using a logical vector for the value of i, whereas df[, c(1,2)] leaves i missing. And it looks like subsetting a data.frame with a non-missing value for i causes this difference in the row.names attribute:

> str(as.matrix(df[1:5, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df[, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"

You can see all the gory details of the difference between the data.frames using .Internal(inspect(x)). You can look at those yourself, if you're interested.

As Roland pointed out in his comments, you can use the .row_names_info() function to see the differences in only the row names.

Notice that when i is missing, the result of .row_names_info() is negative, but it is positive if you subset with a non-missing i.

> .row_names_info(df_ab, type=1)
[1] -5
> .row_names_info(df_AB, type=1)
[1] 5

What these values mean is explained in ?.row_names_info:

type: integer.  Currently ‘type = 0’ returns the internal
      ‘"row.names"’ attribute (possibly ‘NULL’), ‘type = 2’ the
      number of rows implied by the attribute, and ‘type = 1’ the
      latter with a negative sign for ‘automatic’ row names.
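
If the goal is to make df_AB behave like df_ab under apply(), one option is to reset its row names to automatic ones by assigning NULL. This is only a sketch and not part of the original answer; df_reset is a name introduced here for illustration.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

df_reset <- df_AB
rownames(df_reset) <- NULL   # assigning NULL requests automatic row names

info <- .row_names_info(df_reset, type = 1L)   # a negative value means automatic
res  <- apply(df_reset, 2, function(x) identical(1:5, x))

print(info)
print(res)
```

After the reset, the matrix that apply() builds has no row names, so the columns reach the function as unnamed vectors.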

answered Oct 08 '22 05:10

Joshua Ulrich

If you want to compare the values 1:5 with the values in the columns, you should not use apply, since apply transforms the data frames to matrices before the function is applied. Due to the row names in the data frame created with subset(), which calls [ with a non-missing i (see @Joshua Ulrich's answer), the matrix columns become named vectors, and 1:5 is not identical to a named vector containing the same values.

You should instead use sapply to apply the identical function to the columns. This avoids transforming the data frames to matrices:

> sapply(df_ab, identical, 1:5)
    a     b 
 TRUE FALSE 
> sapply(df_AB, identical, 1:5)
    a     b 
 TRUE FALSE

As you can see, in both data frames the values in the first column are identical to 1:5.
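
This works because a data frame is a list of its columns, so each column reaches identical() as a plain vector. A hedged variant, assuming you also want the type of the result checked, is vapply(), where logical(1) states that every call must return a single logical value. The names res_ab and res_AB are introduced here for illustration.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_ab <- df[, c(1, 2)]
df_AB <- subset(df, select = c(1, 2))

# Each column is passed to identical() as a plain vector; no matrix coercion.
# The trailing 1:5 is forwarded to identical() as its second argument.
res_ab <- vapply(df_ab, identical, logical(1), 1:5)
res_AB <- vapply(df_AB, identical, logical(1), 1:5)

print(res_ab)
print(res_AB)
```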

answered Oct 08 '22 04:10

Sven Hohenstein

In one version (using [) the columns that apply() passes to your function are plain integers, while in the other version (using subset) they are named integers.

apply(df_ab, 2, str)

 int [1:5] 1 2 3 4 5
 int [1:5] 6 5 4 3 2
NULL


apply(df_AB, 2, str)

 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL

answered Oct 08 '22 04:10

Andrie

Looking at the structure of those two objects before they get passed to apply shows only one difference, in the row names, but not one that I would have expected to produce the difference you are seeing. I do not see Joshua's current offer of 'subset' as logical indexing as explaining this. Why row.names = c(NA, 5L) produces a named result when extracting with "[" is as yet unexplained.

> dput(df_AB)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), row.names = c(NA, 5L), class = "data.frame")
> dput(df_ab)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), class = "data.frame", row.names = c(NA, -5L))

I do agree that it is the as.matrix coercion which needs further investigation:

> attributes(df_AB[,1])
NULL
> attributes(df_ab[,1])
NULL
> attributes(as.matrix(df_AB)[,1])
$names
[1] "1" "2" "3" "4" "5"
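
One way to probe the coercion step is sketched below. It assumes that the documented rownames.force argument of as.matrix() for data frames is what decides whether row names are carried over. Forcing it to TRUE or FALSE on the same data frame reproduces both outcomes; m_forced and m_plain are names introduced here for illustration.

```r
df <- data.frame(a = 1:5, b = 6:2)

m_forced <- as.matrix(df, rownames.force = TRUE)    # always carry row names over
m_plain  <- as.matrix(df, rownames.force = FALSE)   # never carry row names over

attrs_forced <- attributes(m_forced[, 1])   # names taken from the row names
attrs_plain  <- attributes(m_plain[, 1])    # no attributes at all

print(attrs_forced)
print(attrs_plain)
```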

answered Oct 08 '22 03:10

IRTFM
