
"subset" and "[" on dataframe give slightly different results, why?

Could someone explain to me why I get different results in the two apply() calls below? The two objects seem to be identical, but when I use them in an apply function, I run into trouble:

df <- data.frame(a = 1:5, b = 6:2, c = rep(7,5))
df_ab <- df[,c(1,2)]
df_AB <- subset(df, select = c(1,2))
identical(df_ab,df_AB)
[1] TRUE

apply(df_ab,2,function(x) identical(1:5,x))
    a     b 
TRUE FALSE

apply(df_AB,2,function(x) identical(1:5,x))
    a     b 
FALSE FALSE
mbh86 asked Oct 20 '14 14:10



4 Answers

The apply() function coerces its first argument to a matrix before calling the function on each column. So your data frames are coerced to matrix objects. A consequence of that conversion is that as.matrix(df_AB) has non-null rownames, while as.matrix(df_ab) does not:

> str(as.matrix(df_ab))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df_AB))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"

So when apply() extracts a column from the matrix built from df_AB, you get a named vector, which is not identical to an unnamed vector.
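A minimal sketch of why the names matter: identical() compares attributes as well as values, so a names attribute alone is enough to make it return FALSE. The variable names here are illustrative only.

```r
plain <- 1:5
named <- setNames(1:5, as.character(1:5))  # same values, plus a names attribute

identical(plain, named)          # FALSE: the names attribute differs
identical(plain, unname(named))  # TRUE once the names are dropped
all(plain == named)              # TRUE: the values themselves are equal
```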

apply(df_AB, 2, str)
 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL

The cause is the subset() function, which selects rows using a logical vector for the value of i. Subsetting a data.frame with a non-missing value for i appears to cause this difference in the row.names attribute:

> str(as.matrix(df[1:5, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df[, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
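As a rough illustration of that point, the call below mimics what subset() does when no row condition is given, by passing an all-TRUE logical vector as i. This is a sketch based on the description above, not a copy of the subset() source; by_logical and by_subset are illustrative names.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))

# Explicit, non-missing i: keep every row via a logical vector
by_logical <- df[rep_len(TRUE, nrow(df)), c(1, 2), drop = FALSE]
by_subset  <- subset(df, select = c(1, 2))

# Compare how the row names are stored internally in each case
.row_names_info(df[, c(1, 2)], type = 1L)
.row_names_info(by_logical, type = 1L)
.row_names_info(by_subset, type = 1L)
```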

You can see all the gory details of the difference between the data.frames using .Internal(inspect(x)), if you're interested.

As Roland pointed out in his comments, you can use the .row_names_info() function to see the differences in only the row names.

Notice that when i is missing, the result of .row_names_info() is negative, but it is positive if you subset with a non-missing i.

> .row_names_info(df_ab, type=1)
[1] -5
> .row_names_info(df_AB, type=1)
[1] 5

What these values mean is explained in ?.row_names_info:

type: integer.  Currently ‘type = 0’ returns the internal
      ‘"row.names"’ attribute (possibly ‘NULL’), ‘type = 2’ the
      number of rows implied by the attribute, and ‘type = 1’ the
      latter with a negative sign for ‘automatic’ row names.
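To see the sign convention in isolation, here is a small sketch that compares a data frame with automatic row names against one with explicit row names. The data frames auto_df and named_df are made up for illustration.

```r
auto_df  <- data.frame(a = 1:3)                                # automatic row names
named_df <- data.frame(a = 1:3, row.names = c("x", "y", "z"))  # explicit row names

.row_names_info(auto_df, type = 1L)   # -3: negative sign marks automatic row names
.row_names_info(named_df, type = 1L)  #  3
.row_names_info(auto_df, type = 2L)   #  3: type = 2 always gives the row count

# as.matrix() keeps only the non-automatic row names
rownames(as.matrix(auto_df))   # NULL
rownames(as.matrix(named_df))  # "x" "y" "z"
```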
Joshua Ulrich answered Oct 08 '22 05:10


If you want to compare the values 1:5 with the values in the columns, you should not use apply, since apply transforms the data frame to a matrix before the function is applied. Because of the row names in the data frame created with subset() (see @Joshua Ulrich's answer), each column arrives as a named vector, and 1:5 is not identical to a named vector containing the same values.

You should instead use sapply to apply the identical function to the columns. This avoids transforming the data frames to matrices:

> sapply(df_ab, identical, 1:5)
    a     b 
 TRUE FALSE 
> sapply(df_AB, identical, 1:5)
    a     b 
 TRUE FALSE 

As you can see, in both data frames the values in the first column are identical to 1:5.
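A small sketch contrasting the two approaches on a data frame that has explicit row names, so the effect does not depend on how the data frame was subset. The data frame and its row names are made up for illustration; vapply() is used here as a stricter variant of sapply().

```r
df <- data.frame(a = 1:5, b = 6:2, row.names = letters[1:5])

# apply() goes through as.matrix(), so each column arrives as a named vector
via_apply <- apply(df, 2, function(x) identical(1:5, x))

# vapply() iterates over the list of columns directly; no matrix coercion
via_vapply <- vapply(df, identical, logical(1), 1:5)

via_apply   # a: FALSE, b: FALSE
via_vapply  # a: TRUE,  b: FALSE
```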

Sven Hohenstein answered Oct 08 '22 04:10


In one version (using [), apply() sees your columns as plain integers, while in the other version (using subset), it sees them as named integers.

apply(df_ab, 2, str)

 int [1:5] 1 2 3 4 5
 int [1:5] 6 5 4 3 2
NULL


apply(df_AB, 2, str)

 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL
Andrie answered Oct 08 '22 04:10


Looking at the structure of those two objects before they are passed to apply shows only one difference, in the row names, but not one that I would have expected to produce the difference you are seeing. I do not see Joshua's current explanation of 'subset' as logical indexing as accounting for this. Why row.names = c(NA, 5L), rather than c(NA, -5L), produces a named result when a column is extracted with "[" after the matrix coercion is as yet unexplained.

> dput(df_AB)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), row.names = c(NA, 5L), class = "data.frame")
> dput(df_ab)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), class = "data.frame", row.names = c(NA, -5L))

I do agree that it is the as.matrix coercion which needs further investigation:

> attributes(df_AB[,1])
NULL
> attributes(df_ab[,1])
NULL
> attributes(as.matrix(df_AB)[,1])
$names
[1] "1" "2" "3" "4" "5"
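If the matrix coercion itself is the concern, one option is to do it explicitly and drop the row names first. The sketch below relies on the rownames.force argument of as.matrix() for data frames; the example data frame is made up for illustration.

```r
df <- data.frame(a = 1:5, b = 6:2, row.names = letters[1:5])

m <- as.matrix(df, rownames.force = FALSE)  # matrix without row names
attributes(m[, 1])                          # NULL: the column is unnamed

res <- apply(m, 2, function(x) identical(1:5, x))
res  # a: TRUE, b: FALSE
```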
IRTFM answered Oct 08 '22 03:10