
Subset a dataframe using a logical vector with $

I'm having trouble understanding both the reason for use and behavior of the $ symbol in subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (not with a live professor so can't ask there):

temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)

Calling temp_df obviously outputs:

  a b c
1 1 4 7
2 2 5 8
3 3 6 9

The example given in the course is then:

temp_df[temp_df$c < 10]

Which outputs:

  a b c
1 1 4 7
2 2 5 8
3 3 6 9

Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that in our temp_df there is no other column similar to "c" that could be mistakenly picked up? Additionally, how is partial match measured? A minimum % of characters matching or something? It appears there is a getElement function that would be much more appropriate if working with datasets with unknown or similar column names (e.g. Home Phone versus Cell Phone, would these be seen as a valid partial match?)

Behavior question: it appears the above example temp_df[temp_df$c < 10] is saying "return the subset of elements from temp_df where column c is less than 10" and because all column c elements meet the criteria, the entire dataframe is returned. My interpretation is obviously wrong because temp_df[temp_df$c < 9] returns:

  a b
1 1 4
2 2 5
3 3 6

Although the row 1 and 2 elements in column c do meet the criteria of being less than 9, the entire column is omitted. My question then becomes twofold: what is that logical vector actually saying/doing? And how would I write my interpretation of "return the subset of elements from temp_df where column c is less than 9" and have it return:

  a b c
1 1 4 7
2 2 5 8

Because in my mind, elements 1 and 2 (rows 1 and 2) met that criterion, as their column c values are less than 9, and thus should be returned.

Richard Golz asked Mar 27 '18 18:03



2 Answers

Try breaking down the operation in steps.

temp_df$c < 9

gives a vector as follows:

[1]  TRUE  TRUE FALSE

When you pass this vector without a comma, as in temp_df[c(TRUE, TRUE, FALSE)], it is used to select columns, not rows.

Think about a data.frame as a list, with column names as the keys and the column contents as vector values. The operation preserves the TRUE keys (i.e. columns) and drops the FALSE.
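A minimal sketch of the list analogy, assuming a hypothetical plain list named temp_list (not part of the question) that holds the same three columns:

```r
# Hypothetical list holding the same columns as the question's data frame
temp_list <- list(a = 1:3, b = 4:6, c = 7:9)
temp_df <- as.data.frame(temp_list)

# A data frame is a list of equal-length column vectors
is.list(temp_df)

# Single-bracket indexing with a logical vector keeps the TRUE elements:
# list elements in one case, columns in the other
kept_list <- temp_list[c(TRUE, TRUE, FALSE)]
kept_df <- temp_df[c(TRUE, TRUE, FALSE)]

names(kept_list)  # "a" "b"
names(kept_df)    # "a" "b"
```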

Adding a comma after the vector marks it as a row index instead: the first two rows are retained and the last one is dropped. Thus temp_df[temp_df$c < 9, ], which is the same as temp_df[c(TRUE, TRUE, FALSE), ], gives:

  a b c
1 1 4 7
2 2 5 8
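
The two forms can be compared side by side on the question's data. This is only a sketch; the names keep, cols_kept, rows_kept, and all_cols are made up for illustration:

```r
# Rebuild the data frame from the question
temp_mat <- matrix(1:9, nrow = 3)
colnames(temp_mat) <- c("a", "b", "c")
temp_df <- data.frame(temp_mat)

keep <- temp_df$c < 9        # TRUE TRUE FALSE

cols_kept <- temp_df[keep]   # no comma: keeps columns a and b, all rows
rows_kept <- temp_df[keep, ] # with comma: keeps rows 1 and 2, all columns

# With < 10 the vector is TRUE TRUE TRUE, so every column is kept.
# That is why the course example appeared to work as a row filter.
all_cols <- temp_df[temp_df$c < 10]

print(cols_kept)
print(rows_kept)
```

The trailing comma with an empty column slot means "all columns", so only the row condition applies.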
Sun Bee answered Oct 21 '22 04:10



Both $ and [[ are extraction operators that let you extract elements by name.

OP has also asked about the behavior of the exact argument. The exact argument of the [[ operator is documented in R's help page (?Extract) as:

Controls possible partial matching of [[ when extracting by a character vector (for most objects, but see under ‘Environments’). The default is no partial matching. Value NA allows partial matching but issues a warning when it occurs. Value FALSE allows partial matching without any warning.

What does it mean? To understand its behavior, let's change the column names of the data.frame used by OP to:

names(temp_df) <- c("aa","bb","cc")

#partial name of column will work with exact = FALSE
temp_df[["a", exact = FALSE]]
#[1] 1 2 3
#partial name of column will not work with exact = TRUE
temp_df[["a", exact = TRUE]]
#NULL
temp_df[["a", exact = NA]]
#[1] 1 2 3
#Warning message:
#In .subset2(x, i, exact = exact) : partial match of 'a' to 'aa' 
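
On the question of how a partial match is measured: with exact = FALSE, [[ looks for a unique prefix match, not a percentage of matching characters. A minimal sketch, using a hypothetical phones data frame that is not part of the original question:

```r
# Hypothetical data frame with similar column names
phones <- data.frame(
  home_phone = c("555-0100", "555-0101"),
  home_fax   = c("555-0200", "555-0201"),
  cell_phone = c("555-0300", "555-0301"),
  stringsAsFactors = FALSE
)

# "cell" is a prefix of exactly one column name, so it matches
unique_prefix <- phones[["cell", exact = FALSE]]

# "home" is a prefix of two column names, so the match is ambiguous: NULL
ambiguous <- phones[["home", exact = FALSE]]

# "phone" appears inside the names but is not a prefix of any: NULL
not_prefix <- phones[["phone", exact = FALSE]]

# The default (exact = TRUE) only accepts the full name
full_name <- phones[["cell_phone"]]
```

When column names are unknown or similar, spelling out the full name with [[ (or getElement, as the question suggests) avoids relying on prefix matching at all.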
MKR answered Oct 21 '22 03:10
