I'm having trouble understanding both the reason for using the $ symbol and its behavior when subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (there is no live professor, so I can't ask there):
temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)
Calling temp_df obviously outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
The example given in the course is then:
temp_df[temp_df$c < 10]
Which outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that in our temp_df there is no other column similar to "c" that could be mistakenly picked up? Additionally, how is a partial match measured? A minimum % of characters matching or something? It appears there is a getElement function that would be much more appropriate when working with datasets with unknown or similar column names (e.g. Home Phone versus Cell Phone, would these be seen as a valid partial match?)
Behavior question: it appears the above example temp_df[temp_df$c < 10] is saying "return the subset of elements from temp_df where column c is less than 10" and because all column c elements meet the criteria, the entire dataframe is returned. My interpretation is obviously wrong because temp_df[temp_df$c < 9] returns:
a b
1 1 4
2 2 5
3 3 6
Although the row 1 and 2 elements in column c do meet the criteria of being less than 9, the entire column is omitted. My question then becomes twofold: what is that logical vector actually saying/doing? And how would I write my interpretation of "return the subset of elements from temp_df where column c is less than 9" and have it return:
a b c
1 1 4 7
2 2 5 8
Because in my mind, elements 1 and 2 (rows 1 and 2) met that criteria as their column c values are less than 9 and thus should be returned.
Try breaking the operation down into steps. temp_df$c < 9 gives the following logical vector:
[1] TRUE TRUE FALSE
When you pass this vector the way you have shown, with no comma inside the brackets, the call is equivalent to temp_df[c(TRUE, TRUE, FALSE)], which operates on columns.
Think of a data.frame as a list, with column names as the keys and the column contents as vector values. The operation keeps the columns whose position in the logical vector is TRUE and drops those whose position is FALSE, which is why column c disappears.
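The list view above can be checked directly. This is a minimal sketch that rebuilds the data frame from the question; the helper names keep, n_cols, by_flag and by_name are only illustrative.

```r
# Rebuild the example data frame from the question.
temp_df <- data.frame(matrix(1:9, nrow = 3,
                             dimnames = list(NULL, c("a", "b", "c"))))

keep <- temp_df$c < 9             # logical vector: TRUE TRUE FALSE

# A data frame is a list of columns, so its length is the column count.
n_cols <- length(temp_df)         # 3

# With no comma, the logical vector is lined up against columns a, b, c.
by_flag <- temp_df[keep]
by_name <- temp_df[c("a", "b")]   # the same selection, written by column name

names(by_flag)                    # "a" "b"
```

The logical vector has three entries only because column c happens to have three rows, the same as the number of columns; it was computed from row values but is being applied to column positions.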
To filter rows instead, add a comma after the vector. The comma marks the vector as a row index, so the first two rows are retained and the last one is dropped. Thus, temp_df[c(TRUE, TRUE, FALSE), ] gives:
a b c
1 1 4 7
2 2 5 8
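Putting the condition and the comma together gives the row filter the question asks for. The sketch below also shows two base R alternatives, subset() and which(), which are not part of the original example and are included only for comparison; the result names are illustrative.

```r
temp_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Comma after the condition: index rows, keep every column.
rows_lt9 <- temp_df[temp_df$c < 9, ]

# Same filter with subset(), which looks up c inside the data frame.
rows_subset <- subset(temp_df, c < 9)

# Same filter using the row positions where the condition is TRUE.
rows_which <- temp_df[which(temp_df$c < 9), ]

nrow(rows_lt9)    # 2
names(rows_lt9)   # "a" "b" "c"
```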
Both $ and [[ are extraction operators, which let you extract elements by name.
The OP has also asked about the behavior of the exact argument. The exact argument of the [[ operator is documented in the R help page ?Extract (as displayed in RStudio) as:
Controls possible partial matching of [[ when extracting by a character vector (for most objects, but see under ‘Environments’). The default is no partial matching. Value NA allows partial matching but issues a warning when it occurs. Value FALSE allows partial matching without any warning.
What does it mean? To understand its behavior, let's change the column names of the data.frame used by the OP as follows:
names(temp_df) <- c("aa","bb","cc")
#partial name of column will work with exact = FALSE
temp_df[["a", exact = FALSE]]
#[1] 1 2 3
#partial name of column will not work with exact = TRUE
temp_df[["a", exact = TRUE]]
#NULL
temp_df[["a", exact = NA]]
#[1] 1 2 3
#Warning message:
#In .subset2(x, i, exact = exact) : partial match of 'a' to 'aa'
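On the question of how a partial match is measured: it is not a percentage of matching characters. The name you type must be a leading fragment of exactly one column name. The sketch below uses a hypothetical phones data frame with made-up column names and placeholder values, echoing the Home Phone / Cell Phone case from the question.

```r
# Hypothetical data: column names and values are for illustration only.
phones <- data.frame(home_phone = c(101, 102),
                     cell_phone = c(201, 202),
                     home_city  = c(1, 2))

cell_dollar <- phones$cell        # unique prefix of "cell_phone": that column
home_dollar <- phones$home        # prefix of two names (ambiguous): NULL
phone_dollar <- phones$phone      # not a prefix of any name: NULL

cell_exact <- phones[["cell"]]                  # [[ is exact by default: NULL
cell_partial <- phones[["cell", exact = FALSE]] # same result as phones$cell
cell_getel <- getElement(phones, "cell")        # exact matching only: NULL
```

So "phone" would not match either Home Phone or Cell Phone style names, and an ambiguous prefix silently returns NULL rather than guessing. When column names are unknown or similar, exact extraction with [[ or getElement is the safer choice.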