
Difference between data[ , "col"] and data$col

Tags:

dataframe

r

From other answers on this site to similar questions, and e.g. from pages like http://www.r-tutor.com/r-introduction/data-frame/data-frame-column-vector, it seems that when I extract a variable from a data.frame, data[ , "col"] and data$col should yield the same result. But now I have some data in Excel:

LU                                       Urban_LU     LU_Index  Urban_LU_index
Residential                              Residential  2         0
Rural residential                        Residential  3         0
Commercial                               Commercial   4         1
Public institutions including education  Industrial   5         1
Industry                                 Industrial   7         2


and I read it with read_excel from the readxl package:

library(readxl)
data <- read_excel("data.xlsx", "Sheet 1")

Now I extract a single variable from the data frame, using [ or $:

data[ , "LU"]
# Source: local data frame [5 x 1]
# 
#                                        LU
#                                     (chr)
# 1                             Residential
# 2                       Rural residential
# 3                              Commercial
# 4 Public institutions including education
# 5                                Industry

data$LU
# [1] "Residential"                             "Rural residential"                      
# [3] "Commercial"                              "Public institutions including education"
# [5] "Industry"                               

length(data[ , "LU"])
# [1] 1
length(data$LU)
# [1] 5

Also, what I find suspicious are the classes of the data obtained from read_excel and the data which results from the two different modes of extraction:

class(data)
# [1] "tbl_df"     "tbl"        "data.frame"

class(data[ , "LU"])
# [1] "tbl_df"     "data.frame"

class(data$LU)
# [1] "character"

So what's the difference between [ , "col"] and $col? Am I missing something from the manuals, or is this a special case? Also, what's with the tbl_df and tbl class identifiers? I suspect they are the cause of my confusion; what do they mean?

Roel asked Nov 09 '22 01:11



1 Answer

More of an extended comment:

The fact that readxl::read_excel returns output of class tbl_df seems poorly documented in ?read_excel. This behaviour was, however, mentioned in the announcement of readxl on the RStudio blog:

"[read_excel r]eturns output with class c("tbl_df", "tbl", "data.frame")"

To learn more about tbl_df, we need to consult the dplyr help pages. In the Methods section of ?dplyr::tbl_df, we find that "tbl_df implements two important base methods: [ Never simplifies (drops), so always returns data.frame".

For more background, read about the drop argument in ?`[.data.frame` (the backticks are needed for the help call to parse).
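To make the effect of drop concrete, here is a minimal sketch on a plain data.frame. The data frame df is a hypothetical stand-in built from three of the question's values, not the original Excel file, and the tibble part only runs if the tibble package happens to be installed, which is an assumption.

```r
# Hypothetical stand-in for the question's data (three rows copied from it).
df <- data.frame(
  LU = c("Residential", "Rural residential", "Commercial"),
  LU_Index = c(2, 3, 4),
  stringsAsFactors = FALSE
)

# For a plain data.frame, selecting one column with [ , ] drops to a vector by default.
v_default <- df[, "LU"]
class(v_default)   # "character"
length(v_default)  # 3 (number of elements)

# drop = FALSE keeps a one-column data frame, which is what tbl_df does regardless.
d_kept <- df[, "LU", drop = FALSE]
class(d_kept)      # "data.frame"
length(d_kept)     # 1 (number of columns)

# $ and [[ return the column itself, so they agree with each other.
identical(df$LU, df[["LU"]])  # TRUE

# Assumption: tibble is installed. A tibble never drops with [ , ].
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  is.data.frame(tb[, "LU"])  # TRUE: still a one-column table
  is.character(tb[["LU"]])   # TRUE: [[ (or $) gives the vector
}
```

So when the object may be a tbl_df, data$LU or data[["LU"]] is the dependable way to get the column as a vector.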

Related Q&A: Extract a dplyr tbl column as a vector and Best practice to get a dropped column in dplyr tbl_df.

See also the 'original' issue on github and the discussion therein.

Henrik answered Nov 15 '22 07:11
