From other answers on this site on similar questions, and e.g. from pages like http://www.r-tutor.com/r-introduction/data-frame/data-frame-column-vector , it seems that when I extract a variable from a data.frame, data[ , "col"] and data$col should yield the same result. But now I have some data in Excel:
LU                                       Urban_LU     LU_Index  Urban_LU_index
Residential                              Residential  2         0
Rural residential                        Residential  3         0
Commercial                               Commercial   4         1
Public institutions including education  Industrial   5         1
Industry                                 Industrial   7         2
and I read it with read_excel from the readxl package:
library(readxl)
data <- read_excel("data.xlsx", "Sheet 1")
Now I extract a single variable from the data frame, using [ or $:
data[ , "LU"]
# Source: local data frame [5 x 1]
#
# LU
# (chr)
# 1 Residential
# 2 Rural residential
# 3 Commercial
# 4 Public institutions including education
# 5 Industry
data$LU
# [1] "Residential" "Rural residential"
# [3] "Commercial" "Public institutions including education"
# [5] "Industry"
length(data[ , "LU"])
# [1] 1
length(data$LU)
# [1] 5
Also, what I find suspicious are the classes of the data obtained from read_excel and the data which results from the two different modes of extraction:
class(data)
# [1] "tbl_df" "tbl" "data.frame"
class(data[ , "LU"])
# [1] "tbl_df" "data.frame"
class(data$LU)
# [1] "character"
So what's the difference between [ , "col"] and $col? Am I missing something from the manuals, or is this a special case? Also, what's with the tbl_df and tbl class identifiers? I suspect that they are the cause of my confusion; what do they mean?
More of an extended comment:
The fact that readxl::read_excel returns output of class tbl_df seems poorly documented in ?read_excel. This behaviour was mentioned in the announcement of readxl on the RStudio blog though:
"[read_excel r]eturns output with class c("tbl_df", "tbl", "data.frame")"
To learn more about tbl_df, we need to consult the dplyr help pages. In the Methods section of ?dplyr::tbl_df, we find that
"tbl_df implements two important base methods: [ Never simplifies (drops), so always returns data.frame".
For more background, read about the drop argument in ?"[.data.frame".
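As a minimal sketch of what the drop argument does on an ordinary data.frame (base R only, no readxl or dplyr involved; the data frame df is re-typed from the spreadsheet in the question, and the names as_vector and as_frame are only illustrative):

```r
# A plain data.frame with the same columns as the spreadsheet in the question.
df <- data.frame(
  LU = c("Residential", "Rural residential", "Commercial",
         "Public institutions including education", "Industry"),
  Urban_LU = c("Residential", "Residential", "Commercial",
               "Industrial", "Industrial"),
  LU_Index = c(2, 3, 4, 5, 7),
  Urban_LU_index = c(0, 0, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Selecting a single column with the default drop behaviour
# simplifies the result to a vector.
as_vector <- df[ , "LU"]

# drop = FALSE suppresses the simplification and keeps a
# one-column data.frame.
as_frame <- df[ , "LU", drop = FALSE]

class(as_vector)   # "character"
length(as_vector)  # 5, the number of elements
class(as_frame)    # "data.frame"
length(as_frame)   # 1, the number of columns

# $ and [[ extract the column itself, so they agree with the dropped result.
identical(df$LU, as_vector)      # TRUE
identical(df[["LU"]], as_vector) # TRUE
```

With drop = FALSE, a plain data.frame behaves the way [ always behaves on a tbl_df, which is why length() reports 1 (one column) rather than 5 in the question. $ and [[ return the column as a vector for both classes.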
Related Q&A: Extract a dplyr tbl column as a vector and Best practice to get a dropped column in dplyr tbl_df.
See also the 'original' issue on GitHub and the discussion therein.