Why does subsetting a column from a data frame vs. a tibble give different results

Tags: dataframe, r, dplyr, subset

This is a 'why' question and not a 'How to' question.

I have a tibble that is the result of a dplyr aggregation:

> str(urls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   144 obs. of  4 variables:
 $ BRAND       : chr  "Bobbi Brown" "Calvin Klein" "Chanel" "Clarins" ...
 $ WEBSITE     : chr  "http://www.bobbibrowncosmetics.com/" "http://www.calvinklein.com/shop/en/ck" "http://www.chanel.com/en_US/" "http://www.clarinsusa.com/" ...
 $ domain      : chr  "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...
 $ final_domain: chr  "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...

When I try to extract the column final_domain as a character vector here's what happens:

> length(as.character(urls[ ,4]))
[1] 1

When I instead coerce it to a data frame first, I get what I actually want:

> length(as.character(as.data.frame(urls)[ ,4]))
[1] 144

The str() output of the tibble and the data frame looks the same, but the results differ. Why?

asked Oct 07 '16 13:10

vagabond

3 Answers

The underlying reason is that subsetting a tbl and a data frame produces different results when only one column is selected.

By default, [.data.frame will drop the dimensions if the result has only 1 column, similar to how matrix subsetting works. So the result is a vector.
[.tbl_df will never drop dimensions like this; it always returns a tbl.
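
A minimal sketch of the difference, assuming the tibble package is installed. The objects df and tb are made up for illustration and are not the urls data from the question:

```r
library(tibble)

# Hypothetical two-column data, once as a data frame and once as a tibble
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
tb <- as_tibble(df)

# [.data.frame simplifies a single selected column to a bare vector
df_col <- df[, 1]
is.data.frame(df_col)   # FALSE
length(df_col)          # 3, one element per row

# [.tbl_df keeps the tibble structure
tb_col <- tb[, 1]
is.data.frame(tb_col)   # TRUE
length(tb_col)          # 1, because length() of a data frame counts columns
```

The last line is the same effect as the length of 1 seen in the question: the object being measured is still a one-column table, not a vector.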

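In turn, as.character ignores the class of a tbl and treats it as a plain list. Called on a list, as.character acts like deparse on each vector-valued element: the string it returns for an element is R code that can be parsed and evaluated to reproduce that element. A one-column tbl is a list of length one, so the result is a single string, which is why length() returns 1.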
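
The deparse-like behaviour can be seen with base R alone. This is only a sketch: col_as_list is a hypothetical stand-in for a one-column tbl viewed as a plain list, with made-up values:

```r
# A one-column table seen as a plain list with a single element
col_as_list <- list(final_domain = c("a.com/", "b.com/", "c.com/"))

out <- as.character(col_as_list)
length(out)   # 1: one string per list element, not one per value

# The string is R code for the vector; parsing and evaluating it
# recovers the original column
identical(eval(parse(text = out)), col_as_list$final_domain)   # TRUE
```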

The tbl behaviour is arguably the right thing to do in most circumstances, because dropping dimensions can easily lead to bugs: subsetting a data frame usually results in another data frame, but sometimes it doesn't. In this specific case it doesn't do what you want.

If you want to extract a column from a tbl as a vector, you can use list-style indexing: urls[[4]] or urls$final_domain.
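
As a short sketch of list-style indexing, assuming the tibble package is installed; the two-row urls object below is a made-up stand-in for the 144-row tibble in the question:

```r
library(tibble)

# Hypothetical stand-in for the urls tibble in the question
urls <- tibble(
  BRAND        = c("Chanel", "Clarins"),
  final_domain = c("chanel.com/en_US/", "clarinsusa.com/")
)

by_position <- urls[[2]]            # list-style indexing by position
by_name     <- urls$final_domain    # or by name

is.character(by_position)           # TRUE: a plain character vector
identical(by_position, by_name)     # TRUE
length(by_position) == nrow(urls)   # TRUE: one element per row
```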

answered Sep 28 '22 03:09

Hong Ooi

I think the fundamental answer to your question is that Hadley Wickham, when writing tibble 1.0, wanted consistent behavior of the [ operator. This decision is discussed, somewhat indirectly, in Wickham's Advanced R in the chapter on Subsetting:

It’s important to understand the distinction between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want. Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type. Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors. (It will work for your test cases, but then someone will pass in a single column data frame and it will fail in an unexpected and unclear way.)

This shows that Hadley is concerned about the inconsistent default behavior of [.data.frame, which explains why he chose to change that behavior in tibble.

With this terminology in mind, whether [.data.frame produces a simplifying or a preserving subset by default depends on the input rather than on how the code is written. For example, take a data frame data_df and subset it:

data_df <- data.frame(a = runif(10), b = letters[1:10])

data_df[, 2]    # one column: simplified to a vector
data_df[, 1:2]  # two columns: still a data frame

You get a vector in one case and a data frame in the other. To predict the type of the output, you either have to know in advance how many columns will be selected (i.e. length(list_of_columns)), which may come from user input, or you have to pass the drop = argument explicitly. Both of the following return the same class of object, although the added argument is redundant in the second case (and may be unknown to the majority of R users):

data_df[, 2, drop = FALSE]
data_df[, 1:2, drop = FALSE]
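
The failure mode described in the quote can be sketched with a small helper. count_selected and count_selected_safe are hypothetical functions invented here for illustration; only base R is assumed:

```r
data_df <- data.frame(a = runif(10), b = letters[1:10])

# Hypothetical helper: report how many columns a selection contains
count_selected <- function(df, cols) ncol(df[, cols])

# Same helper with preserving subsetting
count_selected_safe <- function(df, cols) ncol(df[, cols, drop = FALSE])

count_selected(data_df, c("a", "b"))   # 2
count_selected(data_df, "a")           # NULL: the selection collapsed to a vector
count_selected_safe(data_df, "a")      # 1
```

The first helper works for every multi-column input and silently returns NULL only when a single column is requested, which is the kind of input-dependent surprise that drop = FALSE avoids.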

With tibble (or dplyr), we have consistent behavior by default, so we can be assured of having the same class of object when subsetting with the [ operator no matter how many columns we return:

library(tibble)
data_df <- tibble(a = runif(10), b = letters[1:10])

data_df[, 2]    # a tibble with one column
data_df[, 1:2]  # a tibble with two columns

answered Sep 28 '22 03:09

Tom

If you print the result of as.character, you'll notice the difference:

library(tibble)
x <- tribble(
    ~x, ~y,  ~z,
    "a", 2,  3.6,
    "b", 1,  8.5
)

as.character(as.data.frame(x)[ ,2])
# [1] "2" "1"

as.character(x[ ,2])
# [1] "c(2, 1)"

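On the one-column tibble, as.character collapses the whole column into a single string, whereas on the vector extracted from the data frame it converts each element separately. This thread should be helpful: https://stackoverflow.com/questions/21618423/extract-a-dplyr-tbl-column-as-a-vector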

answered Sep 28 '22 01:09

mt1022
