
tbl_df and data.frame difference when using loops

Tags: r, dplyr

I've been looping over values in a dplyr tbl_df, trying to print the unique combinations of two columns. After much trial and error, the only way I've been able to get exactly the desired output is by converting the tbl_df back to a standard data.frame. I'm aware of the main differences between the two structures, but I still can't understand why each gives me different output.

For example, using this data

hospital <- rep(c("Hospital 1", "Hospital 2", "Hospital 3"), 3)
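ward <- rep(LETTERS[1:2], length.out = length(hospital))  # A B A B ... spelled out, so cbind() does not warn about recycling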
hospitals <- data.frame(cbind(hospital, ward))
hospitals[order(hospitals$hospital, hospitals$ward), ]

#     hospital ward
# 1 Hospital 1    A
# 7 Hospital 1    A
# 4 Hospital 1    B
# 5 Hospital 2    A
# 2 Hospital 2    B
# 8 Hospital 2    B
# 3 Hospital 3    A
# 9 Hospital 3    A
# 6 Hospital 3    B

and the following loop

for(hosp in unique(hospitals$hospital)){
  for(wa in unique(hospitals[hospitals$hospital==hosp, "ward"])){
    print(paste(hosp, wa, sep=" "))
    }
  }

I can get my desired output

#[1] "Hospital 1 A"
#[1] "Hospital 1 B"
#[1] "Hospital 2 B"
#[1] "Hospital 2 A"
#[1] "Hospital 3 A"
#[1] "Hospital 3 B"

But using a tbl_df of the same data I get a different output

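library(dplyr)
hospitals2 <- tbl_df(hospitals)  # tbl_df() is deprecated in newer dplyr; tibble::as_tibble() is the current equivalent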

for(hosp in unique(hospitals2$hospital)){
  for(wa in unique(hospitals2[hospitals2$hospital==hosp, "ward"])){
    print(paste(hosp, wa, sep=" "))
    }
  }


#[1] "Hospital 1 A" "Hospital 1 B"
#[1] "Hospital 2 B" "Hospital 2 A"
#[1] "Hospital 3 A" "Hospital 3 B"

It's not just a printing difference: the loop appears to produce three two-element vectors instead of six one-element vectors, and my subsequent code only works as expected when I run the loop on a normal data.frame.

Can anyone explain why I'm seeing these differences?

asked Mar 02 '15 13:03 by peter_w




1 Answer

You can't run that for loop on a tbl_df when you subset with [. The documentation says it all:

[ Never simplifies (drops), so always returns data.frame.

You can see that hospitals2[hospitals2$hospital==hosp, "ward"] returns a data.frame:

hospitals2[hospitals2$hospital==hosp, "ward"]
#Source: local data frame [3 x 1]

#  ward
#1    A
#2    B
#3    A

whereas

hospitals[hospitals$hospital==hosp, "ward"]
#[1] A B A
#Levels: A B
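
The reason the loop then runs only once per hospital is that for() over a data.frame visits each column, not each row. As a rough sketch, the tbl_df behaviour can be reproduced in base R with drop = FALSE, so no extra package is needed. The data is rebuilt here with stringsAsFactors = FALSE, and the names as_vector, as_frame and the two counters are made up for illustration only.

```r
# Same data as in the question, rebuilt so the snippet runs on its own.
hospitals <- data.frame(
  hospital = rep(c("Hospital 1", "Hospital 2", "Hospital 3"), 3),
  ward     = rep(LETTERS[1:2], length.out = 9),
  stringsAsFactors = FALSE
)

rows <- hospitals$hospital == "Hospital 1"

# Default data.frame behaviour: a single selected column is dropped to a vector.
as_vector <- hospitals[rows, "ward"]
# drop = FALSE keeps a one-column data.frame, mimicking what [ does on a tbl_df.
as_frame <- hospitals[rows, "ward", drop = FALSE]

class(as_vector)  # "character"
class(as_frame)   # "data.frame"

# Looping over a vector visits each element.
n_vector_iterations <- 0
for (wa in unique(as_vector)) {
  n_vector_iterations <- n_vector_iterations + 1
}

# Looping over a data.frame visits each column, so wa is the whole ward column
# and paste() returns a two-element vector in a single iteration.
n_frame_iterations <- 0
for (wa in unique(as_frame)) {
  n_frame_iterations <- n_frame_iterations + 1
  print(paste("Hospital 1", wa))
}
```

With the vector the loop body runs twice (once per ward); with the one-column data.frame it runs once, which matches the three two-element vectors seen in the question.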

Use [[ to extract a column vector, for instance

for(hosp in unique(hospitals2$hospital)){
    for(wa in unique(hospitals2[hospitals2$hospital==hosp,][["ward"]])){
        print(paste(hosp, wa, sep=" "))
    }
} 
#[1] "Hospital 1 A"
#[1] "Hospital 1 B"
#[1] "Hospital 2 B"
#[1] "Hospital 2 A"
#[1] "Hospital 3 A"
#[1] "Hospital 3 B"
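
If the goal is only the unique hospital/ward pairs, one possible alternative is to skip the nested loops and deduplicate the two columns together. The sketch below uses base R only; selecting two columns with [ returns a data.frame in both cases, so it should behave the same on a tbl_df. The names combos and labels are illustrative, and unlike the loop above the result is sorted. With dplyr loaded, distinct(hospitals2, hospital, ward) is another way to get the same pairs.

```r
# Same data as in the question, rebuilt so the snippet runs on its own.
hospitals <- data.frame(
  hospital = rep(c("Hospital 1", "Hospital 2", "Hospital 3"), 3),
  ward     = rep(LETTERS[1:2], length.out = 9),
  stringsAsFactors = FALSE
)

# unique() on a data.frame removes duplicated rows.
combos <- unique(hospitals[, c("hospital", "ward")])

# Sort so the pairs come out in a predictable order.
combos <- combos[order(combos$hospital, combos$ward), ]

# paste() is vectorised, so one call builds all six labels.
labels <- paste(combos$hospital, combos$ward, sep = " ")

for (label in labels) {
  print(label)
}
```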
answered Oct 19 '22 20:10 by Khashaa