I've been looping over values in a dplyr tbl_df, trying to print unique combinations of two columns. After much trial and error, I've only been able to get exactly the desired output by converting the tbl_df back to a standard data.frame. I'm aware of the main differences between the two structures, but I still can't understand the differing output I'm seeing with each.
For example, using this data
hospital <- rep(c("Hospital 1", "Hospital 2", "Hospital 3"), 3)
ward <- LETTERS[1:2]
hospitals <- data.frame(cbind(hospital, ward))
hospitals[order(hospitals$hospital, hospitals$ward), ]
# hospital ward
# 1 Hospital 1 A
# 7 Hospital 1 A
# 4 Hospital 1 B
# 5 Hospital 2 A
# 2 Hospital 2 B
# 8 Hospital 2 B
# 3 Hospital 3 A
# 9 Hospital 3 A
# 6 Hospital 3 B
and the following loop
for(hosp in unique(hospitals$hospital)){
for(wa in unique(hospitals[hospitals$hospital==hosp, "ward"])){
print(paste(hosp, wa, sep=" "))
}
}
I can get my desired output
#[1] "Hospital 1 A"
#[1] "Hospital 1 B"
#[1] "Hospital 2 B"
#[1] "Hospital 2 A"
#[1] "Hospital 3 A"
#[1] "Hospital 3 B"
But using a tbl_df of the same data I get a different output
library(dplyr)
# tbl_df() is deprecated in dplyr >= 1.0.0; as_tibble() is the current equivalent
hospitals2 <- tbl_df(hospitals)
for(hosp in unique(hospitals2$hospital)){
for(wa in unique(hospitals2[hospitals2$hospital==hosp, "ward"])){
print(paste(hosp, wa, sep=" "))
}
}
#[1] "Hospital 1 A" "Hospital 1 B"
#[1] "Hospital 2 B" "Hospital 2 A"
#[1] "Hospital 3 A" "Hospital 3 B"
It's not just a printing difference: this appears to be three two-element vectors instead of six one-element vectors, and my subsequent code only works as expected when I run the loop on a normal data frame.
Can anyone explain why I'm seeing these differences?
There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data.
A tbl_df object is a data frame with a nicer printing method, which is useful when working with large data sets. The tibble R package, developed by Hadley Wickham, provides easy-to-use functions for creating tibbles, a modern rethinking of data frames.
The tbl_df class is a subclass of data.frame, created in order to have different default behaviour. The colloquial term "tibble" refers to a data frame that has the tbl_df class. Tibble is the central data structure for the set of packages known as the tidyverse, including dplyr, ggplot2, tidyr, and readr.
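The subclass relationship described above can be checked directly. This is a minimal sketch, assuming the tibble package is installed; the toy data frame df and its column x are made up for illustration.

```r
library(tibble)

# Toy data, not from the question
df  <- data.frame(x = 1:3)
tbl <- as_tibble(df)

class(df)                    # "data.frame"
class(tbl)                   # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame, just with extra classes in front
inherits(tbl, "data.frame")  # TRUE
```

Because tbl_df comes first in the class vector, methods such as print and [ dispatch to the tibble versions before falling back to the data.frame ones.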
You can't write this for loop on a tbl_df using [ subsetting. The documentation says it all:
[ never simplifies (drops), so it always returns a data.frame.
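You can see that hospitals2[hospitals2$hospital==hosp, "ward"] returns a data.frame. unique() then also returns a data.frame, and a for loop over a data.frame iterates over its columns, so wa receives the whole ward column at once and paste() produces a two-element vector: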
hospitals2[hospitals2$hospital==hosp, "ward"]
#Source: local data frame [3 x 1]
# ward
#1 A
#2 B
#3 A
whereas
hospitals[hospitals$hospital==hosp, "ward"]
#[1] A B A
#Levels: A B
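To make the difference concrete, here is a small sketch, assuming the tibble package is installed. The shortened hospital names and the helper variables col_df, col_tbl, n_iter and len are made up for illustration.

```r
library(tibble)

# Toy data, smaller than the question's
df  <- data.frame(hospital = c("H1", "H1", "H2"), ward = c("A", "B", "A"))
tbl <- as_tibble(df)

col_df  <- df[df$hospital == "H1", "ward"]    # base data.frame drops to a vector
col_tbl <- tbl[tbl$hospital == "H1", "ward"]  # tibble keeps a one-column tibble

is.data.frame(col_df)   # FALSE
is.data.frame(col_tbl)  # TRUE

# A for loop over a data frame visits its columns, not its rows,
# so the body runs once and wa holds the whole ward column.
n_iter <- 0
for (wa in unique(col_tbl)) {
  n_iter <- n_iter + 1
  len <- length(wa)
}
n_iter  # 1
len     # 2
```

This is why paste() in the question's tbl_df loop is called once per hospital with a vector of wards, rather than once per ward.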
Use [[ to extract a column vector, for instance:
for(hosp in unique(hospitals2$hospital)){
for(wa in unique(hospitals2[hospitals2$hospital==hosp,][["ward"]])){
print(paste(hosp, wa, sep=" "))
}
}
#[1] "Hospital 1 A"
#[1] "Hospital 1 B"
#[1] "Hospital 2 B"
#[1] "Hospital 2 A"
#[1] "Hospital 3 A"
#[1] "Hospital 3 B"
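If the goal is only to list the unique hospital/ward pairs, the nested loop may not be needed at all. The following is a sketch of one alternative, not the approach from the answer above: it assumes dplyr is installed and uses distinct(), which returns rows in order of first appearance, so the order can differ from the loop's output. The names combos and labels are made up for illustration.

```r
library(dplyr)

# Same data as in the question, built directly as a tibble
hospitals2 <- tibble(
  hospital = rep(c("Hospital 1", "Hospital 2", "Hospital 3"), 3),
  ward     = rep(LETTERS[1:2], length.out = 9)
)

# One row per unique hospital/ward pair
combos <- distinct(hospitals2, hospital, ward)

# $ extracts plain vectors, so paste() yields one string per pair
labels <- paste(combos$hospital, combos$ward, sep = " ")

for (label in labels) {
  print(label)
}
```

Sorting first with arrange(combos, hospital, ward) would give a deterministic order if that matters for downstream code.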