Quick and short of it is I'm having problems summarizing count and aggregate functions with conditions on the same factor.
Suppose I have this dataframe:
library(dplyr)
df = tbl_df(data.frame(
company=c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
year=c("2011", "2010", "2009", "2011", "2010", "2013"),
product=c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust",
"Kindness", "Helping Hand"),
price=c("5.67", "7.12", "12.99", "10.99", NA, FALSE)))
which creates this dataframe (in essence):
company year product price
1 Acme 2011 Wrench 5.67
2 Meca 2010 Hammer 7.12
3 Emca 2009 Sonic Screwdriver 12.99
4 Acme 2011 Fairy Dust 10.99
5 Meca 2010 Kindness NA
... ... ... ... ...
n Emca 2013 Helping Hand FALSE
Let's say I want to df <- group_by(df, company, year, product)
and then get the following info all in one collection (i.e. dataframe):
Max price
summarize(df, count = n()) #satisfies first item obviously
I'm having issues trying to get the others. I think I need to use pipe operators? If so, can anyone provide some guidance?
This is what I've tried and it is blatantly wrong, but I'm not sure where to go next:
summarize(df,
total.count = n(),
count = filter(df, is.na(price)),
avg.price = filter(df, !is.na(price), price != FALSE),
max.price = max(filter(df, !is.na(price), price != FALSE))
And yes, I have reviewed documentation and I'm sure the answers are there, but they might be too advanced for my understanding. Thanks in advance!
The COUNT() function returns the number of rows in a database table.
Conditional aggregation, as its name implies, is performing data aggregation over a set of data that meets certain condition which is contained within a given data range.
Summarize Function in R Programming. As its name implies, the summarize function reduces a data frame to a summary of just one vector or value. Many times, these summaries are calculated by grouping observations using a factor or categorical variables first.
In order to use the aggregate function for mean in R, you will need to specify the numerical variable on the first argument, the categorical (as a list) on the second and the function to be applied (in this case mean ) on the third. An alternative is to specify a formula of the form: numerical ~ categorical .
Assuming that your original dataset is similar to the one you created (i.e. with NA
as character
. You could specify na.strings
while reading the data using read.table
. But, I guess NAs would be detected automatically.
The price
column is factor
which needs to be converted to numeric
class. When you use as.numeric
, all the non-numeric elements (i.e. "NA"
, FALSE) gets coerced to NA
) with a warning.
library(dplyr)
df %>%
mutate(price=as.numeric(as.character(price))) %>%
group_by(company, year, product) %>%
summarise(total.count=n(),
count=sum(is.na(price)),
avg.price=mean(price,na.rm=TRUE),
max.price=max(price, na.rm=TRUE))
I am using the same dataset
(except the ...
row) that was showed.
df = tbl_df(data.frame(company=c("Acme", "Meca", "Emca", "Acme", "Meca","Emca"),
year=c("2011", "2010", "2009", "2011", "2010", "2013"), product=c("Wrench", "Hammer",
"Sonic Screwdriver", "Fairy Dust", "Kindness", "Helping Hand"), price=c("5.67",
"7.12", "12.99", "10.99", "NA",FALSE)))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With