Why is stat = "identity" necessary in geom_bar in ggplot?

Tags: r, ggplot2

From this question we see a simple geom_line in the answer.

library(dplyr)
library(ggplot2)
library(lubridate)  # year() is assumed to come from lubridate
BactData %>% filter(year(Date) == 2017) %>% 
  ggplot(aes(Date, Svartediket_CB)) + geom_line()

If we change geom_line to geom_bar, we might expect to see a bar plot, but instead we get an error:

Error: stat_count() must not be used with a y aesthetic.

But it works if we add stat = "identity", like so:

library(dplyr)
library(ggplot2)
library(lubridate)  # year() is assumed to come from lubridate
BactData %>% filter(year(Date) == 2017) %>% 
  ggplot(aes(Date, Svartediket_CB)) + geom_bar(stat = "identity")

Why doesn't geom_bar work without stat = "identity" - i.e. what is the purpose of stat = "identity"?

asked Nov 23 '19 at 15:11 by stevec




2 Answers

There are two layers that are closely related: geom_bar() and geom_col(). The key difference is how they aggregate the data by default.

For geom_bar(), the default behavior is to count the rows for each x value. It doesn't expect a y value, since it is going to compute the count itself. In fact, it throws an error if you give it one (the stat_count() error in the question), since it assumes you're confused. The aggregation method is set by the stat argument of geom_bar(), which defaults to stat = "count".

If you explicitly say stat = "identity" in geom_bar(), you're telling ggplot2 to skip the aggregation and that you'll provide the y values. This mirrors the natural behavior of geom_col() below.
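As a rough sketch of the two behaviors, here is a toy example. The data frame df and its columns are made up for illustration and are not the asker's BactData. layer_data() is used only to inspect what each stat computes.

```r
library(ggplot2)

# Hypothetical toy data: three rows for group "a", one row for group "b"
df <- data.frame(x = c("a", "a", "a", "b"), y = c(10, 20, 30, 5))

# Default stat = "count": only x is mapped; bar height = number of rows per x
p_count <- ggplot(df, aes(x)) + geom_bar()
counts <- layer_data(p_count)$count

# stat = "identity": y is mapped and used as-is, with no counting.
# Rows that share an x value are stacked on top of each other.
p_identity <- ggplot(df, aes(x, y)) + geom_bar(stat = "identity")
heights <- layer_data(p_identity)$ymax
```

Printing p_count and p_identity draws the two plots. In the first, the bars reflect how many rows each x has. In the second, they reflect the y values themselves.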

geom_col(), by contrast, doesn't aggregate the data. From the docs: "geom_col() uses stat_identity(): it leaves the data as is". So it expects you to have the y values already calculated and uses them directly. geom_col() has no stat argument to change that behavior: it always plots the y values you provide, and you must provide them.

If you have y values, you could use either syntax, but I find geom_col() more direct.
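A minimal check that the two spellings produce the same bars, again with a made-up data frame (df is hypothetical):

```r
library(ggplot2)

# Hypothetical data with one precomputed y value per x
df <- data.frame(x = c("a", "b", "c"), y = c(4, 7, 2))

p_col <- ggplot(df, aes(x, y)) + geom_col()
p_bar <- ggplot(df, aes(x, y)) + geom_bar(stat = "identity")

# Compare the bar geometry each layer computes
cols <- c("x", "y", "ymin", "ymax")
same <- isTRUE(all.equal(layer_data(p_col)[cols], layer_data(p_bar)[cols]))
```

If same is TRUE, both layers place their bars at the same positions and heights, so the choice between them is a matter of readability.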

answered Oct 20 '22 at 10:10 by ravic_


@Stevec.

I found the answer at rdocumentation.org.

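Here is what stat = "identity" means, quoted from an older version of the geom_bar documentation (current versions of ggplot2 use stat = "count" rather than stat = "bin" as the default, which matches the stat_count() error in the question):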

"The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat="bin". This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic. If you want the heights of the bars to represent values in the data, use stat="identity" and map a value to the y aesthetic."

Hope this was helpful.

Follow the link to documentation: geom_bar documentation

answered Oct 20 '22 at 09:10 by VictorSaraivaRocha