
What is the difference between geoms and stats in ggplot2?

Tags: plot, r, ggplot2

Both geoms and stats can be used to make plots in the R package ggplot2, and they often give similar results (e.g., geom_area and stat_bin). They also often have slightly different arguments, e.g. in 2-D density plots:

geom_density_2d(mapping = NULL, data = NULL, stat = "density2d",
  position = "identity", ..., lineend = "butt", linejoin = "round",
  linemitre = 1, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

stat_density_2d(mapping = NULL, data = NULL, geom = "density_2d",
  position = "identity", ..., contour = TRUE, n = 100, h = NULL, na.rm =
  FALSE, show.legend = NA, inherit.aes = TRUE)

Are there any fundamental differences between the two types of objects?

asked Aug 04 '16 19:08 by tluh


3 Answers

This is only meant to supplement the accepted answer.

According to Hadley Wickham, the author of ggplot2, in his book 'ggplot2: Elegant Graphics for Data Analysis' (p. 91, Section 5.2, 'Building a plot layer by layer'):

You only need to set one of stat and geom: every geom has a default stat, and every stat has a default geom.

The accepted answer explains well why the two are different. This answer explains why they are difficult to distinguish in practice: whenever you use a geom layer, you are also implicitly using a stat layer (even if it is just the identity transformation); likewise, whenever you use a stat layer, you are also implicitly using a geom layer.

If you are fine with the defaults used by either layer, it would be redundant to state both explicitly. Even if you are not fine with the defaults, you can override them through each layer's parameters (i.e. you can pass a different geom as a parameter to any stat_* function, and a different stat as a parameter to any geom_* function). In the words of Hadley Wickham (same source as above):
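As a rough illustration of that pairing, the sketch below inspects the layer objects returned by geom_bar() and stat_count(). It assumes that the `$geom` and `$stat` fields of a layer object are accessible, which is an implementation detail of ggplot2 rather than a documented interface.

```r
library(ggplot2)

# Two layers created through different wrappers.
via_geom <- geom_bar()    # geom chosen by name, stat left at its default
via_stat <- stat_count()  # stat chosen by name, geom left at its default

# Assumption: layer objects expose their geom and stat as ggproto objects.
pairing <- rbind(
  via_geom = c(geom = class(via_geom$geom)[1], stat = class(via_geom$stat)[1]),
  via_stat = c(geom = class(via_stat$geom)[1], stat = class(via_stat$stat)[1])
)
print(pairing)
```

Both rows should name the same geom/stat pair, which is why the two calls are hard to tell apart in a finished plot.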

You can pass params in ... (in which case stat and geom parameters are automatically teased apart)
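A small sketch of what "teased apart" could look like in practice: here `adjust` (a parameter of the density stat) and `colour` (a fixed aesthetic for the geom) are both passed through `...` to geom_line(). The `$stat_params` and `$aes_params` fields read below are internal to the layer object and are used only for illustration; treat them as an assumption, not a stable API.

```r
library(ggplot2)

# Both extra arguments travel through `...`:
#   adjust -> bandwidth multiplier understood by the density stat
#   colour -> constant aesthetic understood by the line geom
l <- geom_line(stat = "density", adjust = 2, colour = "red")

# Assumption: the layer stores the separated parameters in these fields.
l$stat_params$adjust
l$aes_params$colour
```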

This is kind of difficult to understand conceptually, which is why I have had this question as well. In his paper about the philosophy underlying ggplot2 (Section 4, 'A Hierarchy of Defaults'), Hadley Wickham explains the practical consideration behind this default behavior: it simplifies code that would otherwise be unnecessarily long.

For example, without default specifications, and using the grammar of graphics alone, the code for a simple scatter plot might look like:

library(ggplot2)

ggplot() +
  layer(
    data = diamonds, mapping = aes(x = carat, y = price),
    geom = "point", stat = "identity", position = "identity"
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian()

Using defaults for the scales and coordinates, we can write something instead like:

ggplot(data = diamonds, aes(x = carat, y = price)) +
  layer(
    geom = "point", stat = "identity", position = "identity"
  )

But this is still annoyingly long of course, since the values of stat and position are just "identity", which basically means 'do nothing' -- so why have to say that explicitly?

However, the layer() function does not have default values for stat or position -- they need to be specified explicitly in a call to the layer() function.

To get around this, Hadley wrote the geom_* and stat_* functions as wrappers around layer() that supply default values for both the geom and the stat parameter. The difference between the stat_* and geom_* functions is which of the two, stat or geom, is fixed by the wrapper and which is only a default that you can change.

Source: http://ggplot2.tidyverse.org/reference/layer.html

So for the geom_* functions you can change the default value of the stat parameter but not the default value of the geom parameter, while for the stat_* functions you can change the default value of the geom parameter but not the default value of the stat parameter.
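A minimal sketch of both directions, using the bar/count pair as an example. As before, reading `$geom` and `$stat` off the layer object is an assumption about ggplot2 internals, made only to show what each call ended up with.

```r
library(ggplot2)

# geom_* wrapper: the geom is fixed, the stat can be swapped.
bars_raw <- geom_bar(stat = "identity")   # default would be stat = "count"

# stat_* wrapper: the stat is fixed, the geom can be swapped.
count_pts <- stat_count(geom = "point")   # default would be geom = "bar"

# Assumption: layer objects expose their geom and stat as ggproto objects.
c(geom = class(bars_raw$geom)[1],  stat = class(bars_raw$stat)[1])
c(geom = class(count_pts$geom)[1], stat = class(count_pts$stat)[1])
```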

A layer is a combination of data, stat and geom with a potential position adjustment. Usually layers are created using geom_* or stat_* calls but it can also be created directly using this function [the layer() function].
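To make the wrapper idea concrete, the sketch below builds the diamonds scatter plot twice, once with an explicit layer() call and once with geom_point(), and compares the data each layer computes. The variable names are illustrative; layer_data() is the ggplot2 helper used here to extract the computed data.

```r
library(ggplot2)

# Explicit form: geom, stat and position all spelled out.
p_long <- ggplot(diamonds, aes(x = carat, y = price)) +
  layer(geom = "point", stat = "identity", position = "identity")

# Wrapper form: geom_point() fills in the same stat and position.
p_short <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Compare the data frames that each layer hands to the renderer.
d_long  <- layer_data(p_long, 1)
d_short <- layer_data(p_short, 1)
same_data <- isTRUE(all.equal(d_long, d_short))
same_data
```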

answered Oct 07 '22 19:10 by Chill2Macht


"geoms" stands for "geometric objects." These are the core elements that you see on the plot: objects such as points, lines, areas, and curves.

"stats" stands for "statistical transformations." These objects summarize the data in different ways, such as counting observations, creating a loess line that best fits the data, or adding a confidence interval to the loess line.

As geoms are the "core" of the plot, these are required objects. On the other hand, stats are not required to produce a plot, but can greatly enhance the final plot.

As @eipi10 notes in the comments, these distinctions are somewhat conceptual as the majority of geoms undergo some statistical transformation prior to being plotted. These include geom_bar, geom_smooth, and geom_quantile. Some common exceptions where the data is presented in more or less "raw" form are geom_point and geom_line and the less commonly used geom_rug.
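One way to check which geoms transform the data is to look up the default stat of each geom named above. This sketch relies on the internal `$stat` field of the layer object, so take it as an illustration rather than a supported interface.

```r
library(ggplot2)

# Layers for the geoms mentioned in this answer, all with default arguments.
layers <- list(
  geom_bar      = geom_bar(),
  geom_smooth   = geom_smooth(),
  geom_quantile = geom_quantile(),
  geom_point    = geom_point(),
  geom_line     = geom_line(),
  geom_rug      = geom_rug()
)

# Assumption: each layer exposes its stat as a ggproto object in `$stat`.
default_stats <- vapply(layers, function(l) class(l$stat)[1], character(1))
print(default_stats)
```

Geoms whose default stat is the identity stat plot the data as given; the others compute something first.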

answered Oct 07 '22 19:10 by lmo


A geom is the geometric representation, while a stat is the statistical computation behind it. Some geoms rely on a stat function, for example geom_bar() uses stat_count(). In that case geom_bar() takes a single position aesthetic (x or y) and stat_count() takes care of counting the frequencies.
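As a quick check of that division of labour, the sketch below compares the counts that the geom_bar() layer computes with a manual tabulation. The mpg dataset (shipped with ggplot2) and the variable names are chosen here only for illustration.

```r
library(ggplot2)

# Only x is mapped; the bar heights come from the stat.
p <- ggplot(mpg, aes(x = class)) + geom_bar()

# Data as computed by the layer's stat (one row per class).
computed <- layer_data(p, 1)

# The same frequencies, counted by hand.
manual <- as.vector(table(mpg$class))

data.frame(class = sort(unique(mpg$class)),
           from_stat = computed$count,
           by_hand = manual)
```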

answered Oct 07 '22 21:10 by Houssam Baiz