I have data like:
Name        Count
Object1       110
Object2       111
Object3        95
Object4        40
...
Object2000      1
So only the first 3 objects have high counts; the other 1996 objects have counts below 40, with the majority below 10. I am plotting this data as a ggplot bar chart like this:
ggplot(data = object_count, mapping = aes(x = object, y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
My plot is as below. As you can see, because there are so many objects with low counts, the graph is very wide and each bar is so thin that even the high-count objects are almost invisible. Is there a better way to represent this data? My goal is to show the few top-count objects and to show that there are many low-count ones. Is there a way to group the low-count ones together?
My guess is that your data looks something like this:
library(tidyverse)  # tibble, dplyr, and ggplot2 are used below; forcats is called with forcats::

set.seed(1)
object_count <- tibble(
  obj_num = 1:2000,
  object  = paste0("Object", obj_num),
  count   = ceiling(20 * rpois(2000, 10) / obj_num)
)
head(object_count)
## A tibble: 6 x 3
##   obj_num object  count
##     <int> <chr>   <dbl>
## 1       1 Object1   160
## 2       2 Object2   100
## 3       3 Object3    46
## 4       4 Object4    55
## 5       5 Object5    56
## 6       6 Object6    40
Sure enough, when I plot that with ggplot(object_count, aes(object, count)) + geom_col() + [theme stuff]
I get a similar figure.
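For reference, that reproduction with the theme() lines from the question spelled out in place of "[theme stuff]" is:
ggplot(object_count, aes(object, count)) +
  geom_col() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())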
Here are some strategies "to show a few top-count objects and to show there are many low-count ones."
A vanilla histogram might not be clarifying here, since the important big values appear dramatically less often and would not be prominent enough:
ggplot(object_count, aes(count)) +
  geom_histogram()
But we could change that by transforming the y axis to bring more emphasis to small values. The pseudo_log transformation is nice for that since it works like a log transform for large values but is roughly linear between -1 and 1. In this view, we can clearly see where the outlier counts that appear only once sit, but also see that there are many more small values. The binwidth = 1 here could be set to something wider if the specific big values aren't as important as their general range.
ggplot(object_count, aes(count)) +
  geom_histogram(binwidth = 1) +
  scale_y_continuous(trans = "pseudo_log",
                     breaks = c(0:3, 100, 1000), minor_breaks = NULL)
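If you want to see what the transformation actually does to numbers, scales::pseudo_log_trans() exposes it directly. This is just a quick sketch; the defaults (sigma = 1, natural log base) are what trans = "pseudo_log" uses when you don't override them.
library(scales)
# pseudo_log is asinh(x / (2 * sigma)) / log(base): roughly linear near zero,
# roughly logarithmic for large values
pl <- pseudo_log_trans()              # defaults: sigma = 1, base = exp(1)
pl$transform(c(0, 1, 10, 100, 1000))  # note how the large values get compressed
pl$inverse(pl$transform(100))         # round-trips back to 100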
Another option could be to split your view into two pieces, one with detail on the big values, the other showing all the small values:
object_count %>%
  mutate(biggies = if_else(count > 20, "Big", "Little")) %>%
  ggplot(aes(obj_num, count)) +
  geom_col() +
  facet_grid(~biggies, scales = "free")
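A related trick, if you don't mind an extra package, is ggforce::facet_zoom(), which keeps the full plot and adds a zoomed companion panel. This is just a sketch under the assumption that ggforce is available; the ylim cutoff of 20 is arbitrary, mirroring the count > 20 split above.
library(ggforce)

ggplot(object_count, aes(obj_num, count)) +
  geom_col() +
  facet_zoom(ylim = c(0, 20))  # zoomed panel showing only the low-count range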
Another option might be to lump together all the counts under 10. The version below emphasizes the object name and count, and the "Others" category has been labeled to show how many values it includes.
object_count %>%
  # lump every object with fewer than 10 counts into an "Others" bucket
  mutate(group = if_else(count < 10, "Others", object)) %>%
  group_by(group) %>%
  summarize(avg = mean(count), count = n()) %>%
  ungroup() %>%
  # label "Others" with how many objects it contains
  mutate(group = if_else(group == "Others",
                         paste0("Others (n = ", count, ")"),
                         group)) %>%
  mutate(group = forcats::fct_reorder(group, avg)) %>%
  ggplot() +
  geom_col(aes(group, avg)) +
  geom_text(aes(group, avg, label = round(avg, 0)), hjust = -0.5) +
  coord_flip()
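If you'd rather pick the kept objects by rank instead of a hard count < 10 cutoff, forcats::fct_lump_n() is a sketch of the same idea; the n = 10 here is an arbitrary choice, and n_objects plays the same role as the "n =" label above.
object_count %>%
  # keep the 10 objects with the largest counts, lump the rest into "Others"
  mutate(group = forcats::fct_lump_n(object, n = 10, w = count,
                                     other_level = "Others")) %>%
  group_by(group) %>%
  summarize(avg = mean(count), n_objects = n()) %>%
  mutate(group = forcats::fct_reorder(group, avg)) %>%
  ggplot(aes(group, avg)) +
  geom_col() +
  coord_flip()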
If you're interested in the share of total count, you might also look at the cumulative count and see how the big values make up a large share:
object_count %>%
  mutate(cuml = cumsum(count)) %>%
  ggplot(aes(obj_num)) +
  geom_tile(aes(y = count + lag(cuml, default = 0),
                height = count))
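A simpler variant of the same idea, sketched with nothing beyond what's already in object_count, is to sort by count and plot the cumulative share directly, which reads like a Pareto curve:
object_count %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),
         share = cumsum(count) / sum(count)) %>%
  ggplot(aes(rank, share)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent)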