Plotting bar charts in R, geom_bar vs geom_col

By Synnøve Yndestad in R Visualizations

October 13, 2022

Plotting bar charts in R, geom_bar vs geom_col

Plotting the Nightingale data made me realize that there are more to plotting a bar chart than than first meets the eye.

While a histogram visualize the distribution of a numerical variable, a bar plot visualize the relationship between a categorical variable and a numerical variable.

ggplot has two functions for plotting bar charts, geom_bar and geom_col.

In short:
geom_bar() -> Counts Categorical units, no y input
geom_col() -> Plot the Numeric value, need numerical y input

By default, geom_bar() counts the number of occurrences for each level of a categorical variable. This makes the height of the bar equal to the number of cases in each level. The default setting in geom_bar() is stat = "count", position = "stack". Thereby, calling the default setting geom_bar() in a ggplot will stack each count in the bar on top of each other.

I will demonstrate with the Palmer penguins data set.

library(tidyverse)
library(palmerpenguins)
data("penguins")
# Remove missing values
(penguins = penguins %>% na.omit())
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # ℹ 323 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Count and plot the number of each species of penguins with geom_bar():

penguins %>% 
ggplot(aes(x = species, fill = species)) +
  geom_bar() +
  theme_bw() +
  ggtitle("The number of penguins of each species: \nCounting the number of cases with geom_bar()")

geom_col() uses the numerical values in the data for the height of the bars, and is useful for when you have already summarized the data.

# Summarize the data
penguins_summarized =   penguins %>% group_by(species) %>% 
                                     count(island)
penguins_summarized
## # A tibble: 5 × 3
## # Groups:   species [3]
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        55
## 3 Adelie    Torgersen    47
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      119

Use geom_col() to plot the summarized data:

penguins_summarized %>% 
ggplot(aes(x = species, y = n, fill = species )) +
  geom_col() +
  theme_bw() +
  ggtitle("The number of penguins of each species: \nPlot of summarized data with geom_col()")

geom_col() == geom_bar(stat = “identity”)

You can use geom_bar() for numerical values too if you specify stat = "identity" within geom_bar().

If you change geom_bar() to geom_bar(stat = "identity"), it will take a numerical input and perform identical to the geom_col() function. Then you will also need to specify y in the aes call.

geom_bar(stat = "identity") = Numerical

penguins_summarized %>% 
ggplot(aes(x = species, y = n, fill = species )) +
  geom_bar(stat = "identity") +
  theme_bw() +
  ggtitle("The number of penguins of each species: \nPlot of summarized data with geom_bar(stat = \"identity\")")

position = “stack”

geom_col() stacks the levels on top or each other, and has position = "stack" set as default. This can be visualized by adding the islands as fill.

penguins_summarized %>% 
ggplot(aes(x = species, y = n, fill = island )) +
  geom_col() +
  theme_bw() +
  ggtitle("The number of penguins of each species: \nPlot of summarized data with geom_col()")

position = “dodge”

If you want the columns next to each other, you need to specify position = "dodge".

penguins_summarized %>% 
ggplot(aes(x = species, y = n, fill = island )) +
  geom_col(position = "dodge") +
  theme_bw() +
  ggtitle("The number of penguins on the different islands by species:  \nPlot of summarized data with geom_col(position = \"dodge\")")

position = “identity”

If you change position to “identity”, the columns will “un-stack” and be plotted in front of each other. Be aware that this may hide some of your data. Where did the penguins from the Biscoe island go in the Adelie group?

penguins_summarized %>% 
ggplot(aes(x = species, y = n, fill = island )) +
  geom_col(position = "identity") +
  theme_bw() +
  ggtitle("The number of penguins on the different islands by species:  \nPlot of summarized data with geom_col(position = \"identity\")")

By arranging the count in descending order with arrange(desc(n)), the smallest group will be plotted in front, and all will be visible.

penguins_summarized %>% 
  arrange(desc(n)) %>%       
ggplot(aes(x = species, y = n, fill = island )) +
  geom_col(position = "identity") +
  theme_bw() +
  ggtitle("The number of penguins on the different islands by species:  \nPlot of summarized data with geom_col(position = \"identity\") when ordered")

position = “fill”

Changing the position to “fill” will plot a percent stacked barplot

penguins_summarized %>% 
ggplot(aes(x = species, y = n, fill = island )) +
  geom_col(position = "fill") +
  theme_bw() +
  ggtitle("The percentage of penguins on the different islands by species: \nPlot of summarized data with geom_col(position = \"fill\") ")

Summary:

geom_bar() -> Counts Categoricals, do not need y
geom_col() -> Plot the Numeric value, need y

geom_bar(stat = “identity”) -> Plot the Numeric value, need y

geom_col() == geom_bar(stat = “identity”)

position = “dodge” -> split the stack and puts the columns next to each other
position = “identity” -> split the stack and puts the columns behind each other
position = “fill” -> makes a percent stacked bar plot

Posted on:
October 13, 2022
Length:
5 minute read, 921 words
Categories:
R Visualizations
Tags:
tidyverse ggplot geom_bar() geom_col() Categoricals
See Also:
Read and merge multiple files by folder
Plot coxcomb diagrams like Florence Nightingale
For loop for Multiple Trend in Proportions