Box plots display the distribution of a single continuous variable. They are particularly useful with large amounts of data, where other plots cannot effectively display all the values.
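On this page we will create box plots with ggplot2::geom_boxplot(). Through examples we will demonstrate creating a single box plot, box plots split by a categorical variable, flipped box plots, box plots with jittered points, and violin plots.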
For demonstration we’ll load the bat_roost_tbl data from the mgrtibbles package (hyperlink includes install instructions). To preprocess the data we will:
Remove the RoostCode variable/column with dplyr::select()
Remove rows/observations with a MaxCountInAnyYear greater than or equal to 250 with dplyr::filter()
Remove rows/observations from the Country grouping “British Crown Dependencies” to retain the countries England, Northern Ireland, Scotland, and Wales with dplyr::filter()
Remove rows/observations where the Country or Species value is NA with tidyr::drop_na()
#Load package
library("mgrtibbles")
#bat_roost_tbl tibble for demonstration
bat_roost_tbl <- mgrtibbles::bat_roost_tbl |>
  #Select all but roost code column
  dplyr::select(!RoostCode) |>
  #Filter out rows with a MaxCountInAnyYear greater than or equal to 250
  # and remove observations from "British Crown Dependencies"
  dplyr::filter(MaxCountInAnyYear < 250 & Country != "British Crown Dependencies") |>
  #Remove rows where Country or Species is NA
  tidyr::drop_na(c("Country", "Species"))
#Preview the preprocessed tibble
bat_roost_tbl |> dplyr::glimpse()
Create a box plot of MaxCountInAnyYear (y axis). This will show the distribution of the maximum yearly population count across the tracked roosts in the UK.
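As a minimal sketch of this step, the code below draws a single box plot. So that it runs on its own, it uses a small made-up data frame, toy_roosts (a hypothetical stand-in, not the real data), with the same MaxCountInAnyYear column name. The object name single_box is also just for illustration.

```r
#Toy stand-in for bat_roost_tbl: these counts are made up for illustration
toy_roosts <- data.frame(
  MaxCountInAnyYear = c(3, 12, 18, 25, 31, 40, 47, 62, 88, 140, 230)
)

#Box plot of a single continuous variable mapped to the y axis
single_box <- toy_roosts |>
  ggplot2::ggplot(ggplot2::aes(y = MaxCountInAnyYear)) +
  ggplot2::geom_boxplot()

#Print the plot
single_box
```

To try this on the preprocessed data, swap toy_roosts for bat_roost_tbl. With no x aesthetic there is one box, and the x axis carries no meaning.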
Box plots become very powerful when comparing the distribution of the same continuous variable across the groups of a categorical variable.
Create a box plot of MaxCountInAnyYear (y) against the categorical variable Country (x). In aes() set x=Country so there is a box and whisker for each Country, separated on the x-axis.
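The grouped version can be sketched as follows. Again this is only an illustration: toy_roosts is a hypothetical data frame with made-up counts that borrows the Country and MaxCountInAnyYear column names, and country_box is an arbitrary object name.

```r
#Toy stand-in for bat_roost_tbl: these counts are made up for illustration
toy_roosts <- data.frame(
  Country = rep(c("England", "Scotland", "Wales"), each = 5),
  MaxCountInAnyYear = c(10, 25, 40, 80, 150,
                        5, 15, 30, 45, 60,
                        8, 20, 35, 90, 200)
)

#One box and whisker per Country, separated along the x axis
country_box <- toy_roosts |>
  ggplot2::ggplot(ggplot2::aes(x = Country, y = MaxCountInAnyYear)) +
  ggplot2::geom_boxplot()

#Print the plot
country_box
```

Because Country is a character column, the boxes are ordered alphabetically along the x axis. Swapping toy_roosts for bat_roost_tbl should give one box per UK country.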
If the categorical variable has a lot of groups, long names, or both, you can create a flipped box plot.
Create a box plot of the numerical variable MaxCountInAnyYear on the x-axis against the categorical variable Species on the y-axis. In aes() set y=Species and x=MaxCountInAnyYear so there is a box and whisker for each Species, separated along the y-axis.
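A possible sketch of the flipped layout is shown below. It assumes ggplot2 3.3.0 or later, where geom_boxplot() works out the orientation from which aesthetic is categorical. The toy_roosts data frame holds made-up counts and only borrows the Species and MaxCountInAnyYear column names. flipped_box is an illustrative object name.

```r
#Toy stand-in for bat_roost_tbl: these counts are made up for illustration
toy_roosts <- data.frame(
  Species = rep(
    c("Common pipistrelle", "Pipistrelle species", "Soprano pipistrelle"),
    each = 5
  ),
  MaxCountInAnyYear = c(12, 30, 55, 90, 180,
                        4, 10, 22, 35, 70,
                        20, 45, 75, 120, 210)
)

#Continuous variable on x, categorical variable on y gives horizontal boxes
flipped_box <- toy_roosts |>
  ggplot2::ggplot(ggplot2::aes(x = MaxCountInAnyYear, y = Species)) +
  ggplot2::geom_boxplot()

#Print the plot
flipped_box
```

Putting the long species names on the y axis keeps them horizontal and readable, which is the main reason to flip the plot.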
You can add points on top of a box plot to visualise the individual values alongside the distribution. We can add these points with ggplot2::geom_jitter(), which causes the points to “jitter” by adding a small amount of random noise to each point's position. With the categorical variable on the x axis, the points are spread out sideways within each category, and vice versa when the categorical variable is on the y axis. This makes the points less likely to overlap and so easier to read.
Adding points allows us to add a second categorical variable that we can colour the points with.
Create a box plot of MaxCountInAnyYear (y) against County (x), using only the Northern Ireland observations/rows.
To the box plot add the layer ggplot2::geom_jitter(). In the function add aes(colour=Species) so the points, and not the box and whiskers, are coloured by species. Additionally, set size=2 in ggplot2::geom_jitter() but outside of its aes() to make the points larger.
#Boxplot of max population count in bat colonies within counties of Northern Ireland
bat_roost_tbl |>
  #Filter to only retain Northern Ireland rows
  dplyr::filter(Country == "Northern Ireland") |>
  #Box plot
  ggplot2::ggplot(ggplot2::aes(y = MaxCountInAnyYear, x = County)) +
  ggplot2::geom_boxplot() +
  #Jittered points coloured by species
  ggplot2::geom_jitter(ggplot2::aes(colour = Species), size = 2)
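One optional variation, which is not part of the example above: because geom_jitter() draws every observation, any outliers that geom_boxplot() also draws can appear twice. The sketch below hides the box plot's own outlier points with outlier.shape = NA and limits the jitter to the horizontal direction with width and height. The toy_roosts data frame is made up for illustration and only borrows the tutorial's column names. The value width = 0.2 is an arbitrary choice.

```r
#Toy stand-in for the Northern Ireland rows: these values are made up
toy_roosts <- data.frame(
  County = rep(c("Antrim", "Down"), each = 6),
  Species = rep(c("Common pipistrelle", "Soprano pipistrelle"), times = 6),
  MaxCountInAnyYear = c(5, 12, 20, 28, 35, 200,
                        8, 15, 22, 30, 41, 180)
)

jitter_box <- toy_roosts |>
  ggplot2::ggplot(ggplot2::aes(x = County, y = MaxCountInAnyYear)) +
  #Hide the box plot's outlier points so they are not drawn twice
  ggplot2::geom_boxplot(outlier.shape = NA) +
  #Jitter sideways only, so the plotted counts stay at their true y values
  ggplot2::geom_jitter(ggplot2::aes(colour = Species),
                       size = 2, width = 0.2, height = 0)

#Print the plot
jitter_box
```

Setting height = 0 keeps the counts exact on the y axis, while a smaller width keeps each point visually inside its own box.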
Violin plot
Violin plots are very similar to box plots, but they display the probability density across the values. This can be more appropriate for datasets that do not follow a bell curve/normal distribution. In fact, some people prefer violin plots to box plots, myself included.
We will create a violin plot with jittered points of the different Pipistrelle species across the four countries.
Create a violin plot of MaxCountInAnyYear (y) against Country (x). Add a layer of jittered points coloured by Species.
#Violin plot of max population count of Pipistrelle colonies in the UK countries
bat_roost_tbl |>
  #Filter to only retain the Pipistrelle species rows
  dplyr::filter(Species %in% c("Common pipistrelle", "Pipistrelle species", "Soprano pipistrelle")) |>
  #Violin plot
  ggplot2::ggplot(ggplot2::aes(y = MaxCountInAnyYear, x = Country)) +
  ggplot2::geom_violin() +
  #Jittered points coloured by species
  ggplot2::geom_jitter(ggplot2::aes(colour = Species), size = 2)