Dplyr

Overview

dplyr is the main data manipulation package for tibbles in the tidyverse.

dplyr is described as a “grammar of data manipulation”: its functions are named after verbs that describe what they do to the data.

dplyr homepage

This website aims to quickly cover the most commonly used dplyr functions and their uses, so it does not cover every dplyr function. Please check the link below for the full list.

Full dplyr reference page

Sections

The dplyr material is split into several sections, which are summarised below.

Pipes

Pipes (|>) are a vital part of creating efficient and clear code with the tidyverse. A pipe passes the result on its left as the first argument of the function on its right, which lets you chain functions together. Pipes can be used with any function, not just those from the tidyverse.
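
As a minimal sketch of what chaining looks like, the example below compares nested calls with the piped equivalent. It assumes R 4.1 or later (where the native pipe |> is available) and uses a small hypothetical tibble called scores that is made up purely for illustration.

```r
library(dplyr)

# Hypothetical example data (not from the original page)
scores <- tibble(name = c("a", "b", "c"), score = c(3, 9, 6))

# Without a pipe: nested calls have to be read from the inside out
nested <- arrange(filter(scores, score > 4), desc(score))

# With the pipe: the same steps, read from top to bottom
piped <- scores |>
  filter(score > 4) |>
  arrange(desc(score))

# The pipe also works with base R functions
root_sum <- c(1, 4, 9) |> sqrt() |> sum()
```

Both nested and piped hold the same two rows; the piped version simply lists the steps in the order they happen.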

Rows

There are four main verbs (i.e. functions) to manipulate rows. These are:

  • arrange(): Arrange the rows of a tibble. Can be used to reorder the rows based on the values of a column.
  • distinct(): Extracts unique/distinct rows from a tibble.
  • filter(): Extract rows by filtering with conditions. This can be used to pick rows of certain groups, filter based on numeric sizes, and more.
  • slice(): A set of methods to choose a slice of rows based on index positions, top and bottom observations, and min and max values based on a specific column. This is especially useful for piping (|>).
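
The row verbs above can be sketched as follows. The pets tibble and its column names are hypothetical and exist only to give each verb something to work on.

```r
library(dplyr)

# Hypothetical example data; note the duplicated dog/12 row
pets <- tibble(
  species = c("dog", "cat", "dog", "cat", "dog"),
  weight  = c(12, 4, 30, 5, 12)
)

# arrange(): reorder rows, heaviest first
by_weight <- pets |> arrange(desc(weight))

# distinct(): keep only unique rows (the second dog/12 row is dropped)
unique_rows <- pets |> distinct()

# filter(): keep rows that satisfy every condition
heavy_dogs <- pets |> filter(species == "dog", weight > 10)

# slice(): pick rows by position
first_two <- pets |> slice(1:2)

# slice_min(): pick the row(s) with the smallest value in a column
lightest <- pets |> slice_min(weight, n = 1)
```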

Columns

There are six main verbs (i.e. functions) to manipulate columns. These are:

  • glimpse(): Print a tibble in a transposed manner. Useful for seeing the data types of all the columns.
  • mutate(): Mutate columns to create new columns based on existing ones, modify existing columns, and delete columns.
  • pull(): Pull out a single column from a tibble, resulting in a vector.
  • relocate(): Relocate columns. You can relocate columns to the start or end, and you can move them after or before specified columns.
  • rename(): Rename columns in a tibble.
  • select(): Select specific columns of a tibble. Can be used with a variety of helper functions such as starts_with(), ends_with(), contains(), and matches().
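
A short sketch of the column verbs above, using a hypothetical people tibble whose column names are invented for this example:

```r
library(dplyr)

# Hypothetical example data
people <- tibble(
  first_name = c("Ada", "Alan"),
  height_cm  = c(165, 178),
  weight_kg  = c(60, 72)
)

# glimpse(): one line per column, showing each column's type
glimpse(people)

with_bmi <- people |>
  mutate(
    height_m  = height_cm / 100,      # create a new column
    bmi       = weight_kg / height_m^2,
    height_cm = NULL                  # setting a column to NULL deletes it
  ) |>
  rename(name = first_name) |>        # new_name = old_name
  relocate(bmi, .after = name)        # move bmi to sit right after name

# pull(): extract one column as a plain vector
names_vec <- people |> pull(first_name)

# select(): keep columns, here chosen with a helper function
measures <- people |> select(ends_with("_cm"), ends_with("_kg"))
```

Note that mutate() evaluates its arguments in order, which is why bmi can use height_m on the line after it is created.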

If you would like to apply one of the column functions to multiple columns at once, you can look at the official documentation for the following functions:

  • across(): Operate on multiple columns simultaneously.
  • pick(): Select a subset of columns.
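
As a rough illustration of these two helpers, the sketch below assumes dplyr 1.1.0 or later (the release that introduced pick()) and R 4.1 or later for the \(x) function shorthand. The measurements tibble is hypothetical.

```r
library(dplyr)

# Hypothetical example data
measurements <- tibble(
  id = 1:3,
  a  = c(1.26, 2.51, 3.74),
  b  = c(10.04, 20.06, 30.05)
)

# across(): apply the same function to several columns inside mutate()
rounded <- measurements |>
  mutate(across(c(a, b), \(x) round(x, digits = 1)))

# pick(): hand a subset of columns, as a tibble, to another function
with_total <- measurements |>
  mutate(total = rowSums(pick(a, b)))
```

Both helpers are meant to be called inside verbs such as mutate() or summarise(), not on their own.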

Grouping

Tibbles can be grouped by one or more variables/columns. This allows for group-wise calculations.

  • group_by(): Converts a tibble to a grouped tibble.
  • count(): Counts the number of rows for each unique value (or combination of values) of the specified columns in a tibble.
  • summarise(): Produces a tibble with summary information on the group members in a grouped tibble.
    • Many functions can be used to calculate the summary information, including n(), mean(), median(), sd(), IQR(), first(), last(), and nth().
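
The grouping workflow above can be sketched with a small hypothetical sales tibble (the data and column names are invented for illustration):

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 5, 15, 40)
)

# count(): number of rows per region, returned in a column named n
region_counts <- sales |> count(region)

# group_by() + summarise(): one row of summary values per group
region_summary <- sales |>
  group_by(region) |>
  summarise(
    n_sales       = n(),
    mean_amount   = mean(amount),
    median_amount = median(amount),
    first_amount  = first(amount)
  )
```

Here region_summary has one row for "north" and one for "south", with each summary column calculated only from that region's rows.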

Bind tibbles

Tibbles can be combined/bound together with the following functions:

  • bind_cols(): Bind tibbles by columns (i.e. bind the tibbles side by side). The tibbles must have the same number of rows.
  • bind_rows(): Bind tibbles by rows (i.e. bind one tibble on top of the other). Columns are matched by name, so matching columns must have compatible types; a column that is missing from one of the tibbles is filled with NA.
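
A minimal sketch of both kinds of binding, using small hypothetical tibbles. Note that dplyr exports the row-binding function under the plural name bind_rows().

```r
library(dplyr)

# Hypothetical example data
left  <- tibble(id = 1:2, x = c("a", "b"))
right <- tibble(y = c(TRUE, FALSE))   # same number of rows as left
more  <- tibble(id = 3L, x = "c")     # same column names as left

# bind_cols(): place the tibbles side by side
side_by_side <- bind_cols(left, right)

# bind_rows(): stack one tibble on top of the other, matching columns by name
stacked <- bind_rows(left, more)
```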