Misc read options

As well as the delimiter (delim=) and the column types (col_types=), readr::read_delim() has other useful options.

The three covered here are:

id=: Adds an ID column containing the file path. Useful when reading in multiple files and combining the data into one tibble.
skip=: Skips a specified number of lines from the top of the file. Useful if the top of the file contains information (e.g. annotation) on the data that you don’t want included in the tibble.
n_max=: Specifies the maximum number of rows to read into the tibble. Useful for reading only a small part of the data when creating, testing, or debugging code.

File path column

The id= option can be used to add an ID column with the file path of the data. The important features of this are:

The column’s name is specified as a string e.g. id = "file_path".
The file path column is added as the first column to the resulting tibble.
The string passed to file= is used as the value in the created column.
When a single file is read, every row has the same value in this column.
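
The id= option is most useful when several files are read at once. The following is a minimal sketch of that case, assuming readr version 2.0 or later (where file= accepts a vector of paths). The two small CSV files, their names (site_a.csv, site_b.csv) and their columns are hypothetical and are created in a temporary directory only for illustration.

```r
library(readr)

# Hypothetical example data: two small CSV files with the same columns
tmp_dir <- tempdir()
file_a <- file.path(tmp_dir, "site_a.csv")
file_b <- file.path(tmp_dir, "site_b.csv")
writeLines(c("plant,height", "fern,12", "rose,30"), file_a)
writeLines(c("plant,height", "cactus,8"), file_b)

# A vector of paths given to file= is read and combined into one tibble.
# id= adds a first column recording which file each row came from.
combined <- readr::read_delim(
    file = c(file_a, file_b),
    delim = ",",
    id = "file_path",
    show_col_types = FALSE)

combined
```

Here the file_path column holds one value for the rows from each file, so the rows can later be grouped or filtered by their source file.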

ID as file path

Read in the file https://neof-workshops.github.io/Tidyverse/data/all_plant_details.csv, setting id= to "file_path".

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/all_plant_details.csv",
    delim = ",",
    id = "file_path") |>
    # Slice the first 5 rows and select the first 6 columns
    dplyr::slice(1:5) |>
    dplyr::select(1:6)

Rows: 155 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): common_name
dbl (32): id, seeds, drought_tolerant, salt_tolerant, thorny, invasive, trop...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 5 × 6
  file_path                  id common_name seeds drought_tolerant salt_tolerant
  <chr>                   <dbl> <chr>       <dbl>            <dbl>         <dbl>
1 https://neof-workshops…   425 flowering-…     0                1             0
2 https://neof-workshops…   426 flowering-…     0                1             0
3 https://neof-workshops…   427 flowering-…     0                1             0
4 https://neof-workshops…   428 flowering-…     0                1             1
5 https://neof-workshops…   434 Jacob's co…     0                0             0

ID as file name

Repeat the above, then use dplyr::mutate() and basename() so that the ID column contains only the file name (i.e. everything in the file path before the file name is removed).

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/all_plant_details.csv",
    delim = ",",
    id = "file_name") |>
    # Mutate to retain only the file name in the file_name column
    dplyr::mutate(file_name = basename(file_name)) |>
    # Slice the first 5 rows and select the first 6 columns
    dplyr::slice(1:5) |>
    dplyr::select(1:6)

Rows: 155 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): common_name
dbl (32): id, seeds, drought_tolerant, salt_tolerant, thorny, invasive, trop...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 5 × 6
  file_name                id common_name   seeds drought_tolerant salt_tolerant
  <chr>                 <dbl> <chr>         <dbl>            <dbl>         <dbl>
1 all_plant_details.csv   425 flowering-ma…     0                1             0
2 all_plant_details.csv   426 flowering-ma…     0                1             0
3 all_plant_details.csv   427 flowering-ma…     0                1             0
4 all_plant_details.csv   428 flowering-ma…     0                1             1
5 all_plant_details.csv   434 Jacob's coat      0                0             0

Skip lines

Data files may contain lines at the top of the file that describe the data but are not part of the table. The skip= option skips a specified number of lines at the top of such files.

If these information lines are not skipped, they can confuse the parsing of the data, since they will not contain the same number of delimiter characters as the actual data lines.
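
When the number of information lines is not known in advance, it can be worked out before reading. The following is a minimal sketch, assuming the first column name of the table is known (here the hypothetical name "plant"). The file and its contents are made up for illustration and written to a temporary file.

```r
library(readr)

# Hypothetical file: two free-text information lines above the table
info_file <- tempfile(fileext = ".csv")
writeLines(
    c("Survey of garden plants",
      "Collected by: example team",
      "plant,height",
      "fern,12",
      "rose,30"),
    info_file)

# Look at the top of the file and find the header line by its
# first column name (an assumption about this particular file)
top_lines <- readLines(info_file, n = 20)
header_line <- grep("^plant,", top_lines)[1]

# Skip every line above the header line
plants <- readr::read_delim(
    file = info_file,
    delim = ",",
    skip = header_line - 1,
    show_col_types = FALSE)

plants
```

The header is on line 3 of this file, so skip = 2 is used and the header line becomes the column names.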

File contents

Check the top 8 lines of the file: https://neof-workshops.github.io/Tidyverse/data/all_plant_details_w_info.csv.

Notice the first three lines are information and not data lines.

readLines("https://neof-workshops.github.io/Tidyverse/data/all_plant_details_w_info.csv", n=8)

[1] "Houseplant environment characteristics"
[2] "information about indoor plants, with multiple binary and categorical features."
[3] "Source: https://www.kaggle.com/datasets/noneee/houseplant-environment-characteristics"
[4] "id,common_name,seeds,drought_tolerant,salt_tolerant,thorny,invasive,tropical,indoor,flowers,cones,fruits,edible_fruit,leaf,edible_leaf,cuisine,medicinal,poisonous_to_humans,poisonous_to_pets,sunlight_part_sun_part_shade,sunlight_full_shade,sunlight_deep_shade,sunlight_part_shade,sunlight_full_sun_only_if_soil_kept_moist,sunlight_full_sun,sunlight_filtered_shade,care_level_encoded,maintenance_encoded,watering_encoded,growth_rate_encoded,cycle_perennial,cycle_herbaceous_perennial,cycle_annual"
[5] "425,flowering-maple,0,1,0,1,0,1,1,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,2,0,2,0,1,0,0"
[6] "426,flowering-maple,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0"
[7] "427,flowering-maple,0,1,0,0,0,1,1,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0"
[8] "428,flowering-maple,0,1,1,0,0,1,1,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,2,1,1,0,1,0,0"

Read in and skip lines

Read in the data skipping the first 3 lines.

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/all_plant_details_w_info.csv",
    delim = ",",
    skip = 3) |>
    # Slice the first 5 rows and select the first 6 columns
    dplyr::slice(1:5) |>
    dplyr::select(1:6)

Rows: 155 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): common_name
dbl (32): id, seeds, drought_tolerant, salt_tolerant, thorny, invasive, trop...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 5 × 6
     id common_name     seeds drought_tolerant salt_tolerant thorny
  <dbl> <chr>           <dbl>            <dbl>         <dbl>  <dbl>
1   425 flowering-maple     0                1             0      1
2   426 flowering-maple     0                1             0      0
3   427 flowering-maple     0                1             0      0
4   428 flowering-maple     0                1             1      0
5   434 Jacob's coat        0                0             0      0

Max number of lines

If you have a very large file and only want to read in a specified number of lines, use n_max=.

This is better than reading in the whole file and piping to dplyr::slice(), as the computer does not need to load the entire file into memory (RAM) only to discard most of it.
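
The two approaches can be compared with a minimal sketch. The file below is hypothetical (1,000 made-up rows written to a temporary file) and is far too small to show a memory difference. It only illustrates that both approaches return the same rows.

```r
library(readr)
library(dplyr)

# Hypothetical file with a header line and 1,000 data rows
big_file <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:1000, 1:1000 * 2, sep = ",")), big_file)

# Approach 1: stop reading after 3 data rows
first_rows <- readr::read_delim(
    file = big_file,
    delim = ",",
    n_max = 3,
    show_col_types = FALSE)

# Approach 2: read all 1,000 rows, then keep the first 3
sliced_rows <- readr::read_delim(
    file = big_file,
    delim = ",",
    show_col_types = FALSE) |>
    dplyr::slice(1:3)

# Both contain the same 3 rows
nrow(first_rows)
nrow(sliced_rows)
```

One point to keep in mind: with n_max= the column types are guessed from only the rows that are read, so for real data it may be worth setting col_types= explicitly.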

Read in the first 8 rows from https://neof-workshops.github.io/Tidyverse/data/all_plant_details.csv.

Note: The header line is not counted by n_max=, so the header line and the first 8 data rows are read in.

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/all_plant_details.csv",
    delim = ",",
    n_max = 8) |>
    #Select the first 6 columns to view
    dplyr::select(1:6)

Rows: 8 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): common_name
dbl (32): id, seeds, drought_tolerant, salt_tolerant, thorny, invasive, trop...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 8 × 6
     id common_name     seeds drought_tolerant salt_tolerant thorny
  <dbl> <chr>           <dbl>            <dbl>         <dbl>  <dbl>
1   425 flowering-maple     0                1             0      1
2   426 flowering-maple     0                1             0      0
3   427 flowering-maple     0                1             0      0
4   428 flowering-maple     0                1             1      0
5   434 Jacob's coat        0                0             0      0
6   502 hot water plant     0                0             0      0
7   540 desert rose         0                1             1      1
8   543 maidenhair fern     0                0             0      0
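
The three options can also be combined in one call, for example to preview a file that has information lines at the top. The following is a minimal sketch using a hypothetical file written to a temporary location. The file contents and column names are made up for illustration.

```r
library(readr)
library(dplyr)

# Hypothetical file: one information line, a header line, and four data rows
preview_file <- tempfile(fileext = ".csv")
writeLines(
    c("Info line (not data)",
      "plant,height",
      "fern,12",
      "rose,30",
      "cactus,8",
      "ivy,45"),
    preview_file)

# skip= removes the information line, the next line is used as the header,
# and n_max= limits the result to the first 2 data rows
preview <- readr::read_delim(
    file = preview_file,
    delim = ",",
    id = "file_name",
    skip = 1,
    n_max = 2,
    show_col_types = FALSE) |>
    # Keep only the file name in the ID column
    dplyr::mutate(file_name = basename(file_name))

preview
```

The skipped lines are removed before the header is read, and n_max= then counts data rows only.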