Column types

When reading in a file with readr::read_delim(), the function will try to infer the type (class) of the data in each column.

More information on tibbles and data classes can be found on the tibble structure page.

The inferred column types are not always correct, so you can choose them yourself with the col_types= option of the readr::read_delim() function.

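Column types can be specified with a compact string of single-character representations. Each character represents the type of one column, in order, so the number of characters must equal the number of columns (examples below). The characters and their data types are:

  • c: character
  • i: integer
  • n: number
  • d: double
  • l: logical
  • f: factor
  • D: date
  • T: date time
  • t: time
  • ?: guess the type
  • _ or -: skip the column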

Data

We’ll read in the file simple_example.csv, first specifying the column types.

Check file contents

View the file contents before reading it as a tibble.

readLines("https://neof-workshops.github.io/Tidyverse/data/simple_example.csv")
[1] "Integers,Doubles,Characters,Factors,Logicals"
[2] "101,0.345,One,A,TRUE"                        
[3] "21,3.14,sentence,B,TRUE"                     
[4] "3,78.9,is,B,FALSE"                           
[5] "0,20000.9,enough,A,FALSE"                    

Specify column types

Read in the file specifying the column types.

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/simple_example.csv",
    delim = ",", 
    col_types = "idcfl")
# A tibble: 4 × 5
  Integers   Doubles Characters Factors Logicals
     <int>     <dbl> <chr>      <fct>   <lgl>   
1      101     0.345 One        A       TRUE    
2       21     3.14  sentence   B       TRUE    
3        3    78.9   is         B       FALSE   
4        0 20001.    enough     A       FALSE   
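
The compact string is shorthand for a longer, named specification. As a minimal sketch, assuming readr 2.0 or later (for literal data wrapped in I()), the same types can be written out with readr::cols() and one col_*() function per column. The inline csv_text below is a copy of the file contents shown above, used only so the example runs without a download.

```r
library(readr)

# Inline copy of simple_example.csv so the sketch runs without a download.
csv_text <- "Integers,Doubles,Characters,Factors,Logicals
101,0.345,One,A,TRUE
21,3.14,sentence,B,TRUE
3,78.9,is,B,FALSE
0,20000.9,enough,A,FALSE
"

# Verbose equivalent of col_types = "idcfl": one col_*() call per column name.
typed <- read_delim(
    file = I(csv_text),
    delim = ",",
    col_types = cols(
        Integers   = col_integer(),
        Doubles    = col_double(),
        Characters = col_character(),
        Factors    = col_factor(),
        Logicals   = col_logical()
    )
)

# Report the first class of each column.
col_classes <- vapply(typed, function(x) class(x)[1], character(1))
print(col_classes)
```

The named form is longer, but it does not depend on column order and is easier to read when a file has many columns.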

Default inference

By default, readr::read_delim() will infer the data types of the columns. If you use this method, always check that the column types are what you want before further analysis.

Inference with message

Read in the file and leave col_types to the default inference method (i.e. do not specify the option).

Note that the column specification is printed between the row and column counts and the tibble itself.

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/simple_example.csv",
    delim = ",")
Rows: 4 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Characters, Factors
dbl (2): Integers, Doubles
lgl (1): Logicals

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 5
  Integers   Doubles Characters Factors Logicals
     <dbl>     <dbl> <chr>      <chr>   <lgl>   
1      101     0.345 One        A       TRUE    
2       21     3.14  sentence   B       TRUE    
3        3    78.9   is         B       FALSE   
4        0 20001.    enough     A       FALSE   

Overall the inference is fairly good, but it has:

  • Set the Integers column to doubles. Doubles are a safer default for numbers than integers (doubles can hold decimal values whilst integers cannot). However, in data science we may work with discrete data that we want as whole numbers, and therefore want the column to be an integer column. Examples of discrete data include the number of individuals, items, or game points.
  • Set the Factors column to characters. Our factors are words, which are interpreted as strings.
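
One way to correct just these two columns, sketched here under the assumption of readr 2.0 or later, is to name only those columns in readr::cols() and leave the rest to be guessed (the default for columns that are not named). The inline csv_text is a copy of the file contents, used so the example is self-contained.

```r
library(readr)

# Inline copy of simple_example.csv so the sketch runs without a download.
csv_text <- "Integers,Doubles,Characters,Factors,Logicals
101,0.345,One,A,TRUE
21,3.14,sentence,B,TRUE
3,78.9,is,B,FALSE
0,20000.9,enough,A,FALSE
"

# Override only the two columns that were inferred differently from what we want.
# Columns not named in cols() are still guessed.
fixed <- read_delim(
    file = I(csv_text),
    delim = ",",
    col_types = cols(
        Integers = col_integer(),
        Factors  = col_factor()
    )
)

# Report the first class of each column.
fixed_classes <- vapply(fixed, function(x) class(x)[1], character(1))
print(fixed_classes)
```

This keeps the convenience of inference for the columns it handles well while pinning down the ones that matter.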

Inference without message

I do not find the column specification message to be that useful. You can quiet the message by setting the option show_col_types = FALSE.

readr::read_delim(
    file = "https://neof-workshops.github.io/Tidyverse/data/simple_example.csv",
    delim = ",",
    show_col_types = FALSE)
# A tibble: 4 × 5
  Integers   Doubles Characters Factors Logicals
     <dbl>     <dbl> <chr>      <chr>   <lgl>   
1      101     0.345 One        A       TRUE    
2       21     3.14  sentence   B       TRUE    
3        3    78.9   is         B       FALSE   
4        0 20001.    enough     A       FALSE
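
With the message silenced, the inferred types can still be checked after reading. The sketch below, assuming readr 2.0 or later and an inline copy of the file contents, uses readr::spec() (the function named in the message) and a simple class check on each column.

```r
library(readr)

# Inline copy of simple_example.csv so the sketch runs without a download.
csv_text <- "Integers,Doubles,Characters,Factors,Logicals
101,0.345,One,A,TRUE
21,3.14,sentence,B,TRUE
3,78.9,is,B,FALSE
0,20000.9,enough,A,FALSE
"

# Read quietly, letting readr infer every column type.
quiet <- read_delim(
    file = I(csv_text),
    delim = ",",
    show_col_types = FALSE
)

# Retrieve the full column specification that readr used.
print(spec(quiet))

# Alternatively, check the first class of each column directly.
inferred <- vapply(quiet, function(x) class(x)[1], character(1))
print(inferred)
```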