Chapter 2 Normalising your data

As mentioned in the main tutorial document there are several different methods for normalising your data for downstream analysis. Here we will look at three different methods of normalising your data for downstream analysis.

To summarise so you have several methods in one place, we will also include the normalisation method that we used in the main tutorial.

Remember these are just a few examples of how to normalise your data and we recommend you read the recent literature and think about the properties of your own dataset when deciding the best way to proceed.

2.1 Relative abundance

The transform_sample_counts function transforms the sample counts from a taxa abundance matrix using a user-provided function. The counts of each sample will be transformed individually.

In this case we will transform the raw samples into relative abundances by dividing individual ASV counts by the total ASV counts in a sample.

phylo.prop <- transform_sample_counts(phylo, function(otu) otu/sum(otu))

2.2 Rarefaction

Rarefaction involves randomly removing reads until you reach a number often equal to or less than the number of reads in the smallest sample.

Here we rarefy to an even depth using the minimum number of reads found in a sample as specified by the sample.size=min(sample_sums(phylo)) argument.

The argument rngseed = 1 sets a random number seed used for random number generation, which is then used when subsampling a random number of reads from each sample. By setting the random seed you can make this process reproducible.
We have set the argument replace as replace = F. From the phylo.rarefied R help page "Two implications to consider are that (1) sampling with replacement is faster and more memory efficient as currently implemented; and (2), sampling with replacement means that there is a chance that the number of reads for a given OTU in a given sample could be larger than the original count value, as opposed to sampling without replacement where the original count value is the maximum possible."

phylo.rarefied <- rarefy_even_depth(phylo, rngseed = 1, 
                                    sample.size=min(sample_sums(phylo)), 
                                    replace = F)

2.3 Variance stabilising transformation using DESeq2

Variance stabilising transformation borrows methods from transcriptomic research using the DESeq2 package.

The workflow below is based on the workflow for variance stabilised data suggested here.

First we load the DESeq2 library.

library(DESeq2)

Now transform the phyloseq object into a format appropriate for DESeq2.

ds <- phyloseq_to_deseq2(phylo, ~SITE)

The function estimateSizeFactors is used to estimate the size factors for a DESeqDataSet - for more information on the normalisation methods used by DESeq2 please read the DESeq2 paper and manual.

ds<-estimateSizeFactors(ds)

For our dataset we get the error message "Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means"

This indicates we have many zeros in our count data (which is common in amplicon studies).

There are several ways to deal with this problem. One way is to calculate size factors separately using a zero-tolerant variant of geometric mean. For more details of this method see here.

# calculate geometric means prior to estimation of size factors
gm_mean <- function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans <- apply(counts(ds), 1, gm_mean)

We can now proceed and calculate size factors and dispersions of our dataset, before calculating the variance stabilizing transformation and transforming the count data with getVarianceStabilizedData.

ds <- estimateSizeFactors(ds, geoMeans = geoMeans)
ds <- estimateDispersions(ds)
ds_vst<-getVarianceStabilizedData(ds)

We then convert the transformed dataset back into a phyloseq object for further analysis.

ds_count_phylo <- otu_table(ds_vst, taxa_are_rows=T)
sample_info_tab_phy <- sample_data(meta.rmlow)
ds_phylo <- phyloseq(ds_count_phylo, sample_info_tab_phy)

This is just one way to deal with the problem of zero counts in our dataset. For more discussion on using DESeq2 for metabarcoding data and dealing with zero count data please check out the following links: