5 16S rRNA Analysis Method Assessment

Method steps
1. Sample to fastq
2. Seq to OTU
3. Downstream analysis
   • Taxonomic classification
   • Differential abundance
   • Ecological diversity

5.1 Sample to fastq

  • Benchmarking
    • using in vitro mock communities (e.g. cells, DNA, or PCR products)
    • technical replicates?
    • well-characterized benchmarking samples, similar to genomic RMs (deep sequenced with multiple methods)
    • Mixtures/titrations
    • Standard additions
    • Dilution to extinction?
  • Experimental design
  • Brooks et al. (2015) generated 80 mock communities from different combinations of 7 bacterial strains, producing mixtures of cells, DNA, and PCR products, to characterize the contributions of different steps in the measurement process. Results were evaluated using a mixture effect model.
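
As a concrete illustration of the mixture/titration designs listed above, here is a minimal R sketch of the expected composition of a two-sample titration series. The function, taxon names, and abundances are hypothetical illustrations, not taken from Brooks et al. (2015).

```r
# Sketch: expected relative abundances for a two-sample titration series.
# Observed abundances can be compared against these expected values to
# separate systematic measurement bias from noise.
titration_expected <- function(pre, post, theta) {
  # pre, post: relative abundance vectors for the two unmixed samples
  # theta: proportion of the `post` sample in the mixture
  stopifnot(length(pre) == length(post), theta >= 0, theta <= 1)
  mix <- theta * post + (1 - theta) * pre
  mix / sum(mix)  # renormalize to relative abundances
}

pre  <- c(taxonA = 0.7, taxonB = 0.2, taxonC = 0.1)
post <- c(taxonA = 0.1, taxonB = 0.3, taxonC = 0.6)
# Expected composition across the titration series
sapply(c(0, 0.25, 0.5, 0.75, 1), function(t) titration_expected(pre, post, t))
```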

5.2 Seq to OTU

5.2.1 Pre-processing

  • Fastq Quality Assessment
    • fastqc
    • R ShortRead
    • R Rqc
    • Issues
      • these methods were developed for evaluating individual datasets, whereas metagenomic studies involve large numbers of samples
      • read-level information - how to best separate out reads of different quality (see the expected-error sketch at the end of this subsection)
        • DADA2 requires sequences all of the same length; it is useful to be able to analyze quality at the per-read level
          • Huse's 454 work showed that some reads contain multiple errors whereas other reads have no errors
          • Want to be able to identify error-free reads for OTU calling, then use all reads for abundance estimates
  • Single org samples
  • Datasets from sequencing individual organisms with a standard 16S metagenomic pipeline have been used to evaluate sequencing errors and optimize filtering parameters.
    • Kunin et al. (2010) evaluated sequencing error rates and the use of different filtering thresholds for 454 pyrosequencing of an E. coli isolate.
    • Schloss (2010) evaluated the effects of alignment quality, distance calculation method, sequence filtering, and 16S region on 16S rRNA gene-based analyses.
  • Multiple org samples
    • Schirmer et al. (2015), using a mock community of 59 strains (49 bacteria and 10 archaea), evaluated different sequence processing methods and parameters for read filtering and quality trimming, merging of paired-end reads, and error correction. Methods were evaluated based on error rate reduction.

Albanese, D., Fontana, P., De Filippo, C., Cavalieri, D., & Donati, C. (2015). MICCA: a complete and accurate software for taxonomic profiling of metagenomic data. Scientific Reports, 5, 9743. http://doi.org/10.1038/srep09743
* Evaluates preprocessing in terms of the percentage of reads passing filtering, chimeras, redundant OTUs, and relative abundance

Gaspar, J. M., & Thomas, W. K. (2013). Assessing the consequences of denoising marker-based metagenomic data. PloS One, 8(3), e60458. http://doi.org/10.1371/journal.pone.0060458
* Evaluation of denoising and chimera detection methods for 454 data
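
Related to the per-read quality issue noted under 5.2.1 (identifying error-free reads for OTU calling while retaining all reads for abundance estimation), here is a minimal sketch of computing per-read expected errors from Phred quality scores with the ShortRead package. The file name and the 0.5 cutoff are illustrative assumptions, not recommendations.

```r
# Sketch: per-read expected error (EE) from base quality scores, the
# quantity used by maxEE-style filters. Reads with EE near zero are the
# best candidates for OTU/ASV definition; all reads can still be
# retained for abundance estimation.
library(ShortRead)

fq <- readFastq("sample_R1.fastq.gz")      # hypothetical input file
q  <- as(quality(fq), "matrix")            # per-base Phred scores (NA past read end)
ee <- rowSums(10^(-q / 10), na.rm = TRUE)  # expected number of errors per read

summary(ee)
table(likely_error_free = ee < 0.5)        # rough split; the cutoff is a judgment call
```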

5.2.2 Chimera Detection

Mysara, M., Saeys, Y., Leys, N., Raes, J., & Monsieurs, P. (2015). CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies. Applied and Environmental Microbiology, 81(December), 1573–1584. http://doi.org/10.1128/AEM.02896-14

Deblur - https://github.com/ekopylova/deblur
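
For orientation, a minimal sketch of what a chimera-removal step looks like in practice, here using DADA2's de novo bimera removal; this is shown only as a generic example of this stage and is neither the CATCh nor the deblur algorithm. `seqtab` is an assumed DADA2 sequence table.

```r
# Sketch: de novo chimera (bimera) removal with DADA2.
library(dada2)

seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)
1 - sum(seqtab_nochim) / sum(seqtab)  # fraction of reads flagged as chimeric
```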

5.2.3 OTU Assignment

  • De novo clustering has been shown to be more accurate (S. L. Westcott and Schloss 2015).
  • Types of datasets used to assess clustering methods: simulated data, mock communities, environmental datasets
  • Approaches to evaluating clustering methods
    • Cluster stability - similarity in clustering results when varying input data
    • Cluster reproducibility - similarity in clustering results between methods
    • Number of artifact OTUs - number of observed OTUs relative to expected
      • Evaluating clustering methods using simulated and mock community data provides a level of truth (an expectation of the correct result).

5.2.3.1 Summary of Cluster Method Assessment Papers

  • Kopylova et al. (2014) used a combination of simulated data, mock communities, and data from environmental samples to compare clustering methods.
    • They evaluated the performance of the clustering methods using the F-measure and phylogenetic distance.
    • For environmental datasets the clustering methods were evaluated based on the dissimilarity of samples within a dataset, using the sum of squared deviations for UniFrac PCoA (Procrustes M2) and similarity to UCLUST (Pearson’s correlation). NOTE Still not sure what they are evaluating with Procrustes.
    • The authors excluded singletons from their analysis; while this helps to eliminate spurious OTUs, it may also exclude rare taxa.
  • Schloss (2016) argues that using simulated data and mock communities is not representative of real environmental samples. The author additionally argues that the approach used by Kopylova et al. (2014) to evaluate clustering methods confounds the impact of read filtering with that of the clustering methods.
    • The author evaluated environmental datasets using the Matthews correlation coefficient (MCC) over a truth table (TP, TN, FP, FN) based on the distances between sequences in the same and different clusters (see the pair-counting sketch at the end of this subsection).
      • This method for evaluating clusters was first used in Schloss and Westcott (2011).
    • Average neighbor performed best in the MCC-based assessment. However, this evaluation is biased towards that method, as results are assessed strictly on the sequence distance criterion. Other clustering methods attempt to address more nuanced issues with clustering, such as sequencing error and variability in sequence diversity, and therefore allow for differences in distances within and between clusters.
    • This approach to cluster evaluation is particularly suited to algorithms based on analytical proofs.
    • For example, the DNACLUST algorithm employs a heuristic based on a proof that, for any given cluster, the distance between two sequences in the cluster is less than the defined threshold value, and no sequence is closer to another cluster center than to the center of the cluster it is assigned to (Ghodsi, Liu, and Pop 2011).
      • Ghodsi, Liu, and Pop (2011) used a similar approach to validate their clustering method, though focusing only on within-cluster pairwise distances.
  • Cluster robustness is another attribute for evaluating clustering methods (Y. He et al. 2015).
    • Cluster robustness is the reproducibility of the OTU table generated by a clustering algorithm.
    • Cluster robustness was evaluated based on:
      • Unstable sequences - sequences represented by different centroids
      • Unstable OTUs - OTUs whose membership was impacted by the number of sequences in a dataset
      • the Matthews correlation coefficient
        • Truth table definitions
          • TP if two sequences clustered together in both the full and subsampled datasets
          • TN if two sequences clustered separately in both the full and subsampled datasets
          • FN if two sequences clustered together in the full but not the subsampled dataset
          • FP if two sequences clustered separately in the full but together in the subsampled dataset
        • The MCC method uses the full dataset as the true value against which the subsampled dataset is benchmarked.
        • NOTE Is this a valid assumption for clustering?
      • Impact of cluster stability on rarefaction curves and PCoA with Bray-Curtis and UniFrac distances.
      • Tested for significance using ADONIS, a non-parametric test measuring effect size, i.e. the amount of the observed variance explained by metadata variables. NEED REF - See the vegan manual for potential references.
    • The resulting OTU table is dependent on the number of sequences clustered.
    • For some methods the OTU table is dependent on the order in which the sequences are provided (Ghodsi, Liu, and Pop 2011).
    • Closed-reference clustering was the only method that produced stable clusters.
    • This method discards sequences that do not cluster with reference sequences.
    • Open-reference clustering, which performs de novo clustering of unmatched sequences, provides a more robust alternative to de novo clustering while still capturing the diversity of novel OTUs.
  • Cluster method robustness can also be calculated using cross-validation (W. Chen et al. 2013).
    • W. Chen et al. (2013) subsampled the data to 90% five times and compared the clusters between replicates.
    • Methods were evaluated using precision, recall, and NID
      • precision - measure of cluster homogeneity
      • recall - measure of cluster completeness
      • NID (normalized information distance) - a global, information-based measure of agreement between clusterings
    • See assessment of cluster quality for metric definitions
  • Cluster robustness can also be evaluated based on threshold and 16S region (Schmidt, Matias Rodrigues, and von Mering 2015).
    • The authors focus their assessment on the comparability of results among different methods and note the importance of reproducibility.
    • Quote from the paper: “how robust are biological findings to the choice of clustering method? We found that OTU demarcation may indeed be replicable: different methods provided (almost) identical partitions when twice clustering the exact same sets of sequences, but in randomized order (Fig. 4, diagonals). However, trends in reproducibility were less clear.”
  • Cluster method assessment based on OTU inflation.
    • Clustering methods have also been evaluated based on the number of predicted OTUs relative to the number of expected OTUs (Huse et al. 2010; Kopylova et al. 2014).
    • Here either simulated data or data from a mock community are used, where the expected number of OTUs is based on the number of strains or genomes used to generate the assessment dataset.
    • A limitation of this approach is that it does not account for contaminants when in vitro mock communities are used, and neither in silico nor in vitro datasets represent the true complexity and diversity of environmental samples.
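
A minimal sketch of the pair-counting truth table underlying the MCC-based assessments above; the same construction covers both the distance-based truth of Schloss and Westcott (2011) and the full-versus-subsampled comparison of Y. He et al. (2015), depending on what is taken as the reference clustering. The cluster labels below are illustrative.

```r
# Sketch: Matthews correlation coefficient over sequence pairs, given a
# "reference" clustering (e.g. the full dataset, or a distance-based truth)
# and a "test" clustering (e.g. a subsampled run or an alternative method).
pair_mcc <- function(ref, test) {
  stopifnot(length(ref) == length(test))
  idx <- t(combn(length(ref), 2))                 # all sequence pairs
  same_ref  <- ref[idx[, 1]]  == ref[idx[, 2]]
  same_test <- test[idx[, 1]] == test[idx[, 2]]
  tp <- sum( same_ref &  same_test)               # together in both
  tn <- sum(!same_ref & !same_test)               # apart in both
  fn <- sum( same_ref & !same_test)               # together in reference only
  fp <- sum(!same_ref &  same_test)               # together in test only
  (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}

ref  <- c(1, 1, 1, 2, 2, 3, 3, 3)  # clustering of the full dataset
test <- c(1, 1, 2, 2, 2, 3, 3, 3)  # clustering of the same reads after subsampling
pair_mcc(ref, test)
```

The same pair counts also give a pair-counting form of precision (tp / (tp + fp)) and recall (tp / (tp + fn)), which maps onto the homogeneity/completeness framing of W. Chen et al. (2013), though their exact definitions may differ.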

5.3 Downstream analysis

5.3.1 Transformations

  • NEED TO FIND A HOME
  • Love, Huber, and Anders (2014) plot the standard deviation against the rank of the mean count values as a diagnostic plot for validating the regularized log transformation method.
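
A minimal sketch of that diagnostic using the DESeq2 and vsn packages; `dds` is an assumed DESeqDataSet. A roughly flat curve indicates the transformation has removed the dependence of the variance on the mean.

```r
# Sketch: standard deviation vs rank of the mean as a diagnostic for the
# regularized log transformation.
library(DESeq2)
library(vsn)

rld <- rlog(dds)                      # regularized log transformation
meanSdPlot(assay(rld), ranks = TRUE)  # SD vs rank(mean) across features
```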

5.3.2 Taxonomic Classification

5.3.3 Differential Abundance

  • Key parameters
    • sample size
    • effect size (fold change)
    • abundance (count)
  • Jonsson et al. (2016) evaluated different methods for whole metagenome sequencing
    • Compared ROC curves for simulated datasets
    • Sampled datasets and constructed a truth set of differentially abundant genes by replacing count values with draws from a binomial distribution with probability \(1/q\), where \(q\) is defined as the effect size.
  • Anders and Huber (2010)
    • A key issue with differential expression and differential abundance methods (high-throughput assays in general) is controlling the false positive rate arising from multiple testing.
    • Figure 2 is a diagnostic plot for evaluating the false positive (Type-I) error rate, with the p value plotted against the empirical CDF; the expectation is that the empirical CDF has a linear relationship with the p value (slope 1, intercept 0). Curves above the expectation represent a higher than expected Type-I error rate for a given \(\alpha\), and curves below a lower than expected Type-I error rate. A low Type-I error rate comes at the cost of a higher Type-II error rate and lower power, the ability to correctly reject the null hypothesis (see the sketch at the end of this list).
  • Love, Huber, and Anders (2014) - method developed for RNA-seq but has been applied to 16S (???)
    • Used simulated datasets to evaluate sensitivity and false discovery rate for different sample and effect sizes (Figure 6)
    • Used real data split into evaluation and verification datasets; the example dataset has 10 and 11 replicates for case and control respectively.
      The data were randomly split into evaluation and verification datasets with 7/8 and 3/3 replicates, and the random split was repeated 30 times. The algorithm used on the verification set was rotated; this approach was used instead of a consensus or single method so as not to bias the comparison toward a single algorithm or group of algorithms.
  • McMurdie and Holmes (2014) compared differential abundance detection methods in conjunction with different normalization methods
    • Simulated OTU differences for real datasets by perturbing OTUs to have defined effect sizes between treatment and control conditions.
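
To make the Type-I error diagnostic described for Anders and Huber (2010) concrete, here is a self-contained sketch on simulated null data: with no true differences, p values should be approximately uniform and their empirical CDF should track the identity line. The distribution parameters and the test are arbitrary choices for illustration.

```r
# Sketch: empirical CDF of p values as a Type-I error diagnostic.
# Null simulation: both groups share one mean, so every rejection is a
# false positive and the ECDF should have slope 1 and intercept 0.
set.seed(1)
pvals <- replicate(2000, {
  counts <- rnbinom(10, mu = 100, size = 5)  # one common negative binomial
  grp <- rep(c("a", "b"), each = 5)
  wilcox.test(counts ~ grp)$p.value
})

plot(ecdf(pvals), main = "ECDF of null p values")
abline(0, 1, lty = 2)   # expectation under a uniform p value distribution
mean(pvals < 0.05)      # empirical Type-I error rate at alpha = 0.05
```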

5.3.4 Ecological Diversity

References

Brooks, J Paul, David J Edwards, Michael D Harwich, Maria C Rivera, Jennifer M Fettweis, Myrna G Serrano, Robert A Reris, et al. 2015. “The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies.” BMC Microbiology 15 (1): 66. doi:10.1186/s12866-015-0351-6.

Kunin, Victor, Anna Engelbrektson, Howard Ochman, and Philip Hugenholtz. 2010. “Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates.” Environmental Microbiology 12 (1): 118–23. doi:10.1111/j.1462-2920.2009.02051.x.

Schloss, Patrick D. 2010. “The Effects of Alignment Quality, Distance Calculation Method, Sequence Filtering, and Region on the Analysis of 16S rRNA Gene-Based Studies.” PLoS Comput Biol 6 (7): e1000844. doi:10.1371/journal.pcbi.1000844.

Schirmer, Melanie, Umer Z. Ijaz, Rosalinda D’Amore, Neil Hall, William T. Sloan, and Christopher Quince. 2015. “Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform.” Nucleic Acids Research 43 (6). doi:10.1093/nar/gku1341.

Westcott, Sarah L, and Patrick D Schloss. 2015. “De novo clustering methods out-perform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units.” PeerJ 3: e1487. doi:10.7717/peerj.1487.

Kopylova, Evguenia, Jose A Navas-Molina, Céline Mercier, and Zech Xu. 2014. “Open-Source Sequence Clustering Methods Improve the State of the Art.” mSystems 1 (1): 1–16. doi:10.1128/mSystems.00003-15.

Schloss, Patrick D. 2016. “Application of database-independent approach to assess the quality of OTU picking methods.”

Schloss, Patrick D., and Sarah L. Westcott. 2011. “Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rRNA Gene Sequence Analysis.” Applied and Environmental Microbiology 77 (10): 3219–26. doi:10.1128/AEM.02810-10.

Ghodsi, Mohammadreza, Bo Liu, and Mihai Pop. 2011. “DNACLUST: accurate and efficient clustering of phylogenetic marker genes.” BMC Bioinformatics 12 (1): 271. doi:10.1186/1471-2105-12-271.

He, Yan, J Gregory Caporaso, Xiao-Tao Jiang, Hua-Fang Sheng, Susan M Huse, Jai Ram Rideout, Robert C Edgar, et al. 2015. “Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity.” Microbiome 3 (1): 20. doi:10.1186/s40168-015-0081-x.

Chen, Wei, Clarence K. Zhang, Yongmei Cheng, Shaowu Zhang, and Hongyu Zhao. 2013. “A Comparison of Methods for Clustering 16S rRNA Sequences into OTUs.” PLoS ONE 8 (8). doi:10.1371/journal.pone.0070837.

Schmidt, Thomas S B, João F. Matias Rodrigues, and Christian von Mering. 2015. “Limits to robustness and reproducibility in the demarcation of operational taxonomic units.” Environmental Microbiology 17 (5): 1689–1706. doi:10.1111/1462-2920.12610.

Huse, Susan M, David Mark Welch, Hilary G Morrison, and Mitchell L Sogin. 2010. “Ironing out the wrinkles in the rare biosphere through improved OTU clustering.” Environmental Microbiology 12 (7): 1889–98. doi:10.1111/j.1462-2920.2010.02193.x.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8.

Jonsson, Viktor, Tobias Österlund, Olle Nerman, and Erik Kristiansson. 2016. “Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics.” BMC Genomics 17 (1): 78. doi:10.1186/s12864-016-2386-y.

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11 (10): R106. doi:10.1186/gb-2010-11-10-r106.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Comput Biol 10 (4): e1003531. doi:10.1371/journal.pcbi.1003531.