Lu Zhang¹, Milena Kovacevic², Milos Popovic², Sebastian Wernicke¹; ¹Seven Bridges Genomics, Cambridge, MA, USA; ²Seven Bridges Genomics, Belgrade, Serbia
Abstract
Background: Whole-exome sequencing has demonstrated remarkable success in identifying disease-causing genetic variation. In order to be applicable in clinical settings, however, consistency and accuracy of variant detection in sequencing data must be ensured. The same data set should yield the same variants.
Method: We compared the SNPs and indels (small insertions and deletions) detected by two variant calling pipelines in two different sequencer runs of the same sample. Pipeline 1 was a “standard” BWA + GATK pipeline with all parameters, including subsequent filtering and region selection, conforming to the best practices described by The Broad Institute. Pipeline 2 employed the open source Mosaik aligner and the FreeBayes variant caller, both with the best-practice parameter settings recommended by the tool authors. The input data came from NA12878 (CEU), the most sequenced human individual in the 1000 Genomes Project, and comprised two sets of paired-end exome sequencer runs: SRR292250 from an Illumina HiSeq 2000 and SRR088693 from an Illumina GA II. The identified exome variants were evaluated against a comprehensive set of QC metrics, including the distribution of called variants relative to the dbSNP 135 database, the SNP transition-to-transversion (Ti/Tv) ratio, and the SNP and indel call set concordances between the replicates and pipelines.
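As an illustration of one of the QC metrics named above, the following is a minimal Python sketch of a Ti/Tv calculation. It is not the authors' implementation: the function name `ti_tv_ratio` and the input format (a list of reference/alternate base pairs) are assumptions made for this example, and the sample SNPs are made up.

```python
# Transitions swap a purine for a purine (A<->G) or a pyrimidine for a
# pyrimidine (C<->T); every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return the transition/transversion ratio for (ref, alt) base pairs.

    Hypothetical helper: records that are not single-base substitutions
    are skipped rather than counted.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # not a SNP (e.g. an indel or a no-change record)
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Made-up example: 4 transitions and 2 transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(example_snps))
```

Because each base has one possible transition and two possible transversions, substitutions drawn uniformly at random would give a ratio of 0.5. A call set whose ratio drifts toward that value is therefore more likely to be enriched for errors, which is the reasoning behind using Ti/Tv as a quality indicator.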
Results: Both pipelines identified a high percentage of SNPs that were present in dbSNP 135 (99.73% and 97.49%, respectively); the consensus set had the highest concordance with known variants (99.80%). With BWA + GATK, the technical replicates resulted in a consensus of 21690 out of 22254 known SNPs, or 97.5%. The Mosaik + FreeBayes pipeline identified 687 novel SNPs compared to 67 for the BWA + GATK pipeline; however, judging by the Ti/Tv ratio, the former are more likely to be sequencing errors while the latter are more likely to be “true” variants. The BWA + GATK pipeline identified 2112 indels with a high concordance to known indels of >91%, compared to 2457 indels with a concordance of >82% using Mosaik + FreeBayes. As with the SNPs, the consensus set of indels from both pipelines produced the highest concordance, at over 95%. With BWA + GATK, the technical replicates resulted in a low consensus of only 889 out of 1147 indels, or 77.5%.
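The consensus figures above can be illustrated with a short Python sketch. It rests on assumptions: the abstract does not say how variants were matched or which denominator was used, so the `(chromosome, position, ref, alt)` key, the hypothetical `consensus` helper, and the choice of the union as denominator are illustrative only, and the toy variants are made up. The last lines simply recompute the percentages from the counts reported in the Results.

```python
def consensus(calls_a, calls_b):
    """Return the calls present in both sets and their share of the union.

    Assumption: two calls match when their (chrom, pos, ref, alt) keys
    are identical; real comparisons may need variant normalization first.
    """
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    fraction = len(shared) / len(union) if union else 0.0
    return shared, fraction


# Toy, made-up call sets standing in for two sequencer runs.
run_1 = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "GA")}
run_2 = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "GA"), ("chr3", 9, "T", "C")}

shared, fraction = consensus(run_1, run_2)
print(len(shared), fraction)  # 2 shared calls out of 4 distinct calls

# Percentages recomputed from the counts reported in the abstract.
snp_pct = round(100 * 21690 / 22254, 1)    # SNP replicate consensus
indel_pct = round(100 * 889 / 1147, 1)     # indel replicate consensus
print(snp_pct, indel_pct)
```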
Conclusion: Although both variant calling pipelines achieved a relatively high accuracy on each dataset, each pipeline as well as each sequencer run produced a significant number of non-overlapping results. This demonstrates that an empirical choice of programs and parameters can miss important variants and introduce false positives. While the accuracy of variant calls can be increased through a consensus of methods (as was done, e.g., in the 1000 Genomes Project), our analysis indicates that the current state of alignment and variant calling is not ready for standardization on a single tool. It will be imperative to ensure consensus results in order for NGS data to be clinically useful.