Data Quality Control of High-Throughput Sequencing: Importance and Protocols

CD Genomics Blog

Posted on January 27, 2021

High-throughput sequencing is the most powerful approach for detecting germline variants, somatic mutations, and structural variants. Whole-genome sequencing, exome sequencing, and targeted region sequencing are the three most common DNA sequencing strategies. While highly useful, sequencing data poses significant bioinformatics challenges in data handling, computation time, and variant detection accuracy. Quality control effort typically concentrates on the raw data stage rather than on alignment and variant calling. Quality control, however, is important at all three stages. At the raw data stage, it acts as a quick check that removes data with severe quality problems and flags data of questionable quality. At the alignment stage, it focuses on the quality of the alignment, which is important for accurate variant identification. Quality control at the variant calling stage is the last opportunity to identify samples with quality problems that went undetected at the earlier stages, and to further reduce false-positive variants.

Three steps of data quality control:

Quality control of raw data

Raw data quality control should be the first step of data processing in any sound study. The FASTX-Toolkit, which can report base quality and nucleotide distribution, was one of the first tools used for quality control of raw sequencing data. FastQC is a more sophisticated tool dedicated to raw data quality control. It provides several additional metrics that are not available in the FASTX-Toolkit, such as per-read mean base quality scores, the distribution of GC content, and detection of the most duplicated reads. Notably, FastQC can also assess raw data quality from BAM files, not only from FASTQ files.
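As an illustration of the kind of metrics described above, the following is a minimal Python sketch that computes mean quality per base position, mean quality per read, and overall GC fraction from FASTQ text. It is not how FASTX-Toolkit or FastQC are implemented; the function names `parse_fastq` and `raw_qc`, the two-read example, and the assumption of four-line records with Phred+33 quality encoding are all illustrative choices made here.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (sequence, quality_string) pairs from a FASTQ stream.

    Assumes the common four-lines-per-record layout (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def raw_qc(handle, phred_offset=33):
    """Summarise base quality and GC content (assumes Phred+33 by default)."""
    pos_sum, pos_n = [], []     # running totals per base position
    read_means = []             # mean quality of each read
    gc_bases = total_bases = 0
    for seq, qual in parse_fastq(handle):
        scores = [ord(ch) - phred_offset for ch in qual]
        for i, q in enumerate(scores):
            if i == len(pos_sum):   # first read reaching this position
                pos_sum.append(0)
                pos_n.append(0)
            pos_sum[i] += q
            pos_n[i] += 1
        if scores:
            read_means.append(sum(scores) / len(scores))
        gc_bases += sum(base in "GCgc" for base in seq)
        total_bases += len(seq)
    return {
        "per_base_mean_quality": [s / n for s, n in zip(pos_sum, pos_n)],
        "per_read_mean_quality": read_means,
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
    }


# Hypothetical two-read example: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nII##\n"
report = raw_qc(StringIO(EXAMPLE_FASTQ))
print(report)
```

A drop in mean quality toward the end of the reads, as in the second example read, is the sort of pattern a raw data check is meant to surface before alignment.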

Quality control on alignment

For any re-sequencing study, alignment provides further insight into sample quality and can help detect poor samples that passed raw data quality control. Quality control at the alignment stage, however, is not performed routinely. Exome sequencing and whole-genome sequencing call for different alignment quality control metrics. Three major exome capture kits are available: Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ. Across these kits, the capture regions range from 37.6 to 62.1 million base pairs, and capture efficiency differs by capture method. Capture efficiency is the most important quality control metric in exome sequencing and other targeted sequencing.
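To make the capture efficiency metric concrete, here is a small Python sketch that treats it as the fraction of aligned reads overlapping at least one target region. That definition is an assumption made for this example (base-level definitions are also used), and the function names, the BED-style 0-based half-open coordinates, and the toy intervals are hypothetical. A real analysis would read alignments from a BAM file and targets from the kit's BED file.

```python
from bisect import bisect_right


def build_index(targets):
    """Merge overlapping target intervals per chromosome.

    targets: iterable of (chrom, start, end), 0-based half-open (BED-style).
    Returns {chrom: (sorted_starts, merged_intervals)}.
    """
    by_chrom = {}
    for chrom, start, end in targets:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:   # overlapping or adjacent: extend
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps_target(index, chrom, start, end):
    """Return True if the read [start, end) overlaps any target interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Last merged interval that begins at or before the read's final base.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start


def capture_efficiency(reads, targets):
    """Fraction of aligned reads that overlap a target region."""
    index = build_index(targets)
    total = on_target = 0
    for chrom, start, end in reads:
        total += 1
        on_target += overlaps_target(index, chrom, start, end)
    return on_target / total if total else 0.0


# Hypothetical capture regions and aligned read positions.
targets = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
reads = [
    ("chr1", 90, 110),   # overlaps the first target
    ("chr1", 300, 350),  # starts exactly where the target ends: off target
    ("chr2", 60, 70),    # inside a target
    ("chr3", 0, 50),     # chromosome without targets
]
efficiency = capture_efficiency(reads, targets)
print(f"capture efficiency: {efficiency:.2f}")
```

Merging the intervals first keeps the lookup to one binary search per read, which matters little for a toy example but scales to the tens of millions of base pairs that capture kits cover.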

Quality control on variant calling

For most exome sequencing studies, identifying single nucleotide polymorphisms (SNPs) is one of the crucial steps leading to the study's final conclusions. Quality control on SNP calling helps detect bad samples that slipped past the raw data and alignment checks, and it also helps reduce the rate of false-positive SNP calls. In some cases, such as cross-contamination and mislabeling, a bad sample can pass both raw data and alignment quality control. Cross-contamination occurs when DNA from multiple samples is inadvertently mixed. Mislabeling occurs when samples are swapped through human error. Both scenarios produce DNA that does not represent the intended sample, yet can still yield high-quality raw data and alignments. Identity by descent and simple genotype concordance are useful statistics for detecting bad samples caused by cross-contamination. Mislabeling is difficult to detect unless additional samples from the same pedigree are sequenced.
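The genotype concordance idea mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not a full identity-by-descent estimate: it computes exact genotype concordance and the mean proportion of alleles shared identical by state between two samples. The genotype coding (0, 1, or 2 copies of the alternate allele, `None` for a missing call), the function name, and the example genotypes are assumptions made for this example.

```python
def compare_genotypes(sample_a, sample_b):
    """Compare two samples genotyped at the same ordered set of SNP sites.

    Genotypes are coded as alternate-allele counts (0, 1, 2); None = no call.
    Returns the number of sites compared, the fraction with identical
    genotypes, and the mean fraction of alleles shared (identity by state).
    """
    compared = identical = shared_alleles = 0
    for a, b in zip(sample_a, sample_b):
        if a is None or b is None:   # skip sites missing in either sample
            continue
        compared += 1
        identical += (a == b)
        shared_alleles += 2 - abs(a - b)   # 2, 1 or 0 alleles shared
    if compared == 0:
        return {"sites": 0, "concordance": None, "mean_ibs": None}
    return {
        "sites": compared,
        "concordance": identical / compared,
        "mean_ibs": shared_alleles / (2 * compared),
    }


# Hypothetical genotypes for two samples at five SNP sites.
sample_1 = [0, 1, 2, 1, None]
sample_2 = [0, 1, 2, 0, 2]
result = compare_genotypes(sample_1, sample_2)
print(result)
```

Comparing every pair of samples in a study this way can highlight pairs that are far more similar, or far less similar, than their recorded relationship would suggest. Which threshold counts as suspicious depends on the cohort and is not something this sketch decides.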
