PacBio sequencing is based on sequencing by synthesis observed in real time. A DNA polymerase is immobilized at the bottom of each sequencing well, and the free dNTPs in solution carry a fluorescent label on their phosphate group. When the polymerase holds a dNTP for incorporation, excitation light causes the label to emit a fluorescent signal. Because the wells are so small, the excitation light reaches only a short distance, so only fluorescence from the bottom of the well is detected. Once the base has been incorporated, the fluorescent phosphate group is cleaved from the dNTP and diffuses away, and the signal drops. Each base incorporation therefore produces a fluorescent signal that rises from weak to strong and falls back again. A sensor records these signals in real time and converts them into digital form, yielding a curve of fluorescence intensity over time (a pulse curve) from which each base is identified.
Distinguishing 5mC from unmethylated cytosine is harder than detecting 6mA. 5mC has only a subtle effect on the kinetics of the DNA polymerase, so the pulse curves show no obvious differences. Overcoming this requires identifying variables that, taken together, can discriminate 5mC. PacBio sequencing both identifies base types and records real-time fluorescence pulse curves whose shape varies from base to base. The research team therefore systematically considered several variables that characterize cytosine bases and their surroundings in PacBio data: the sequence context of the C, the interval between neighboring pulses (interpulse duration, IPD), and the duration of each pulse (pulse width, PW).
1. Contextual Base Information:
Base recognition: Because PacBio identifies base types accurately, the bases surrounding each C (its sequence context) can be determined.
2. Fluorescence Pulse Curve Variables:
Interpulse duration (IPD): The time between adjacent pulses on the curve. It reflects how long the polymerase takes to move from one base to the next in the synthesized DNA strand.
Pulse width (PW): The duration of a single pulse, from its start to its end. It reflects the time between a base entering and leaving the incorporation step of the synthesized DNA strand.
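The two pulse-curve variables listed above can be derived from pulse start and end times. The following is a minimal sketch, not PacBio's own signal processing: the `Pulse` class and `pulse_kinetics` function are hypothetical names, and it assumes IPD is measured from the end of one pulse to the start of the next.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pulse:
    base: str     # called base: A, C, G or T
    start: float  # time the pulse begins, in seconds
    end: float    # time the pulse ends, in seconds


def pulse_kinetics(pulses: List[Pulse]) -> List[Tuple[str, Optional[float], float]]:
    """Return (base, ipd, pw) for each pulse, in order.

    Assumption: IPD is the gap between the end of the previous pulse
    and the start of this one. The first pulse has no IPD (None).
    """
    out = []
    prev_end = None
    for p in pulses:
        pw = p.end - p.start  # pulse width: duration of the pulse
        ipd = None if prev_end is None else p.start - prev_end
        out.append((p.base, ipd, pw))
        prev_end = p.end
    return out


# Toy pulses, not real instrument output.
pulses = [Pulse("A", 0.0, 0.25), Pulse("C", 0.75, 1.0), Pulse("G", 2.0, 2.5)]
kinetics = pulse_kinetics(pulses)
```

In this toy trace the C has an IPD of 0.5 s and a PW of 0.25 s. A methylated base would be expected to shift such values only slightly, which is why single values are not enough for 5mC calling.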
Using these variables, the research team sought to characterize cytosine bases and their immediate surroundings in PacBio sequencing data, with the aim of developing a robust assay for 5mC detection.
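One way to picture how these variables are combined is a small feature matrix centered on a cytosine. The sketch below is illustrative only and is not the authors' input layout: the function name `window_features`, the one-hot base encoding, the zero padding, and the `flank` parameter are all assumptions.

```python
from typing import List

BASES = "ACGT"


def window_features(bases: str, ipds: List[float], pws: List[float],
                    center: int, flank: int = 3) -> List[List[float]]:
    """Build one feature row per position in [center - flank, center + flank].

    Each row is a one-hot encoding of the base followed by its IPD and PW.
    Positions that fall outside the read are zero-padded (an assumption).
    """
    if bases[center] != "C":
        raise ValueError("center must point at a cytosine")
    rows = []
    for i in range(center - flank, center + flank + 1):
        if 0 <= i < len(bases):
            one_hot = [1.0 if bases[i] == b else 0.0 for b in BASES]
            rows.append(one_hot + [ipds[i], pws[i]])
        else:
            rows.append([0.0] * (len(BASES) + 2))
    return rows


# Toy read with per-base kinetics, not real data.
bases = "TACGTA"
ipds = [0.4, 0.3, 0.9, 0.5, 0.2, 0.6]
pws = [0.1, 0.2, 0.3, 0.2, 0.1, 0.2]
features = window_features(bases, ipds, pws, center=2, flank=3)
```

Changing `flank` changes how much sequence context and kinetic signal a classifier sees around each C.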
To build the training datasets, the authors generated a whole genome amplification (WGA) sequencing dataset as the negative dataset. Because the amplification uses unmethylated dNTPs, the amplified DNA is unmethylated. For the positive dataset, they sequenced DNA whose CpG sites had been methylated by treatment with the M.SssI enzyme. Sites in the negative dataset are therefore predominantly unmethylated, and any methylation signal can arise only from methylated sites in the original background genome.
M.SssI is central to this design. The enzyme is a CpG methyltransferase produced in an E. coli strain that carries the methyltransferase gene from Spiroplasma sp. strain MQ1, and it methylates all CpG sites in double-stranded DNA.
Positive training samples are drawn from the M.SssI-treated positive dataset, and further training samples are selected from the negative dataset with a moderate number of CpG sites. The two sets are combined to train the holistic kinetic (HK) model. The remaining samples are held out for model evaluation.
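The assembly of labelled training and held-out sets described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `build_split`, the 50% training fraction, the random shuffle, and the read identifiers are hypothetical, and the selection of negatives by CpG content is not modelled.

```python
import random
from typing import List, Tuple


def build_split(positive: List[str], negative: List[str],
                train_fraction: float = 0.5,
                seed: int = 0) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Label samples (1 = M.SssI-treated, 0 = WGA), shuffle, and split.

    The training fraction and the shuffle are assumptions made for
    illustration; the source does not state how the split was done.
    """
    labelled = [(x, 1) for x in positive] + [(x, 0) for x in negative]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(labelled)
    cut = int(len(labelled) * train_fraction)
    return labelled[:cut], labelled[cut:]


# Hypothetical sample identifiers.
positive = [f"msssi_read_{i}" for i in range(6)]
negative = [f"wga_read_{i}" for i in range(6)]
train, test = build_split(positive, negative)
```

Keeping the held-out samples separate from training is what makes the later sensitivity and specificity figures meaningful.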
PacBio's Sequel II sequencing kits are employed to generate sequencing data for model training.
The HK model distinguishes methylated from unmethylated cytosines effectively across test data generated with different sequencing kits. Receiver operating characteristic (ROC) curve analysis, summarized by the area under the curve (AUC), identifies 0.5 as the cut-off value for this discrimination.
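To make the AUC and the cut-off concrete, the sketch below computes the AUC as the probability that a randomly chosen methylated site scores higher than a randomly chosen unmethylated one, then applies a 0.5 cut-off. The scores and labels are toy values, and treating the model output as a score between 0 and 1 is an assumption.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores: 1 = methylated, 0 = unmethylated.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)

CUTOFF = 0.5  # cut-off value reported in the text
calls = [1 if s >= CUTOFF else 0 for s in scores]
```

The AUC summarizes performance over all possible cut-offs, whereas the calls depend on the single cut-off chosen.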
A Hidden Markov Model (HMM) is also evaluated as a point of comparison for the 5mC assay, using the BC01 sample, which has high sequencing depth. For this sample, the HMM's methylation detection performance (83% sensitivity and 84% specificity) is lower than that of the CNN-based HK model (87% sensitivity and 92% specificity).
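Sensitivity and specificity, the two figures quoted above, follow directly from the counts of correct and incorrect calls. A minimal sketch with toy calls and labels (the function name is hypothetical, and the numbers are not the study's data):

```python
from typing import List, Tuple


def sensitivity_specificity(calls: List[int], labels: List[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary calls against true labels.

    Sensitivity = TP / (TP + FN): share of methylated sites called methylated.
    Specificity = TN / (TN + FP): share of unmethylated sites called unmethylated.
    """
    tp = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 1)
    fn = sum(1 for c, y in zip(calls, labels) if c == 0 and y == 1)
    tn = sum(1 for c, y in zip(calls, labels) if c == 0 and y == 0)
    fp = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one methylated and one unmethylated site")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 1 = methylated, 0 = unmethylated.
toy_calls = [1, 1, 1, 0, 1, 0, 0, 0]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(toy_calls, toy_labels)
```

Reporting both values matters because a model can raise one at the expense of the other simply by moving its cut-off.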
The study also examines how window size, sequence context length, and sequencing depth affect methylation detection by the HK model.
References: