Unveiling 5mC Methylation with PacBio Sequencing and Machine Learning

Introduction to the PacBio Sequencing Principle

PacBio single-molecule real-time (SMRT) sequencing reads bases while they are being synthesized. A DNA polymerase is immobilized at the bottom of each sequencing well (a zero-mode waveguide), and fluorescently labeled dNTPs diffuse in and out of the well. Because the well confines the excitation light to a very small volume, a fluorescent signal is detected only while a labeled nucleotide is held in the polymerase's active site. At the end of each incorporation, the fluorophore-bearing phosphate group is cleaved from the nucleotide and diffuses away, so the signal drops back to baseline. Each incorporation therefore produces a pulse in which fluorescence rises and then falls. A sensor records these signals in real time and digitizes them, producing a curve of fluorescence intensity over time from which the bases are identified.

Challenges in 5mC Detection

Distinguishing 5mC from unmethylated cytosine is harder than detecting 6mA. 5mC has only a subtle effect on the kinetics of DNA polymerase, so the pulse curves of methylated and unmethylated cytosines differ very little. Overcoming this requires identifying variables that, taken together, can discriminate 5mC. PacBio sequencing identifies base types and records a real-time fluorescence pulse curve whose shape varies from base to base. The research team therefore considered several variables that characterize each cytosine and its surroundings: the sequence context of the C, the interpulse duration (IPD), which is the time between consecutive pulses, and the pulse width (PW), which is the duration of a single pulse.

1. Contextual Base Information:

Base recognition: Because PacBio sequencing identifies base types accurately, the bases surrounding each C (its sequence context) can be determined.

2. Fluorescence Pulse Curve Variables:

Interpulse duration (IPD): The time between two consecutive pulses. It reflects how long the polymerase takes to move from one incorporated base to the next in the synthesized DNA strand.

Pulse width (PW): The duration of a single pulse, from its start to its end. It reflects how long a nucleotide is held by the polymerase while it is incorporated into the synthesized DNA strand.
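As a rough illustration of how the two kinetic variables relate to pulse timing, the sketch below derives IPD and PW from a list of pulse start and end times. The Pulse structure, its field names, and the example timings are hypothetical and are not PacBio's data format; the sketch assumes IPD is measured from the end of one pulse to the start of the next.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pulse:
    """One fluorescence pulse (hypothetical structure, for illustration only)."""
    base: str      # base called for this pulse
    start: float   # time the pulse begins, in seconds
    end: float     # time the pulse ends, in seconds


def pulse_widths(pulses: List[Pulse]) -> List[float]:
    """PW: duration of each pulse (end minus start)."""
    return [p.end - p.start for p in pulses]


def interpulse_durations(pulses: List[Pulse]) -> List[float]:
    """IPD: gap between the end of the previous pulse and the start of this one.

    The first pulse has no predecessor, so its IPD is reported as 0.0 here
    (an arbitrary convention chosen for this sketch).
    """
    if not pulses:
        return []
    ipds = [0.0]
    for prev, cur in zip(pulses, pulses[1:]):
        ipds.append(cur.start - prev.end)
    return ipds


# Made-up timings for three consecutive incorporations.
pulses = [
    Pulse("A", 0.00, 0.10),
    Pulse("C", 0.30, 0.45),
    Pulse("G", 0.50, 0.60),
]
pw = pulse_widths(pulses)
ipd = interpulse_durations(pulses)
```

Here the C pulse follows a comparatively long pause and is itself the widest pulse; small shifts of this kind are what a 5mC classifier has to pick up.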

By combining these variables, the research team sought to capture more information about each cytosine and its immediate surroundings during PacBio sequencing, with the aim of developing a robust assay for 5mC detection.
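One way to picture how sequence context and pulse kinetics can be combined is a small feature matrix built around each CpG cytosine. The following is a simplified sketch, not the layout used in the published model: the function name, the one-hot encoding, the flank parameter, and the example values are all assumptions made for illustration.

```python
from typing import List, Optional

BASES = "ACGT"


def window_features(seq: str, ipd: List[float], pw: List[float],
                    center: int, flank: int) -> Optional[List[List[float]]]:
    """Return one row per position in the window: [one-hot base (4), IPD, PW].

    Returns None when the window would run past either end of the read.
    """
    if center - flank < 0 or center + flank >= len(seq):
        return None
    rows = []
    for i in range(center - flank, center + flank + 1):
        one_hot = [1.0 if seq[i] == b else 0.0 for b in BASES]
        rows.append(one_hot + [ipd[i], pw[i]])
    return rows


# Made-up read with per-base kinetic values (arbitrary units).
seq = "ATCGTACGA"
ipd = [0.2, 0.3, 0.9, 0.4, 0.2, 0.3, 0.8, 0.5, 0.2]
pw = [0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1]

# Collect a window for every cytosine that sits in a CpG context.
cpg_windows = {}
for i in range(len(seq) - 1):
    if seq[i:i + 2] == "CG":
        feats = window_features(seq, ipd, pw, center=i, flank=2)
        if feats is not None:
            cpg_windows[i] = feats
```

Each window is a small matrix (positions by features) that a classifier can take as input; changing flank changes how much surrounding context the classifier sees.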

Machine Learning Workflow

Dataset Preparation

To build the training datasets, the authors generated a whole-genome amplification (WGA) sequencing dataset as the negative (unmethylated) dataset. Because the amplification uses unmethylated dNTPs, the amplified DNA carries no methylation. For the positive (methylated) dataset, they sequenced DNA whose CpG sites had been methylated with the M.SssI enzyme. The negative dataset therefore consists predominantly of unmethylated sites, and any residual methylation signal can come only from methylated sites in the original (background) genome.

Enzymatic Insight

M.SssI is a CpG methyltransferase. It is produced in an E. coli strain that carries the methyltransferase gene from Spiroplasma sp. strain MQ1, and it methylates the cytosines at all CpG sites in double-stranded DNA.

Model Training

Positive training samples are drawn from the M.SssI-treated positive dataset, and negative training samples are drawn from the WGA negative dataset, selecting those with a moderate number of CpG sites. The two sets are combined to train the holistic kinetic (HK) model, a convolutional neural network (CNN). The remaining samples are held out for model evaluation.
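The labeling and hold-out logic described here can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 50% training fraction, the fixed seed, and the one-value toy feature vectors are hypothetical and are not taken from the study.

```python
import random
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, label); 1 = methylated


def split_class(features: List[List[float]], label: int,
                train_fraction: float, rng: random.Random
                ) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle one class and split it into training and held-out samples."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train = [(x, label) for x in shuffled[:cut]]
    held_out = [(x, label) for x in shuffled[cut:]]
    return train, held_out


def build_datasets(positive: List[List[float]], negative: List[List[float]],
                   train_fraction: float = 0.5, seed: int = 0
                   ) -> Tuple[List[Sample], List[Sample]]:
    """Label M.SssI-treated sites 1 and WGA sites 0, then split each class."""
    rng = random.Random(seed)
    pos_train, pos_test = split_class(positive, 1, train_fraction, rng)
    neg_train, neg_test = split_class(negative, 0, train_fraction, rng)
    train = pos_train + neg_train
    rng.shuffle(train)  # mix the two classes before training
    return train, pos_test + neg_test


# Toy stand-ins for per-site feature vectors.
msssi_sites = [[0.9 + 0.01 * i] for i in range(10)]
wga_sites = [[0.3 + 0.01 * i] for i in range(10)]
train_set, test_set = build_datasets(msssi_sites, wga_sites)
```

Splitting each class separately keeps the methylated and unmethylated classes balanced in both the training and the held-out set, which is a design choice of this sketch rather than something the source specifies.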

Sequencing Technology

PacBio's Sequel II sequencing kits are employed to generate sequencing data for model training.

Model Proficiency

The HK model distinguishes methylated from unmethylated cytosines effectively across test data generated with different sequencing kits. Receiver operating characteristic (ROC) analysis, summarized by the area under the curve (AUC), identifies a cut-off value of 0.5 on the model's output score for this discrimination.

Comparative Evaluation

A hidden Markov model (HMM) is also introduced as a point of comparison for the 5mC assay, particularly on BC01, a sample with high sequencing depth. On this sample the HMM detects methylation with 83% sensitivity and 84% specificity, which is lower than the 87% sensitivity and 92% specificity achieved by the CNN-based HK model.
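Sensitivity, specificity, and AUC figures of this kind are computed from per-site scores and known labels. The sketch below shows the arithmetic on made-up data, assuming a site is called methylated when its score is at or above the cut-off; the function names and the example scores are illustrative and do not reproduce the study's data or code.

```python
from typing import List, Tuple


def sensitivity_specificity(scores: List[float], labels: List[int],
                            cutoff: float = 0.5) -> Tuple[float, float]:
    """Call a site methylated when score >= cutoff, then compare with labels."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        called_methylated = score >= cutoff
        if label == 1:
            if called_methylated:
                tp += 1
            else:
                fn += 1
        else:
            if called_methylated:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one sample of each class")
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (methylated, unmethylated) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: the first three sites are truly methylated.
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, cutoff=0.5)
area = auc(scores, labels)
```

With these toy values one methylated site falls below the cut-off and one unmethylated site falls above it, so sensitivity and specificity are both 2/3, while the AUC (8 of 9 pairs ranked correctly) does not depend on the cut-off at all.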

Exploring Variables

The study also examines how window size, contextual sequence length, and sequencing depth affect methylation detection by the HK model.

References:

Tse OYO, Jiang P, Cheng SH, Peng W, Shang H, Wong J, Chan SL, Poon LCY, Leung TY, Chan KCA, Chiu RWK, Lo YMD. Genome-wide detection of cytosine methylation by single molecule real-time sequencing. Proc Natl Acad Sci U S A. 2021 Feb 2;118(5):e2019768118.
Choy LYL, Peng W, Jiang P, Cheng SH, Yu SCY, Shang H, Olivia Tse OY, Wong J, Wong VWS, Wong GLH, Lam WKJ, Chan SL, Chiu RWK, Chan KCA, Lo YMD. Single-Molecule Sequencing Enables Long Cell-Free DNA Detection and Direct Methylation Analysis for Cancer Patients. Clin Chem. 2022 Sep 1;68(9):1151-1163.
Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods. 2010 Jun;7(6):461-5.

