Most cancer genome studies conducted so far have used short-read sequencing, which has primarily allowed for the identification of small-scale genomic alterations such as single nucleotide variants (SNVs) and short insertions and deletions (InDels). However, recent advances in sequencing technologies have enabled the detection of larger genomic structural variants (SVs) in various cancer types. These SVs are expected to have significant biological and clinical relevance.
Structural variants involve substantial rearrangements in the genome, such as chromosomal inversions and translocations. These alterations can give rise to oncogenic fusion genes, such as BCR-ABL, EML4-ALK, and KIF5B-RET. Large deletions are also common in tumor suppressor genes such as TP53, RB1, and PTEN, leading to loss of their expression and function.
Recognizing the importance of SVs, the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium has investigated large-scale genomic structural variations in addition to SNVs. The consortium has reported SV signatures for 38 different cancer types, aiming to improve our understanding of these alterations across diverse cancers.
While conventional analytical methods can infer the presence of SVs from short-read sequencing data, they often provide only partial information about the complete structure of these variants. To achieve more accurate and comprehensive detection of SVs, long-read sequencing technologies should be employed. Long-read sequencing enables the generation of extended reads, allowing for the direct observation and precise characterization of complex structural rearrangements within the cancer genome.
By utilizing long-read sequencing, researchers can obtain a more detailed and holistic view of genomic structural variants, thus enabling a deeper understanding of their functional implications and potential clinical significance in cancer research.
The study used genomic haplotype analysis to investigate non-small cell lung cancer (NSCLC) in 20 Japanese patients. The researchers combined long-read and short-read whole-genome sequencing (WGS) data to perform joint phasing analysis.
To identify single nucleotide polymorphisms (SNPs), the researchers aligned the short-read data with BWA-MEM and called variants with GATK. They aligned the long reads with minimap2. Based on the identified SNPs, they performed phasing to characterize somatic structural variants (SVs) and single nucleotide variants (SNVs) at the haplotype level.
The study showed that haplotyping based on SNP information was successful: approximately 56% of the SNPs detected in the normal genome were assigned to haplotype blocks. The effect of sequencing depth was also evaluated, and the results indicated that haplotype construction appeared to saturate at a depth of approximately 20x to 30x, with around 5,000 blocks constructed. The study concluded that sequencing data with a minimum depth of 20x could reasonably be used for tumor phasing analysis.
To assess the accuracy of the phasing results, the researchers examined, for given pairs of SNPs, whether the haplotype assignments obtained for the two SNPs agreed. The rate of discordance between tumor and normal genomes for such SNP pairs was similar to previous findings, indicating reasonable accuracy. Additionally, when the phasing results were compared with those from a different cohort of healthy Japanese individuals, 98.7% of the SNP-SNP associations were consistent. This suggests that the phasing information obtained from tumor and normal genomes is precise and can serve as a reference for further analysis of genomic mutations at the haplotype level.
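One way to make the idea of consistent SNP-SNP associations concrete is to check, for every pair of SNPs phased in two datasets, whether the pair is placed on the same haplotype (cis) or on opposite haplotypes (trans) in both. The following is a minimal sketch of that idea, not the authors' actual procedure. The function name `pair_concordance` and the input format (a dictionary mapping each SNP identifier to a haplotype label 0 or 1 within a single block) are assumptions made for illustration.

```python
from itertools import combinations


def pair_concordance(phase_a, phase_b):
    """Fraction of SNP pairs with the same cis/trans relation in two phasings.

    phase_a, phase_b: dicts mapping SNP id -> haplotype label (0 or 1)
    within one haplotype block (hypothetical input format).
    """
    shared = sorted(set(phase_a) & set(phase_b))
    agree = 0
    total = 0
    for snp1, snp2 in combinations(shared, 2):
        # A pair is "cis" when both SNPs carry the same haplotype label.
        cis_a = phase_a[snp1] == phase_a[snp2]
        cis_b = phase_b[snp1] == phase_b[snp2]
        total += 1
        if cis_a == cis_b:
            agree += 1
    if total == 0:
        raise ValueError("need at least two SNPs shared by both phasings")
    return agree / total
```

Comparing cis/trans relations rather than raw labels makes the measure insensitive to which haplotype happens to be called 0 or 1 in each dataset, since swapping both labels leaves every pairwise relation unchanged.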
First, a comparison of the haplotype blocks of tumors with those of normal tissues showed that the tumor haplotype blocks were fewer in number but longer in terms of N50 (a measure of contiguity), while containing a similar number of SNPs. This suggests that the tumor genomes had larger and more contiguous haplotype blocks, likely due to the clonal expansion of tumor cells and loss of heterozygosity in cancer genomes.
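For readers unfamiliar with N50, it is the length L such that blocks of length L or longer together account for at least half of the total length of all blocks. The sketch below computes it under that standard definition; the function name `n50` and the use of a plain list of block lengths in base pairs are illustrative choices, not taken from the study.

```python
def n50(block_lengths):
    """Return the N50 of a collection of block lengths (in base pairs)."""
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one block length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest block down until half the total length is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
```

With this definition, a set of fewer but longer blocks yields a higher N50 than many short blocks covering the same total length, which is the pattern described for the tumor genomes.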
Second, the association between haplotype block length and sequencing depth or read length was evaluated. Sequencing depth correlated positively with phase block length, suggesting that higher depth increases the likelihood of obtaining longer haplotype blocks. A strong correlation was also detected between the length of individual reads and the length of the constructed phase blocks, implying that read length plays a more significant role than sequencing depth in determining the resulting phase blocks.
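The text does not say which correlation statistic was used. As an illustration only, the sketch below computes the Pearson correlation coefficient between two per-sample measurements, for example a read-length summary and a phase block N50. The function name `pearson_r` and the choice of Pearson rather than a rank-based statistic are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Computing the coefficient once for depth against block length and once for read length against block length, across the same set of samples, would give two values whose magnitudes can be compared directly.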
Furthermore, the generation of haplotype blocks was assessed in all 20 cases. On average, 78% of the genomic regions were covered by phased blocks, indicating that a substantial portion of the genome could be phased. The remaining 22% of the genomic regions, referred to as low-coverage regions, were not adequately covered by phased blocks. These low-coverage regions were primarily associated with regions containing few heterozygous SNPs, suggesting that low-diversity or homozygous regions present challenges for accurate haplotype phasing.
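A coverage figure like the 78% above can be understood as the total length of the genome spanned by phased blocks divided by the genome length. The sketch below shows one way to compute such a fraction, merging overlapping blocks so that no base is counted twice. The function name `phased_fraction` and the half-open `(start, end)` coordinate convention on a single sequence are assumptions for illustration, not details from the study.

```python
def phased_fraction(blocks, genome_length):
    """Fraction of a sequence covered by phased blocks.

    blocks: iterable of (start, end) half-open intervals (hypothetical format).
    genome_length: total length of the sequence, in the same units.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(blocks):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching block: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length
```

The complement, `1 - phased_fraction(...)`, corresponds to the unphased share of the sequence, which the text attributes mainly to regions with few heterozygous SNPs.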
Overall, these findings highlight the impact of sequencing depth and read length on the generation of haplotype blocks and emphasize the influence of tumor characteristics on the composition and contiguity of haplotypes in cancer genomes.