Digital RNA Sequencing (UMI-RNA-Seq)

Introduction to Digital RNA Sequencing

RNA sequencing (RNA-seq) is the premier tool for mapping and quantifying transcriptomes using next-generation sequencing (NGS) technology. The transcriptome is the complete set of transcripts in a cell and provides transcript-level information for a specific developmental stage or physiological condition. In conventional RNA-seq, however, PCR amplification bias causes different cDNA fragments to be amplified unevenly. Easily amplified fragments are greatly enriched during the sequencing process, while low-abundance fragments or fragments with severe base bias may be lost entirely, which ultimately affects the accuracy of the sequencing results. Conventional RNA-seq therefore reveals only the general trend of gene expression and cannot provide absolute quantification of the original gene expression level.

With UMIs (Unique Molecular Identifiers), every single molecule is labeled before library construction, so that each molecule carries a unique sequence tag. Ideally, each template molecule can then be identified by its unique combination of UMI sequence and template sequence. After PCR amplification, PCR copies can be identified and removed from the dataset, so uneven amplification and artifacts generated during PCR can be almost completely eliminated. Quantification based on UMI counts is therefore more accurate. Combined with NGS, UMIs make it possible to study the gene expression of a given species and tissue in a specific spatiotemporal state with high precision in both sequence and quantification.

This is especially important for diagnostics as well as for applications with low amounts of starting material. UMIs are highly recommended for the analysis of nucleotide populations containing numerous similar sequences, such as small RNAs, ChIP-Seq tags, aptamers, RAD-Seq tags, or GBS tags.

You can learn more about UMI RNA-Seq through our article "UMI RNA-Seq: An Effective Method for Eliminating PCR Bias".

Advantages of Digital RNA Sequencing

Low starting amounts: 100 ng of input can achieve the same sequencing result as a 1 μg sample, which makes the method well suited to rare and precious samples;
UMI technology: a UMI is added to each cDNA fragment to remove PCR amplification bias, truly reflect transcript abundance, and achieve accurate and unbiased quantification;
Improved transcriptome sequencing quality: expression levels and differential expression are analyzed accurately, and RNA editing, alternative splicing, and SNP sites are pinpointed accurately;
Suitable for multiple RNA-Seq applications.

Digital RNA Sequencing Workflow

Digital RNA sequencing, also known as digital RNA-seq or UMI-RNA-Seq, is an absolute quantitative transcriptome sequencing technology that adds a unique molecular identifier (UMI) to each cDNA fragment before library amplification. The UMI accompanies the fragment through amplification, sequencing, and analysis. After sequencing, the UMI is used to trace the source of each fragment and to merge fragments from the same source (those with the same sequence and UMI), which removes PCR amplification duplicates and restores the original state of the sample before amplification. In this process, errors introduced during PCR amplification and sequencing can also be corrected.

Figure 1. Transcriptome sequencing (left) and UMI-RNA-Seq (right)
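
The merging step described in the workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production pipeline: reads are assumed to be already reduced to hypothetical (UMI, fragment sequence) pairs, and reads from the same source are assumed to match exactly on both fields. The function name and the toy reads are made up for this example.

```python
from collections import defaultdict

def collapse_duplicates(reads):
    """Group reads by (UMI, fragment sequence).

    `reads` is an iterable of (umi, sequence) pairs; this layout is an
    assumption made for the sketch. Returns a dict that maps each
    (umi, sequence) family to the number of reads it contains.
    """
    families = defaultdict(int)
    for umi, sequence in reads:
        families[(umi, sequence)] += 1
    return dict(families)

# Toy data (hypothetical): five reads derived from three original molecules.
reads = [
    ("ACGT", "GGCATTA"),
    ("ACGT", "GGCATTA"),  # PCR copy of the first read
    ("ACGT", "GGCATTA"),  # another PCR copy
    ("TTAG", "GGCATTA"),  # same fragment sequence, different original molecule
    ("ACGT", "CCTAGGA"),  # same UMI, different fragment
]

families = collapse_duplicates(reads)
print(len(reads), "reads ->", len(families), "original molecules")
```

Each key in `families` stands for one original molecule, so the number of keys, rather than the number of reads, is the quantity used for counting.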

Service Specification

Sample Requirements

Fresh animal tissue dry weight: ≥ 30 mg
Fresh plant tissue dry weight: ≥ 100 mg
Cell number: ≥ 2×10⁵ cells
Whole blood: ≥ 1 mL
Total RNA: ≥ 2 μg (minimum quantity: 1 μg), concentration ≥ 50 ng/μL
OD A260/A280 ratio ≥ 1.8, A260/A230 ratio ≥ 1.8, RIN ≥ 6
All total RNA samples should be DNA-free
RNA should be stored in nuclease-free water or RNA Stable
Note: Sample amounts are listed for reference only. For detailed information, please contact us with your customized requests.

Sequencing Strategies

Illumina PE150, 6G per sample

Data Analysis

We provide multiple customized bioinformatics analyses:
Gene expression quantification
Distribution of reads
Samples correlation analysis
Differential expression analysis
GO enrichment analysis
KEGG pathway enrichment of differentially expressed genes
Transcription factors analysis
Co-expression network analysis
Note: Recommended data outputs and analysis contents displayed are for reference only. For detailed information, please contact us with your customized requests.
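
The numeric total RNA requirements listed under Sample Requirements can be expressed as a simple pre-submission check. The sketch below is only an illustration: the class name, field names, and example values are hypothetical, and it uses the 1 μg minimum quantity as the hard limit. Requirements that are not numeric, such as DNA-free RNA, are not covered.

```python
from dataclasses import dataclass

@dataclass
class TotalRnaSample:
    """Measured properties of one total RNA sample (field names are hypothetical)."""
    total_ug: float
    concentration_ng_per_ul: float
    a260_a280: float
    a260_a230: float
    rin: float

def qc_failures(sample):
    """Return a list of the numeric requirements that the sample does not meet."""
    checks = [
        (sample.total_ug >= 1.0, "total RNA below the 1 ug minimum"),
        (sample.concentration_ng_per_ul >= 50.0, "concentration below 50 ng/uL"),
        (sample.a260_a280 >= 1.8, "A260/A280 below 1.8"),
        (sample.a260_a230 >= 1.8, "A260/A230 below 1.8"),
        (sample.rin >= 6.0, "RIN below 6"),
    ]
    return [message for passed, message in checks if not passed]

# Made-up example values for illustration only.
good = TotalRnaSample(2.5, 80.0, 2.0, 2.1, 8.2)
degraded = TotalRnaSample(2.5, 80.0, 2.0, 2.1, 4.5)

print(qc_failures(good))      # []
print(qc_failures(degraded))  # ['RIN below 6']
```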

Analysis Pipeline

The Data Analysis Pipeline of UMI-RNA-Seq.

Deliverables

The original sequencing data
Experimental results
Data analysis report
Details of digital RNA sequencing for your writing (customization)

CD Genomics provides a full digital RNA sequencing service package, including sample quality control, library construction, deep sequencing, raw data quality control, DEG analysis, and customized bioinformatics analysis. We can tailor this pipeline to your research interest. Please do not hesitate to contact us if you have any questions about our service.

Partial results are shown below:

Sequencing Quality Distribution

A/T/G/C Distribution

IGV Browser Interface

Correlation Analysis Between Samples

PCA Score Plot

Venn Diagram

Volcano Plot

Statistics Results of GO Annotation

KEGG Classification

1. What is a UMI?

A Unique Molecular Identifier (UMI) is a molecular barcode used to correct errors during sequencing, thereby enhancing accuracy. These barcodes consist of short sequences that uniquely label each molecule in the sample library. UMIs are useful in various sequencing applications, many of which involve PCR duplicates in DNA and cDNA. They are employed in RNA-seq gene expression analysis and other quantitative sequencing methods to eliminate redundancies. UMIs are used in both second- and third-generation sequencing technologies to improve data quality and reduce artifacts.

2. What is the working principle of UMIs?

UMIs involve adding a unique barcode to each molecule in a given sample library. By attaching distinct barcodes to individual original DNA fragments, the method distinguishes true variant alleles present in the original sample from errors introduced during library preparation, target enrichment, or sequencing processes.
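
One common way to separate true sequence from library preparation or sequencing errors is to build a consensus from all reads that share a UMI. The sketch below is a simplified illustration that assumes the reads in a family have equal length and are already aligned position by position; the function name and the example family are hypothetical.

```python
from collections import Counter

def consensus(family):
    """Majority-vote consensus over equal-length reads that share one UMI."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base at this position
        for column in zip(*family)
    )

# Toy UMI family (hypothetical): the third read carries one error (T -> A).
family = ["ACGTTGCA", "ACGTTGCA", "ACGATGCA", "ACGTTGCA"]
print(consensus(family))  # ACGTTGCA
```

An error that appears in only a minority of reads within a family is outvoted, whereas a true variant is present in every read derived from that original molecule.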

3. Why are UMIs valuable in molecular studies?

UMIs play a crucial role in enabling absolute quantification of molecules within a sample without the need to detect each individual molecule or discern their respective copies. While determining the copy number of each sequence can be challenging, evaluating distinct UMI sequences is more straightforward, preserving this information throughout the amplification process. Additionally, normalization of RNA-Seq datasets of this kind can be achieved without compromising accuracy.
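
The difference between counting reads and counting distinct UMIs can be shown with a small example. This is a sketch under simplifying assumptions: each read is represented as a hypothetical (gene, UMI) pair, and sequencing errors within UMIs are ignored.

```python
from collections import defaultdict

def reads_and_molecules(records):
    """Return {gene: (read_count, distinct_umi_count)} from (gene, umi) pairs."""
    read_counts = defaultdict(int)
    umis = defaultdict(set)
    for gene, umi in records:
        read_counts[gene] += 1
        umis[gene].add(umi)
    return {gene: (read_counts[gene], len(umis[gene])) for gene in read_counts}

# Toy data (hypothetical): geneA was amplified more strongly than geneB.
records = (
    [("geneA", "AAAC")] * 5
    + [("geneA", "GGTT")]
    + [("geneB", "ACGT"), ("geneB", "TTGA"), ("geneB", "CCAG")]
)

for gene, (n_reads, n_molecules) in sorted(reads_and_molecules(records).items()):
    print(gene, "reads:", n_reads, "molecules:", n_molecules)
```

By read count geneA looks twice as abundant as geneB, while the UMI counts indicate two original molecules for geneA and three for geneB.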

4. When are UMIs useful in RNA-Seq library preparation?

UMIs are primarily employed to mitigate PCR duplicates, aiming to decrease amplification bias and calculate the number of genes/transcripts expressed in a single cell. Hence, UMIs prove most beneficial when the input quantity is limited (ranging from single-cell level to total RNA input amounts lower than 10 ng), while at higher input levels, UMIs may not yield significant advantages, as the number of RNA molecules exceeds the potential UMI sequences.
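
The limit set by the number of potential UMI sequences can be estimated with a short calculation. The sketch below assumes that UMIs are fully random, that each of the four bases is equally likely at every position, and that all molecules considered are identical fragments that can only be told apart by their UMI. Real libraries deviate from these assumptions, and the function names are made up for this example.

```python
def umi_space(length):
    """Number of distinct UMI sequences of a given length (4 bases per position)."""
    return 4 ** length

def expected_distinct_umis(molecules, length):
    """Expected number of different UMIs seen when `molecules` identical
    fragments each receive one UMI drawn uniformly at random."""
    n = umi_space(length)
    return n * (1 - (1 - 1 / n) ** molecules)

for length in (6, 8, 10, 12):
    print(length, "nt UMI:", umi_space(length), "possible sequences")

# Hypothetical example: 1,000 identical fragments tagged with 6-nt UMIs.
distinct = expected_distinct_umis(1_000, 6)
print(f"about {distinct:.0f} distinct UMIs expected for 1000 molecules")
```

When the expected number of distinct UMIs falls clearly below the number of molecules, different molecules share a UMI and are collapsed together, so the molecule count is underestimated.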

UMIs can indicate over-sequencing, for instance, when the sequencing depth is disproportionately high compared to the library complexity. Although over-sequencing in itself is not detrimental, avoiding it can lower costs and free up space for including more replicate samples to enhance statistical power. UMIs also aid in assessing accessible transcripts, such as RNA extracted from FFPE samples which exhibit high heterogeneity due to cross-linking, resulting in varying numbers of accessible transcripts per sample.
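
A simple indicator of over-sequencing is the fraction of reads that are duplicates after UMI deduplication. The sketch below uses made-up numbers; the function name is hypothetical, and what counts as "too high" depends on the experiment.

```python
def duplication_rate(total_reads, unique_molecules):
    """Fraction of reads that are duplicates of an already observed molecule."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= unique_molecules <= total_reads:
        raise ValueError("unique_molecules must lie between 0 and total_reads")
    return 1 - unique_molecules / total_reads

# Hypothetical library: 10 million reads collapse to 2.5 million UMI-distinct molecules.
rate = duplication_rate(10_000_000, 2_500_000)
print(f"duplication rate: {rate:.0%}")  # duplication rate: 75%
```

A high rate means that additional depth mostly re-reads molecules that have already been seen, which is the situation in which sequencing capacity may be better spent on more replicates.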

The library preparation method incorporating UMIs is universal and suitable for all sample types. Processing and deduplicating UMI data is optional and can be skipped for high-input samples to conserve computational resources.

5. What's the difference between barcode and UMI?

Barcodes and UMIs serve distinct purposes in high-throughput sequencing. Barcodes are predefined sequences added to different samples to enable their identification and differentiation after pooled sequencing. Each sample gets a unique barcode, allowing researchers to attribute sequencing reads back to the correct sample. UMIs, on the other hand, are random short sequences (typically 6-12 nucleotides) added to individual RNA molecules during cDNA synthesis. UMIs uniquely tag each original RNA molecule, allowing for the identification and removal of PCR duplicates during data analysis, thus providing accurate molecular quantification by mitigating amplification bias.
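
The two roles can be illustrated with a toy read parser. The read layout used here is purely an assumption for the example (a 6-nt sample barcode, followed by an 8-nt UMI, followed by the cDNA insert); actual layouts depend on the library preparation kit, and sample barcodes are often read in a separate index read. The barcode table and all names are hypothetical.

```python
# Hypothetical lookup table: predefined barcode -> sample name.
SAMPLE_BARCODES = {"ATCACG": "sample_1", "CGATGT": "sample_2"}
BARCODE_LEN = 6  # assumed barcode length
UMI_LEN = 8      # assumed UMI length

def split_read(read):
    """Split a read into (sample, umi, insert) under the assumed layout.

    The barcode is looked up in a predefined table, whereas the UMI is a
    random sequence that is kept as-is. `sample` is None for unknown barcodes.
    """
    barcode = read[:BARCODE_LEN]
    umi = read[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    insert = read[BARCODE_LEN + UMI_LEN:]
    return SAMPLE_BARCODES.get(barcode), umi, insert

sample, umi, insert = split_read("ATCACG" + "AACCGGTT" + "GGCATTAGC")
print(sample, umi, insert)  # sample_1 AACCGGTT GGCATTAGC
```

The barcode assigns the read to a sample, and the UMI is then used within that sample to recognize reads that derive from the same original molecule.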

Integrated transcriptome and proteome analysis provides new insights into camptothecin biosynthesis and regulation in Camptotheca acuminata

Journal: Physiologia Plantarum
Impact factor: 6.49
Published: 24 April 2023

Abstract

Camptothecin (CPT), mainly extracted from Camptotheca acuminata, is garnering increasing attention due to its significant anti-tumor activity. Numerous derivatives of CPT are clinically used as potent anticancer agents. However, the biosynthetic mechanism of these compounds remains obscure. Unveiling this pathway to replace the current inefficient plant-derived methods would greatly contribute to the development of alternative means of CPT production.

The study employed UMI RNA-seq, ONT long-read transcriptome sequencing, and label-free quantification (LFQ) proteomics methods to unveil representative transcriptomic and proteomic profiles across various tissues of C. acuminata. Comprehensive analysis of multi-omics data provided new insights into the biosynthesis of CPT and its differential regulation across different tissues. The research identified genes regulated by tissue-specific alternative splicing (AS) and screened candidate enzymes and TFs involved in the biosynthetic pathway of CPT.

Materials & Methods

Sample Preparation:

Camptotheca seedlings
Camptotheca young leaves
RNA extraction

Sequencing:

Illumina UMI RNA-seq sequencing
Nanopore full-length transcriptome sequencing
Label-free quantitative proteomics

Data Analysis:

Phylogenetic analysis
Gene differential analysis
Correlation analysis
GO enrichment analysis

Results

1. Overview of Camptothecin (CPT) Transcriptome and Proteome in Camptotheca

Illumina UMI RNA-seq across 10 tissues yielded a total of 6.03 billion deduplicated Unique Identifier (UID) reads for precise gene quantification, offering an accurate expression profile for exploring CPT biosynthesis and its regulation.

ONT full-length transcriptome sequencing from 5 C. acuminata tissues produced a total of 80.79 million clean reads, with an average read length of 1033-1293bp per sample (Table 1). Over 73% of the clean reads were identified as Full-Length Non-Chimeric (FLNC) reads. A total of 42,810 shared transcripts were identified, providing a rich resource for investigating post-transcriptional regulation in this species.

Table 1 ONT Reads Statistics


Through protein data analysis, a total of 9,558 proteins were identified, with 7,854 proteins quantifiable (Figure 1B). The proteomic data offer insights into dynamic protein expression profiles, elucidating the translational regulatory mechanisms involved in CPT synthesis.

Figure 1. Integrated proteomics results of five C. acuminata tissues. (Zhang et al., 2023)

2. Dynamic changes in gene expression across different C. acuminata tissues

Differential analysis was conducted on 30,991 genes identified through UMI RNA-seq, revealing a high heterogeneity in the number of Differentially Expressed Genes (DEGs) across different tissues, especially in YL (Figure 2A). The most significant differences were observed between YL and TR, with a total of 15,041 DEGs detected. This indicates that YL may be involved in more diverse biological processes compared to other tissues. Correlation analysis between transcriptomics and proteomics revealed a moderate relationship between the two datasets (Figure 2B). GO enrichment analysis uncovered genes/proteins enriched in all five tissues, classified as hub genes (Figure 2C). Conversely, genes/proteins preferentially expressed in individual tissues are typically associated with specific functions unique to those tissues.

Figure 2. Integrated analysis of transcriptomics and proteomics. (Zhang et al., 2023)

Through a combined analysis of transcriptomic and proteomic data, the expression patterns of genes involved in CPT biosynthesis were elucidated (Figure 3). At the transcriptional level, 10OMT and STR exhibited the highest expression levels in E and YL, while their respective protein abundances were highest in stems. This observation suggests potential intricate hierarchical regulation of CPT biosynthesis at post-transcriptional and translational levels.

Figure 3. Expression patterns of genes and proteins involved in CPT biosynthesis and AS events in different C. acuminata tissues. (Zhang et al., 2023)

Alternative splicing (AS) is a crucial mechanism that regulates gene expression and generates protein diversity. Through analysis of the Oxford Nanopore Technologies (ONT) full-length transcriptome, 5692 AS events from 4746 genes were identified. Intron retention (IR) emerged as the most common AS type, consistent with previous reports. Nevertheless, the abundance of AS events varied significantly across different tissues (Figure 4B). IR was most abundant in the leaves (ML), exon skipping (ES) predominated in the flowers, and 3' alternative splicing (AAs) was most enriched in the roots (LR), stems (S), and bark (YL).

Figure 4. Analysis of AS events. (Zhang et al., 2023)

Through an exhaustive analysis of the co-expression network of all alternative splicing (AS) genes, a total of 14 distinct modules were identified, with 5 modules harboring tissue-specific AS genes. Gene Ontology (GO) enrichment analysis highlighted the regulatory role of AS in tissue-specific biological processes. Furthermore, key genes involved in CPT biosynthesis, including GES, 10HGO, STR1, and 10OMT, underwent AS events. For instance, 10OMT exhibited high expression of the initial transcript in stems, whereas another transcript was predominantly expressed in the stem. Both transcripts of 10HGO were specifically expressed in roots, while the two transcripts of STR1 were specific to YL. These findings suggest that AS plays a role in regulating the tissue-specific expression of CPT biosynthesis-related genes. Additionally, the authors validated AS events of 6 genes in 5 tissues using RT-PCR (Figure 4C).

3. Hypothetical Transcription Factors Regulating CPT Biosynthesis

Transcription factors (TFs) act as critical molecular switches that respond to internal and external signals regulating secondary metabolism. In this study, 2217 genes encoding TFs were identified using UMI RNA-seq, categorized into 52 TF families, with the largest family being AP2/ERF. Utilizing whole-genome co-expression network analysis of TFs and their co-expressed target genes, putative transcription factors that may regulate CPT biosynthesis were identified.
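
The idea behind co-expression screening can be sketched as follows. This is not the method used in the study, whose network construction is not detailed here; it is a minimal illustration that assumes Pearson correlation across tissues as the co-expression measure and an arbitrary cutoff of 0.9. All gene names and expression values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def coexpressed_tfs(tf_profiles, target_profile, threshold=0.9):
    """Names of TFs whose profile correlates with the target at or above the cutoff."""
    return sorted(
        name
        for name, profile in tf_profiles.items()
        if pearson(profile, target_profile) >= threshold
    )

# Hypothetical expression values across five tissues.
target = [1.0, 2.0, 3.0, 4.0, 5.0]       # a pathway gene of interest
tf_profiles = {
    "TF_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the target
    "TF_b": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the target rises
    "TF_c": [1.0, 3.0, 2.0, 5.0, 4.0],   # loosely follows the target
}

print(coexpressed_tfs(tf_profiles, target))  # ['TF_a']
```

Candidates selected in this way are only correlated with the target gene, which is why experimental validation such as qRT-PCR is still needed.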

Figures 5A-C depict the correlation networks between GES, 10HGO, STR1, STR2, and their associated TFs. Twenty-one candidate transcription factors highly co-expressed with these three reference genes were selected for qRT-PCR validation. All except C2H2-46 exhibited significant co-expression with GES (Figure 5D). Notably, AP2/ERF186, CPP4, C3H7, CPP8, C2C2-YABBY7, and B3-3 showed significant co-expression with STR and interestingly, with 10HGO as well. Fifteen transcription factors were screened as potential regulators of GES, 10HGO, and STR, yet further validation is required to ascertain the predictive roles of these transcription factors in regulating CPT biosynthesis.

Figure 5. Screening of potential candidate transcription factors regulating CPT biosynthesis. (Zhang et al., 2023)

Conclusion

The study presents representative transcriptome and proteome profiles of C. acuminata across various tissues. Through an integrated analysis of multi-omics data, it offers new insights into the biosynthesis of camptothecin (CPT) and its tissue-specific hierarchical regulation. The research identifies genes regulated by tissue-specific alternative splicing (AS) and screens candidate enzymes and transcription factors (TFs) involved in the CPT biosynthetic pathway. This study provides valuable resources for understanding CPT biosynthesis in C. acuminata, facilitating future research on the production of CPT or its derivatives through synthetic biology or plant metabolic engineering.

Reference:

Zhang H, Shen X, Sun S, et al. Integrated transcriptome and proteome analysis provides new insights into camptothecin biosynthesis and regulation in Camptotheca acuminata. Physiologia Plantarum, 2023, 175(3): e13916.