RNA-seq is perhaps the most widely used method for transcript quantification, employed to measure and compare the abundance of each transcript across cell types or conditions. Library construction for RNA sequencing requires PCR amplification to generate enough DNA molecules for sequencing. However, not all fragments amplify at the same rate; fragment length, GC content, and starting concentration all affect amplification efficiency. As a result, easily amplified fragments are over-enriched, while low-abundance fragments or fragments with unfavorable base composition are under-enriched or even lost, which compromises the accuracy of the sequencing results. This PCR bias can cause certain transcripts to be overrepresented in the final sequencing library. The problem of PCR duplicates grows with the number of PCR cycles, as is the case in single-cell RNA-seq. UMIs offer an effective way to reduce the impact of PCR bias, allowing more accurate quantification of gene expression. UMI RNA-seq, also known as digital RNA-seq, is widely applied in academic and clinical research.
Figure 1. The use of UMIs in NGS libraries (Roloff et al. 2017).
The fundamental principle of UMI RNA-Seq is to attach a unique sequence label, known as a Unique Molecular Identifier (UMI), to each cDNA molecule. The UMIs are then amplified together with the cDNA sequences. After sequencing, the UMIs are identified, which allows duplicate fragments carrying the same UMI to be consolidated. This step removes unwanted duplication caused by uneven PCR amplification or by the sequencing instrument. The result approximates a one-to-one reconstruction of the sample as it was before amplification, enabling accurate quantification of transcript numbers in the initial sample.
Figure 2. Experimental Principles of UMI RNA-Seq.
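The effect of uneven amplification, and the way UMIs correct for it, can be illustrated with a small worked example. This is a toy sketch under stated assumptions, not a model of real PCR: the gene names, UMI labels, per-cycle efficiencies, and cycle number are all hypothetical, and each molecule is assumed to be copied deterministically.

```python
from collections import Counter


def amplify(molecules, efficiency, cycles):
    """Return reads after idealized PCR.

    Every molecule is copied round((1 + efficiency) ** cycles) times.
    Real PCR is stochastic; this deterministic rule is a simplification.
    """
    copies = round((1 + efficiency) ** cycles)
    return [molecule for molecule in molecules for _ in range(copies)]


# Hypothetical starting sample: 10 molecules of each gene, each with its own UMI.
gene_a = [("geneA", f"UMI{i:03d}") for i in range(10)]
gene_b = [("geneB", f"UMI{i:03d}") for i in range(10)]

# Assumed efficiencies: geneA fragments amplify more readily than geneB fragments.
reads = amplify(gene_a, efficiency=0.9, cycles=5) + amplify(gene_b, efficiency=0.6, cycles=5)

# Counting reads inherits the amplification bias.
read_counts = Counter(gene for gene, _ in reads)

# Counting distinct (gene, UMI) pairs recovers the original molecule numbers.
umi_counts = Counter(gene for gene, _ in set(reads))

print(dict(read_counts))  # {'geneA': 250, 'geneB': 100}
print(dict(umi_counts))
```

Although both genes start with the same number of molecules, the read counts differ by a factor of 2.5, whereas the UMI counts are equal for the two genes.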
Unique Molecular Identifiers (UMIs), also referred to as molecular barcodes or random barcodes, are short random nucleotide sequences that serve as distinctive labels appended to each molecule in a sample. UMIs are usually designed as fully random nucleotide strings (e.g., NNNNNN), partially degenerate nucleotide strings (such as NNNRNYN, where N is any of A, T, C, or G, R is a purine (A or G), and Y is a pyrimidine (C or T)), or fixed nucleotide strings when template molecules are limited in number and diversity. UMIs are incorporated during library construction, before amplification of the final library fragments.
Figure 3. UMI in the form of short sequences of partially degenerate nucleotides.
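As a rough illustration of how such design patterns constrain the UMI pool, the sketch below counts the distinct sequences a pattern allows and checks whether an observed UMI conforms to it. It covers only the IUPAC codes used in the patterns above (N, R, Y, plus the four plain bases); the function names are hypothetical.

```python
import re

# Subset of the IUPAC nucleotide codes: R = purine, Y = pyrimidine, N = any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def umi_diversity(pattern):
    """Number of distinct UMI sequences that a design pattern can produce."""
    total = 1
    for code in pattern:
        total *= len(IUPAC[code])
    return total


def umi_matches(pattern, umi):
    """True if an observed UMI is a valid instance of the design pattern."""
    regex = "".join(f"[{IUPAC[code]}]" for code in pattern)
    return re.fullmatch(regex, umi) is not None


print(umi_diversity("NNNNNN"))            # 4096
print(umi_diversity("NNNRNYN"))           # 4096
print(umi_matches("NNNRNYN", "ACGATCA"))  # True
print(umi_matches("NNNRNYN", "ACGCTCA"))  # False
```

The seven-base pattern NNNRNYN yields the same number of sequences as six fully random bases, because each of the R and Y positions allows only two bases. In exchange, an observed UMI that violates the pattern can be flagged as an error.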
UMIs are increasingly used to reliably identify PCR duplicates in Next-Generation Sequencing (NGS) experiments, most notably in RNA-seq. UMIs are complex indices added to each fragment in a sample prior to PCR amplification during library preparation. This makes it possible to identify the source molecule of each read and to detect true PCR duplicates accurately, since these share both the same UMI sequence and the same alignment coordinates. UMIs are applicable to a variety of sequencing methods that demand precise quantification, detection of rare mutations, or low-input quantities, such as RNA-seq, single-cell RNA-seq, and immune repertoire sequencing.
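The duplicate criterion described above (same UMI and same alignment coordinates) can be sketched as follows. The `Read` record and its fields are hypothetical stand-ins for what would normally be parsed from an alignment file, and exact UMI matching is assumed, with no tolerance for sequencing errors in the UMI itself.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Minimal, hypothetical representation of an aligned read."""
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str


def deduplicate(reads):
    """Keep one read per (chromosome, position, strand, UMI) combination."""
    kept = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        kept.setdefault(key, read)  # the first read seen for a key is retained
    return list(kept.values())


reads = [
    Read("r1", "chr1", 100, "+", "ACGT"),
    Read("r2", "chr1", 100, "+", "ACGT"),  # PCR duplicate of r1
    Read("r3", "chr1", 100, "+", "TTGA"),  # same position, different molecule
    Read("r4", "chr1", 250, "+", "ACGT"),  # same UMI, different position
]

print([read.name for read in deduplicate(reads)])  # ['r1', 'r3', 'r4']
```

Without the UMI, r3 would be removed as a positional duplicate of r1 even though it comes from a different original molecule.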
The underlying principle of UMI technology involves adding a unique short tag sequence of 3-8 base pairs to each original DNA molecule prior to PCR amplification. The tags are then sequenced together with the molecules after library construction and PCR amplification. Reads with different molecular barcodes represent different original DNA molecules, while reads with the same barcode are PCR copies of the same original molecule. In this way, different DNA templates can be distinguished by their tag sequences. Although molecular barcoding cannot prevent the production of PCR duplicates, it helps users track these duplicates and remove them during downstream analysis. Moreover, molecular barcodes make it possible to differentiate PCR-induced false positives from genuine variants in the original molecules, enabling mutation detection even when the variant allele frequency (VAF) is extremely low.
The underlying principle of UMIs is well illustrated by the theory of double-strand molecular tags presented by Michael W. Schmitt et al. (see Figure 4).
Figure 4. Principle of UMI identification of erroneous mutations. (Schmitt et al., 2012)
In the given diagram, α and β represent random Unique Molecular Identifier (UMI) sequences attached to the DNA template. The αβ-oriented tag indicates the sense strand of the DNA, while the βα-oriented tag denotes the antisense strand.
Three varying scenarios are represented by i, ii, and iii:
(i) Mutations that occur only in one or a few sense strands (represented by pink dots) are indicative of sequencing errors or errors introduced by Polymerase Chain Reaction (PCR) amplification, particularly during the later stages of PCR amplification.
(ii) Mutations that occur in all sense strands (denoted by yellow dots) derive from errors introduced in the first round of PCR, which may arise when the polymerase copies across a damaged DNA site.
(iii) Numerous mutation sites appear on both strands of the DNA fragment. However, only the mutation denoted by the green dot, simultaneously present at corresponding positions on both sense and antisense strands, is determined to be an inherent mutation in the sample. Other mutation sites (represented by blue, purple, and brown dots) that exclusively occur on the sense or antisense strands without simultaneously appearing at corresponding positions on both strands are deemed false-positive mutations.
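The logic of the three scenarios can be sketched as a two-stage consensus. This is a simplified illustration rather than the published pipeline: it assumes that reads are already grouped by tag, that antisense reads have been converted to the same orientation as the sense reads, that all reads have equal length, and that a 70% majority threshold is appropriate. The function names and example sequences are hypothetical.

```python
from collections import Counter


def strand_consensus(reads, min_fraction=0.7):
    """Consensus of one strand family; positions without a clear majority become 'N'."""
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


def duplex_consensus(sense_reads, antisense_reads, min_fraction=0.7):
    """Keep a base only where the two single-strand consensuses agree."""
    top = strand_consensus(sense_reads, min_fraction)
    bottom = strand_consensus(antisense_reads, min_fraction)
    return "".join(a if a == b else "N" for a, b in zip(top, bottom))


# Hypothetical reference: ACGTACGT
sense = [
    "ACTTACGA",
    "ACTTACGA",
    "ACTTACGA",
    "ACTTAGGA",  # late PCR or sequencing error at index 5, scenario (i)
]
antisense = ["ACTTACGT", "ACTTACGT", "ACTTACGT"]

print(strand_consensus(sense))             # ACTTACGA
print(duplex_consensus(sense, antisense))  # ACTTACGN
```

The isolated error at index 5 is outvoted within the sense family (scenario i). The change at index 7 is present in every sense read but absent from the antisense family, so it is masked (scenario ii). Only the change at index 2, which is supported by both strands, survives as a candidate true mutation (scenario iii).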
UMI and UDI (Unique Dual Index) are both identification technologies used in sequencing library preparation, albeit with different designs and applications. UMIs primarily serve to identify the original copies of RNA or DNA molecules, thereby mitigating biases introduced by PCR amplification and duplicate reads. In contrast, UDIs are predominantly used to label different samples or experimental conditions, distinguishing data from different sources. While a UMI is a unique identifier for each RNA or DNA molecule, a UDI is a unique identifier for an entire sequencing sample. UMIs typically consist of random nucleotide sequences, whereas UDIs comprise a pair of specific, predefined index sequences.
Table 1. Differences Between UMI and UDI
Feature | UMI | UDI |
Definition | A unique identifier attached to each RNA or DNA molecule | Dual index sequences used to distinguish different samples |
Application | RNA sequencing, DNA sequencing, especially in single-cell and low-input samples | Commonly used in multi-sample or multi-condition sequencing experiments to label the origin of different samples |
Sequence characteristics | Typically random sequences of 8 to 12 nucleotides | Consists of two index sequences whose combination is unique to one sample or experimental condition |
Main purpose | Eliminate biases and duplicate sequences introduced during PCR amplification | Differentiate between samples to identify and analyze data from different sources in mixed sequencing experiments |
Field of application | Single-cell RNA sequencing, low-input sample analysis, etc. | Multi-sample sequencing, mixed sequencing experiments, etc. |
Comparing UMI and UDI can assist researchers in better understanding their design principles, scope of applications, and advantages and disadvantages, thus enabling the informed selection of identification technologies that best suit experimental needs. Specifically, comparing their differences and characteristics can aid in selecting appropriate experimental designs tailored to specific research objectives and sample properties. For instance, in single-cell RNA sequencing, UMIs are commonly used to label the original copies of each RNA molecule, while UDIs are often employed to identify single-cell samples from different sources. Furthermore, understanding the distinct features of UMI and UDI can help researchers choose suitable data processing and analysis methods, thereby enabling more accurate identification and quantification of RNA or DNA molecules, revealing intricacies and complexities within biological processes.
By making informed choices regarding UMI and UDI identification technologies, researchers can enhance experimental throughput and efficiency, reduce experimental costs, and mitigate risks of sample cross-contamination, thereby generating high-quality and reliable data.
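To make the division of labor between the two identifiers concrete, the sketch below first assigns each read to a sample by its index pair (the role of the UDI) and then counts distinct UMIs per gene within each sample (the role of the UMI). The index sequences, sample names, and read tuple layout are hypothetical; real demultiplexing is normally performed by the sequencing vendor's software.

```python
from collections import defaultdict

# Hypothetical sample sheet: each (i7, i5) index pair identifies exactly one sample.
SAMPLE_SHEET = {
    ("ATCACG", "TTAGGC"): "sample_1",
    ("CGATGT", "TGACCA"): "sample_2",
}


def count_molecules_per_sample(reads, sample_sheet):
    """Count distinct UMIs per (sample, gene).

    `reads` is an iterable of (i7, i5, gene, umi) tuples.
    Reads whose index pair is not in the sample sheet are discarded.
    """
    umis = defaultdict(set)
    for i7, i5, gene, umi in reads:
        sample = sample_sheet.get((i7, i5))
        if sample is None:
            continue  # unexpected index combination, cannot be assigned
        umis[(sample, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


reads = [
    ("ATCACG", "TTAGGC", "geneA", "AACCGGTT"),
    ("ATCACG", "TTAGGC", "geneA", "AACCGGTT"),  # PCR duplicate
    ("ATCACG", "TTAGGC", "geneA", "GGTTAACC"),
    ("CGATGT", "TGACCA", "geneA", "AACCGGTT"),  # same UMI, different sample
    ("ATCACG", "TGACCA", "geneA", "TTTTCCCC"),  # mismatched index pair
]

print(count_molecules_per_sample(reads, SAMPLE_SHEET))
# {('sample_1', 'geneA'): 2, ('sample_2', 'geneA'): 1}
```

The same UMI sequence can legitimately appear in two samples, because molecules are only compared within a sample once the index pair has resolved their origin.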
The workflow of UMI RNA-seq consists of RNA isolation, rRNA removal, cDNA library construction with UMI barcodes (Figure 5), library quality assessment, deep sequencing, and data analysis. In the library construction step, the rRNA-depleted RNA is fragmented and reverse transcribed into cDNA, UMI adapters are ligated, and the library is then amplified and subjected to QC.
Figure 5. UMI incorporation and library amplification in UMI RNA-seq experiments (Dixit 2016).
Here is an introduction to the main processes of UMI RNA-seq:
RNA Extraction: Total RNA should be isolated from the sample using a standard RNA extraction kit. To avoid degradation and contamination during RNA extraction, it is recommended to maintain RNA integrity by working on ice.
UMI Incorporation: Use a UMI incorporation kit to introduce UMI sequences into the RNA molecules. The UMI sequence must be joined correctly to the RNA so that it stays intact and unique throughout the process.
Library Construction: The UMI-tagged RNA is reverse transcribed into cDNA, preserving the UMI sequence. PCR amplification is then carried out to generate a sufficient quantity of cDNA. The amplified product is converted into a sequencing library by repairing the fragment ends and ligating adapters so that the sequencer can recognize the fragments. PCR conditions should be optimized and standardized so that the UMI-tagged cDNA is amplified efficiently. Several PCR cycles may be needed to obtain sufficient cDNA, but the cycle number should be kept as low as practical to limit contamination and bias.
Sequencing: The constructed library is then sequenced on a high-throughput platform such as an Illumina sequencer to obtain RNA sequencing data for each sample. Sequencing depth should be sufficient to cover the range of RNA abundances in the sample.
Data Analysis: This mainly comprises preprocessing, alignment to the reference genome, UMI counting, differential expression analysis, and functional analysis. Tools such as Trim Galore or Cutadapt are commonly used for quality control and trimming of raw sequencing data to remove low-quality and adapter sequences. Alignment tools such as STAR or HISAT2 then align the trimmed reads to a suitable reference genome or transcriptome. The most crucial step is UMI counting, which determines how many original RNA molecules each gene contributed in each sample. This usually involves tools such as UMI-tools for clustering and counting UMI sequences. Standard differential expression analysis is then performed on the UMI count data to identify genes with significantly different expression levels between conditions. The differentially expressed genes can subsequently undergo functional enrichment analysis, pathway analysis, and similar analyses to elucidate their potential functions and interactions in biological processes. Lastly, the results are visualized, for example as heat maps, volcano plots, or PCA plots, for intuitive display of the experimental outcomes.
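The UMI clustering and counting step can be sketched as below. This is a simplified illustration and not the implementation used by UMI-tools: it merges a low-count UMI into a more abundant UMI when the two differ by one base, on the assumption that the rarer sequence is a PCR or sequencing error. The count rule (`parent >= 2 * child - 1`), the function names, and the example data are assumptions made for this sketch.

```python
from collections import Counter, deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umi_counts, max_dist=1):
    """Group UMIs that plausibly derive from the same original molecule.

    Returns a list of clusters; the first element of each cluster is its
    most abundant UMI.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    ordered = sorted(umi_counts, key=lambda umi: (-umi_counts[umi], umi))
    assigned = set()
    clusters = []
    for seed in ordered:
        if seed in assigned:
            continue
        assigned.add(seed)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            parent = queue.popleft()
            for candidate in ordered:
                if candidate in assigned:
                    continue
                close = hamming(parent, candidate) <= max_dist
                # Assumed rule: absorb only clearly less abundant neighbours.
                rarer = umi_counts[parent] >= 2 * umi_counts[candidate] - 1
                if close and rarer:
                    assigned.add(candidate)
                    cluster.append(candidate)
                    queue.append(candidate)
        clusters.append(cluster)
    return clusters


def count_umis_per_gene(gene_umis):
    """Estimate molecule counts per gene from the UMIs of its aligned reads."""
    return {gene: len(cluster_umis(Counter(umis))) for gene, umis in gene_umis.items()}


# Hypothetical UMIs observed on reads assigned to each gene.
gene_umis = {
    "geneA": ["AAAA"] * 10 + ["AAAT"] * 1 + ["GGCC"] * 6,
    "geneB": ["ACGT"] * 4 + ["TTTT"] * 5,
}

print(count_umis_per_gene(gene_umis))  # {'geneA': 2, 'geneB': 2}
```

For geneA, the single AAAT read is treated as an error copy of AAAA, so 17 reads collapse to 2 molecules rather than 3. The resulting per-gene counts are the input to the differential expression step.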
UMIs offer not only precise quantification of initial molecule counts but also alleviate errors stemming from library preparation, sequencing, and the non-uniformity introduced by PCR amplification. Within the vast dataset of high-throughput sequencing, UMIs facilitate the discrimination of DNA fragments originating from the same source. Through the pairwise alignment of multiple DNA fragments from the same origin, UMIs effectively discern false-positive mutations introduced during library construction, thereby enhancing the sensitivity and specificity of mutation detection, particularly for ultra-low abundance mutations.
Precise Quantification of RNA Levels: Utilizing UMIs allows for the precise calculation of the copy number of each RNA molecule, thereby facilitating more accurate measurements of RNA expression levels.
Detection of Low-Expressed Genes: The capability of UMI technology to distinguish different copies of the same RNA molecule endows UMI RNA sequencing with heightened sensitivity in detecting low-expressed genes and transcripts with low abundance. This empowers researchers to conduct more reliable analyses of small RNA samples or explore the functions of low-expressed genes with greater confidence.
Mitigation of PCR Amplification Bias: Traditional RNA sequencing can introduce bias during PCR amplification, with some sequences over-amplified and others under-amplified. Because each original molecule carries its own UMI, reads that are merely PCR copies can be identified and collapsed, which makes the resulting data more faithful and reliable.
Handling RNA Degradation: Given that UMIs can mark the initial copies of RNA, they are extremely robust in addressing RNA degradation in RNA-Seq samples. This affords a consistent and reliable set of data.
Analysis at the Single-Molecule Level: In single-cell RNA sequencing, UMI technology holds particular significance. UMI sequences accurately tag each RNA molecule within individual cells, unveiling transcriptional features and cellular heterogeneity at the single-cell level, thus providing critical tools for single-cell research.
Data Reproducibility and Comparability: UMI RNA sequencing exhibits robust data reproducibility and comparability. Because UMI sequences are counted rather than raw reads, data from different samples can be compared with far less distortion from PCR amplification bias, although normalization for sequencing depth remains advisable.
Broad Applications: UMI RNA-seq technologies have found expansive applications in fundamental biological research, clinical diagnostics, and drug development among other domains. The prospects of its application in areas such as low input RNA sample analysis, single-cell transcriptomics, and drug screening are particularly vast.
Gene Expression Analysis: UMI RNA-Seq is widely used for gene expression analysis, including the exploration of gene expression patterns under different biological conditions, in different tissue types, or in different disease states.
Single-cell RNA Sequencing (scRNA-seq): Single-cell RNA sequencing technology affords the elucidation of cell types, states, and the heterogeneity of gene expression at the individual cell level. The application of UMI RNA-Seq in scRNA-seq enables accurate labeling of RNA molecules in each cell and high-resolution quantitative analysis of the transcriptome of single cells. This facilitates the identification of cell types, molecular markers, and gene expression patterns, and reveals mechanisms of cellular development, differentiation, and disease initiation.
Drug Screening and Mechanistic Studies of Drug Action: UMI RNA-Seq can be effectively utilized to assess the impact of therapeutics on gene expression, revealing the molecular mechanisms underpinning drug action. By comparing the transcriptomic profiles of drug-treated and control groups, it is possible to identify gene expression changes instigated or suppressed by pharmaceutical interventions. This, in turn, can shed light on the mechanism of action, side effects, and drug tolerance of pharmaceutical agents. Moreover, UMI RNA-Seq can also be employed for drug screening and monitoring therapeutic efficacy, providing a robust approach to evaluating how pharmacotherapies affect gene expression and monitoring treatment-associated changes in transcriptional activity.
Developmental Biology Research: UMI RNA-Seq represents a robust method for investigating the dynamic alterations in gene expression across divergent developmental stages and tissues. By comparing RNA expressions at varied points of time and tissue types, it permits the identification of expression patterns and regulatory networks of specific genes, thereby unravelling the molecular processes and signalling pathways inherent to developmental stages.
Disease Research: UMI RNA-Seq is also highly valuable for exploring the etiology and progression mechanisms of various diseases, including tumours, neurodegenerative diseases, and autoimmune disorders. By comparing RNA expression profiles between healthy and diseased groups, disease-related changes in gene expression can be identified, along with potential biomarkers. This ultimately offers fresh insights and opportunities for disease diagnosis, prognosis, and treatment strategies.
Functional Genomics Analysis: The UMI RNA-Seq methodology allows for the identification of potential functional genes and regulatory elements by analyzing transcriptional expression patterns, alternative splicing, and transcription start sites.
Research on Environmental Adaptation and Evolution: The application of UMI RNA-Seq in comparing RNA expression profiles under diverse environmental conditions facilitates recognition of gene expression alterations and molecular adaptive mechanisms linked to environmental adaptation. This approach elucidates the genetic foundations of biological adaptability and contributes to understanding the adaptive processes and evolutionary progression under varying environmental conditions.
References: