RNA sequencing, or RNA-Seq, is a powerful molecular biology technique that provides comprehensive insights into the transcriptome of an organism.
High-throughput RNA sequencing measures the transcript abundance of the genes within cells, revealing which genes are actively transcribed. For instance, gene3 might be more active in normal cells and gene2 more highly expressed in mutant cells, while gene1 is expressed at similar levels in both cell types.
For background on the technique, see the article "What is RNA Sequencing (RNA-Seq)?".
Reads are the sequences of nucleotide bases that a sequencer reports for the nucleic acid fragments in a sample, represented as strings like "ATCAGATA.....".
RNA-Seq reads are sequences obtained from RNA molecules in a sample, typically generated through RNA sequencing techniques. These reads represent fragments of RNA molecules and are crucial for understanding gene expression, alternative splicing, and other RNA-related processes. Similar to genomic sequencing reads, RNA-Seq reads also consist of nucleotide bases, and their analysis provides valuable insights into the transcriptome of an organism under specific conditions or treatments.
Read length is the number of bases in each read. For instance, a read described as 50 base pairs (bp) long consists of 50 sequenced bases in a single sequence.
Read depth refers to the number of reads obtained by sequencing a sample. It is often conflated with two related terms: genome sequencing coverage, which is the extent of the genomic regions that were sequenced, and sequencing depth, which is either the number of times a single nucleotide was sequenced or the average of that number across all sequenced nucleotides.
The number of reads reflects the volume of data generated by sequencing and is often expressed as a count of entries. In applications such as metagenomic Next-Generation Sequencing (mNGS), the number of reads is a crucial metric for detecting specific pathogenic microorganisms and aids in their characterization and quantification.
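The three quantities defined above can be computed from a list of reads in a few lines. This is a minimal sketch, not a standard tool: the function name `summarize_reads`, the toy reads, and the `target_length` of 16 bases are all assumptions made for illustration. Average depth is taken here as total sequenced bases divided by the length of the sequenced target.

```python
def summarize_reads(reads, target_length):
    """Return (number of reads, mean read length, average sequencing depth)."""
    n_reads = len(reads)
    total_bases = sum(len(read) for read in reads)
    mean_length = total_bases / n_reads if n_reads else 0.0
    # Average depth: how many times, on average, each target base was sequenced.
    average_depth = total_bases / target_length
    return n_reads, mean_length, average_depth


# Toy data (hypothetical): four reads of 8 bases over a 16-base target.
reads = ["ATCAGATA", "GGCATTAC", "TTAGACCA", "ATCAGATA"]
n_reads, mean_length, average_depth = summarize_reads(reads, target_length=16)
print(n_reads, mean_length, average_depth)
```

With real data the reads would come from the sequencer output rather than a hard-coded list, and the target length would be the size of the genome or transcriptome of interest.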
CD Genomics' high-throughput RNA sequencing and library construction services enable in-depth analysis of transcriptomes. CD Genomics provides robust transcriptome research services down to single-cell input levels for high-quality samples.
Step 1: RNA Extraction
The RNA from the sample of interest is extracted.
Step 2: Fragmentation of RNA
RNA molecules, typically thousands of bases long, are fragmented into smaller pieces. This fragmentation is necessary because the sequencer can only read short fragments (usually 200 to 300 bp).
Step 3: Reverse Transcription
The fragmented RNA is reverse transcribed into complementary DNA (cDNA), which is then converted into double-stranded DNA. Double-stranded DNA is more stable than RNA and is easier to amplify and manipulate.
Step 4: Addition of Sequencing Adapters
Sequencing adapters are added to the ends of the double-stranded DNA. These adapters contain sequences that are complementary to those on the sequencer chip, allowing the sequencer to recognize and sequence the DNA fragments efficiently. Different samples may use distinct adapter sequences, enabling multiplexing of samples in a single sequencing run. It's important to note that the efficiency of adapter addition may vary, leading to some DNA fragments not being recognized by the sequencer.
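The multiplexing idea can be sketched in code. The example below assumes, purely for illustration, that each read begins with a fixed-length sample barcode taken from that sample's adapter; real platforms usually read the sample index separately, so this is a simplification. The function `demultiplex`, the barcodes, and the reads are all hypothetical.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by their leading barcode (assumed layout)."""
    barcode_length = len(next(iter(barcode_to_sample)))
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Barcode not recognized: the read cannot be attributed to a sample.
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return by_sample, unassigned


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
pooled_reads = ["ACGTTTAGGC", "TGCAGGATCC", "ACGTCCATAG", "GGGGATATAT"]
by_sample, unassigned = demultiplex(pooled_reads, barcodes)
print(by_sample, unassigned)
```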
Step 5: PCR Amplification
PCR amplification is performed using primers designed based on the added adapter sequences. This amplification step selectively amplifies DNA fragments that contain the adapter sequences.
Step 6: Quality Control
The concentration and length of the constructed library are determined to ensure optimal sequencing performance. Libraries with appropriate concentration and length are selected to proceed with sequencing.
Step 7: Sequencing
The constructed library is subjected to sequencing using the chosen sequencing platform.
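After sequencing, the dataset typically comprises approximately 400 million RNA-seq reads, each stored as four lines of text. In the commonly used FASTQ format, these four lines are an identifier, the base sequence, a separator line, and the per-base quality scores. This data must be preprocessed before analysis.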
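The four rows per read correspond to the four-line records of the FASTQ format widely used for sequencer output. A minimal sketch of reading such records follows, assuming uncompressed text with exactly four lines per read; the function name `parse_fastq` and the two example records are made up for illustration.

```python
import io


def parse_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line, starts with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# Two hypothetical reads, four lines each.
example = io.StringIO(
    "@read1\nATCAGATA\n+\nIIIIIIII\n"
    "@read2\nGGCATTAC\n+\nIIII##II\n"
)
records = list(parse_fastq(example))
print(records)
```

For real files, the `io.StringIO` object would be replaced by an opened file handle. Established libraries handle edge cases that this sketch ignores.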
1. Data Preprocessing: Filtering of Substandard RNA-seq Reads
Substandard RNA-seq reads, such as those with low-quality base calls or those that are artifacts of the library chemistry, must be filtered out. Normally, a library fragment consists of a DNA insert flanked by two sequencing adapters. An abnormal fragment, however, may consist of only the two adapters joined together with no insert between them.
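The two filters described above can be sketched as follows. Several assumptions are made for illustration: quality strings use the Phred+33 encoding, a read is rejected when its mean quality falls below 20, and a fragment made only of the two flanking adapter (junction) sequences shows up as a read that starts directly with the adapter. The `ADAPTER` value, the threshold, and the function names are placeholders, not settings from any particular pipeline.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter sequence, for illustration only


def mean_phred(quality):
    """Mean base quality of a Phred+33 encoded quality string."""
    return sum(ord(char) - 33 for char in quality) / len(quality)


def keep_read(sequence, quality, min_mean_quality=20, adapter=ADAPTER):
    """Return True if the read passes both filters."""
    if not sequence:
        return False
    if mean_phred(quality) < min_mean_quality:
        return False  # low-quality base calls
    if sequence.startswith(adapter):
        return False  # no insert: the read runs straight into the adapter
    return True


# Hypothetical reads as (sequence, quality) pairs.
reads = [
    ("ATCAGATACCGT", "I" * 12),        # good read, mean quality 40
    ("GGCATTACGGTA", "#" * 12),        # low quality, mean quality 2
    (ADAPTER + "AC", "I" * 15),        # adapter-only fragment
]
kept = [sequence for sequence, quality in reads if keep_read(sequence, quality)]
print(kept)
```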
2. Alignment of High-Quality RNA-seq Reads to the Genome
Because the genome is very long, it is first broken into many short base sequences, which are indexed together with their chromosomal locations. The RNA-seq reads are likewise split into small segments. Each read segment is then matched against the indexed genome segments, and from these matches the chromosomal location of the read can be inferred.
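The index-and-match idea can be illustrated with a toy example. The sketch below indexes every length-`k` substring of a tiny made-up genome, splits a read into non-overlapping segments of the same length, and lets each exact segment match vote for a read start position. It assumes exact matches only and ignores mismatches, splicing, and the reverse strand, all of which real aligners must handle; all names and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(chromosomes, k):
    """Map each length-k genome segment to its (chromosome, position) list."""
    index = defaultdict(list)
    for name, sequence in chromosomes.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((name, pos))
    return index


def locate_read(read, index, k):
    """Return the (chromosome, 0-based start) supported by most segments."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):
        for name, pos in index.get(read[offset:offset + k], []):
            # A segment hit at pos implies the read starts at pos - offset.
            votes[(name, pos - offset)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy genome and read (hypothetical sequences).
genome = {"chr1": "ACGTTGCATGCCGATAGCTA", "chr2": "TTGACCGGTAACGTCAGGAT"}
index = build_index(genome, k=4)
location = locate_read("GCCGATAG", index, k=4)
print(location)
```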
3. Counting Reads per Gene
Once the chromosomal location of each RNA-seq read is determined, it becomes possible to ascertain whether a read falls within a specific gene. For instance, by knowing the coordinates of genes like Xkr4 (Chromosome 1, position: 3204563-3661579) and Rp1 (Chromosome 1, position: 4280927-4399322), the count of reads located at these coordinates can be tallied, yielding read counts for genes. This process enables the construction of a read count matrix.
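A minimal sketch of this counting step, using the Xkr4 and Rp1 coordinates quoted above, is shown here. The read positions and sample names are invented for illustration, a read is represented by a single position, and the gene coordinates are assumed to be inclusive at both ends. Real tools also deal with strand, exons, and reads that overlap more than one gene.

```python
GENES = {
    "Xkr4": ("chr1", 3204563, 3661579),
    "Rp1": ("chr1", 4280927, 4399322),
}


def count_reads_per_gene(read_positions, genes=GENES):
    """Count reads whose position falls inside each gene's coordinates."""
    counts = {gene: 0 for gene in genes}
    for chrom, position in read_positions:
        for gene, (gene_chrom, start, end) in genes.items():
            if chrom == gene_chrom and start <= position <= end:
                counts[gene] += 1
    return counts


def build_count_matrix(samples, genes=GENES):
    """Return {sample: {gene: count}}, a small read count matrix."""
    return {
        name: count_reads_per_gene(positions, genes)
        for name, positions in samples.items()
    }


# Hypothetical aligned read positions for two samples.
samples = {
    "sample1": [("chr1", 3300000), ("chr1", 3661579),
                ("chr1", 4300000), ("chr1", 5000000)],
    "sample2": [("chr1", 4280927), ("chr1", 4399322), ("chr2", 3300000)],
}
count_matrix = build_count_matrix(samples)
print(count_matrix)
```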
4. Sequencing Data Normalization
Because different samples may yield different numbers of reads that align to the genome, raw read counts are not directly comparable between samples. For example, Sample 1 might have 635 total reads, while Sample 2 has 1270, exactly double that of Sample 1. This does not mean that genes in Sample 2 are transcribed twice as much. It may simply mean that Sample 2 yielded more usable reads, for example because fewer of its reads were of low quality. To compare read counts accurately and reflect real differences in gene transcription, the read count for each gene needs to be adjusted. A simple method is to divide each gene's read count by the total read count for the sample. More complex normalization methods, such as RPKM, FPKM, and TPM, can also be employed.
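The effect of normalization can be shown with the totals from the example above. In the sketch below, the per-gene counts (chosen to sum to 635 and 1270) and the gene lengths are invented for illustration. The first function divides each count by the sample total. The second computes TPM as usually defined: counts are divided by gene length, then scaled so the values in a sample sum to one million.

```python
def fraction_of_total(counts):
    """Divide each gene's read count by the sample's total read count."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rates.values())
    return {gene: rate / scale * 1_000_000 for gene, rate in rates.items()}


# Hypothetical counts: Sample 2 has exactly twice the reads of Sample 1.
sample1 = {"gene1": 300, "gene2": 235, "gene3": 100}   # total 635
sample2 = {"gene1": 600, "gene2": 470, "gene3": 200}   # total 1270
lengths_bp = {"gene1": 2000, "gene2": 4000, "gene3": 1000}  # assumed lengths

normalized1 = fraction_of_total(sample1)
normalized2 = fraction_of_total(sample2)
tpm1 = tpm(sample1, lengths_bp)
tpm2 = tpm(sample2, lengths_bp)
print(normalized1, normalized2)
```

After either normalization the two samples give the same values for every gene, which matches the point that doubling the total read count does not by itself indicate doubled transcription.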