The Human Genome Project (HGP) was initiated in 1990 to achieve two central goals: 1) to analyze the structure of human DNA and 2) to locate all human genes. Recently, we have successfully achieved the first goal of obtaining a complete and contiguous DNA sequence of the human genome. However, achieving the second goal has been much more complex than initially anticipated, although we have gained a much better understanding of the location and function of thousands of human genes.
Scientists from several countries revisited the goals of the Human Genome Project and reviewed the progress, challenges, and possible responses for four specific efforts needed to complete the annotation of human genes in the coming years:
(1) Completing the list of protein-coding genes and their various isoforms.
(2) Completing the list of RNA genes of all lengths and types.
(3) Identifying medically important genes and gene variants and linking them to specific diseases.
(4) Refining the techniques required for human gene annotation.
The annotation of protein-coding genes has been a focal point of the Human Genome Project. Now that the sequences have been determined, the scientific community is gradually converging on a consensus about the identity of these genes, although annotation remains a work in progress.
In the 1980s, the number of human genes was estimated at 50,000 to 100,000, and the estimates have decreased steadily since then. The first publication of the human genome reduced the estimate to 30,000 to 40,000, a later revision lowered it to 25,000, and the current count stands at just under 20,000 genes. A recent database release, GENCODE version 41, lists 19,370 protein-coding genes and illustrates this continuous refinement. These adjustments result from several advances: meticulous manual review, improved computational annotation methods and analysis, and the growing volume of high-quality experimental transcription data. Despite the overall decrease in gene numbers, new protein-coding genes and alternative isoforms of known genes continue to be identified.
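Gene counts of this kind are typically derived by tallying the gene records in an annotation file. The following is a minimal sketch, assuming a GENCODE-style GTF layout (nine tab-separated columns, the feature type in the third column, and a `gene_type` attribute in the ninth). The sample rows and the identifiers `GENE_A`, `GENE_B`, and `GENE_C` are made up for illustration and are not real annotation; the function names are hypothetical.

```python
from typing import Dict, Iterable


def parse_attributes(column: str) -> Dict[str, str]:
    """Parse a GTF attribute column such as: gene_id "X"; gene_type "lncRNA";"""
    attrs: Dict[str, str] = {}
    for item in column.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Keep the first value if a key (for example "tag") is repeated.
        attrs.setdefault(key, value.strip('"'))
    return attrs


def count_protein_coding_genes(gtf_lines: Iterable[str]) -> int:
    """Count distinct gene_ids on 'gene' rows whose gene_type is protein_coding."""
    gene_ids = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # ignore transcript, exon, and other feature rows
        attrs = parse_attributes(fields[8])
        if attrs.get("gene_type") == "protein_coding":
            gene_ids.add(attrs.get("gene_id"))
    return len(gene_ids)


def make_row(feature: str, start: int, end: int, attributes: str) -> str:
    """Build one tab-separated GTF row for the made-up example."""
    return "\t".join(
        ["chr1", "EXAMPLE", feature, str(start), str(end), ".", "+", ".", attributes]
    )


# Made-up rows for illustration only.
SAMPLE = [
    "##description: invented example rows, not real annotation",
    make_row("gene", 100, 900, 'gene_id "GENE_A"; gene_type "protein_coding";'),
    make_row("transcript", 100, 900, 'gene_id "GENE_A"; gene_type "protein_coding";'),
    make_row("gene", 2000, 2500, 'gene_id "GENE_B"; gene_type "lncRNA";'),
    make_row("gene", 4000, 5200, 'gene_id "GENE_C"; gene_type "protein_coding";'),
]

print(count_protein_coding_genes(SAMPLE))  # prints 2
```

Counting only rows of type `gene` avoids counting a gene once per transcript or exon, which matters because a single gene can have many isoforms.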
A noteworthy collaborative effort known as MANE (Matched Annotation from NCBI and EMBL-EBI) has recently introduced a nearly comprehensive dataset featuring one isoform for each protein-coding gene. This initiative has achieved consensus between two leading annotation projects, RefSeq (NCBI) and Ensembl/GENCODE (EMBL-EBI). MANE 1.0 comprises 19,062 gene loci, covering 95% of the protein-coding loci in major human gene catalogs.
Non-coding RNAs (ncRNAs) are RNA molecules that are transcribed from DNA and lack protein-coding capacity, yet are crucial for cellular functions. Identifying functional ncRNAs is a significant annotation challenge, because many transcribed RNA sequences may have no functional relevance across diverse cellular and environmental conditions. The term "gene" is reserved for RNAs with established functions, which narrows the scope of annotation efforts. At present, most annotation efforts concentrate on exhaustively cataloging ncRNA transcripts rather than on classifying them by function.
An inherent challenge in annotating ncRNAs lies in assigning functional labels. In contrast to protein-coding genes, where ample a priori functional evidence exists, and robust computational methods based on primary sequence information facilitate function prediction, the scenario is markedly different for ncRNAs. Our understanding of these molecules is limited, and validated methods for predicting their functions based on sequence alone are lacking. Consequently, recent efforts in ncRNA gene annotation aim to delineate the various types of evidence supporting them, such as tissue-specific expression levels, even when their functional roles remain elusive. The emphasis is on characterizing diverse facets of evidence, acknowledging the complexity of non-coding RNA functionality.
The annotation of human genes has crucial implications for the diagnosis and treatment of genetic disorders. In the comprehensive OMIM catalog, over 5,000 genes and a multitude of variants are linked to single-gene disorders and disease susceptibility. For example, the BRCA Exchange database documents over 34,000 variants of the BRCA1 gene, of which 2,228 are designated as pathogenic.
The accuracy and comprehensiveness of gene and transcript models play a pivotal role in evaluating the pathogenic potential of variants. Tools like PolyPhen, REVEL, and Variant Effect Predictor (VEP) rely on the predicted open reading frames of transcripts to determine variant effects. Moreover, precise exon boundary annotation is essential for designing the oligonucleotide baits and PCR primers used in clinical diagnostic analyses that employ targeted capture sequencing. Even when whole genome sequencing (WGS) is used for diagnostic purposes, clinicians typically exclude variants outside annotated exons from consideration.
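The dependence of variant interpretation on exon annotation can be illustrated with a small sketch. It is not how PolyPhen, REVEL, or VEP work internally; it only shows, under simplified assumptions, that the same variant position is classified differently depending on whether the exon containing it has been annotated. The coordinates, the two transcript models, and the function name `classify_position` are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive (start, end) coordinates


def classify_position(pos: int, exons: List[Exon]) -> str:
    """Label a genomic position by whether it falls inside any annotated exon."""
    for start, end in exons:
        if start <= pos <= end:
            return "exonic"
    return "outside_annotated_exons"


# Two hypothetical models of the same gene: the first lacks the middle exon.
incomplete_model: List[Exon] = [(100, 200), (500, 600)]
complete_model: List[Exon] = [(100, 200), (300, 350), (500, 600)]

variant_pos = 320  # made-up variant position inside the middle exon

print(classify_position(variant_pos, incomplete_model))  # prints outside_annotated_exons
print(classify_position(variant_pos, complete_model))    # prints exonic
```

With the incomplete model, a filter that keeps only exonic variants would discard this position before any pathogenicity assessment took place.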
The main challenge in this domain is the establishment of a clinical standard. At present, clinical laboratories predominantly work with the GRCh37 (hg19) human genome assembly and use RefSeq transcripts as the reference for disease-associated genes, often based on literature reports. This approach faces two significant issues: first, not all RefSeq transcripts align perfectly with the GRCh37 reference genome, and second, the chosen transcripts may not contain the features crucial for clinical diagnosis or be the most relevant transcripts for interpretation. A robust clinical standard is needed to improve the precision and reliability of genetic annotation in clinical practice.