How to Download KEGG Database: A Comprehensive Guide for Researchers

CD Genomics Blog

Posted on December 25, 2024

Introduction: What is KEGG and Why is it Important for Researchers

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for interpreting genomic and other molecular data in terms of pathways and higher-level biological systems. Started in 1995 by Minoru Kanehisa's group and now developed by Kanehisa Laboratories, it began as a genomic repository and has been updated continuously ever since. It is now a standard reference for researchers worldwide.

KEGG Overview. (Image source: KEGG official website, https://www.genome.jp/kegg/kegg1a.html)

Architectural Components

Two of KEGG’s core databases are:

1. KEGG ORTHOLOGY (KO): A system that classifies genes by shared function across organisms. By grouping genes into ortholog groups, researchers can:

Identify conserved genetic functionalities
Facilitate comparative genomic investigations
Enable cross-species molecular analysis

2. KEGG PATHWAY: A collection of graphical pathway maps depicting molecular interactions and reactions, organized into seven top-level categories:

Cellular Processes
Environmental Information Processing
Genetic Information Processing
Human Diseases
Metabolism
Organismal Systems
Drug Development
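The ortholog grouping described under KEGG ORTHOLOGY above can be sketched in a few lines of Python. This is a minimal illustration, not KEGG's own tooling. It assumes gene-to-KO assignments arrive as two tab-separated columns (the shape returned by KEGG's "link" operation), and the sample lines are hand-written for illustration and should be checked against live data. The function names are hypothetical.

```python
from collections import defaultdict

# Hand-written sample in the assumed "gene<TAB>ko" shape; verify against real output.
SAMPLE_LINK_OUTPUT = (
    "hsa:3098\tko:K00844\n"
    "hsa:3101\tko:K00844\n"
    "mmu:15275\tko:K00844\n"
    "hsa:2821\tko:K01810\n"
)


def group_genes_by_ko(link_text):
    """Return {KO id: {organism code: [gene ids]}} from tab-separated lines."""
    groups = defaultdict(lambda: defaultdict(list))
    for line in link_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene_id, ko_id = line.split("\t")
        organism, _, gene = gene_id.partition(":")
        groups[ko_id.removeprefix("ko:")][organism].append(gene)
    # Convert nested defaultdicts to plain dicts for predictable behavior.
    return {ko: dict(by_org) for ko, by_org in groups.items()}


def shared_kos(groups, organisms):
    """Return the KO ids that have at least one gene in every listed organism."""
    return sorted(
        ko for ko, by_org in groups.items()
        if all(org in by_org for org in organisms)
    )


groups = group_genes_by_ko(SAMPLE_LINK_OUTPUT)
print(shared_kos(groups, ["hsa", "mmu"]))
```

Grouping by KO first and filtering by organism second keeps the cross-species comparison a simple dictionary lookup.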

Organized Pathway Structure

KEGG pathways are organized in a three-level hierarchy:

Primary Level: The broad top-level categories listed above
Secondary Level: 43 subcategories within them, such as carbohydrate metabolism or signal transduction
Tertiary Level: Individual pathway maps, each annotating the molecular interactions it contains

Why is KEGG Important for Researchers?

For researchers, KEGG is more than a lookup table. It provides:

Integrated genomic and functional insights
Systematic exploration of molecular interactions
Visual representations of biochemical networks
Comprehensive disease mechanism analysis

By linking genomic sequences to functional annotation, KEGG supports work across genomics, systems biology, and pharmacology. It is best understood not as a single database but as a set of linked resources that connect genes to the pathways and systems in which they act.

Overview of the KEGG Database

KEGG is widely used in biological and biomedical research and spans several categories of information. The sections below describe its component databases and the types of data each one holds.

A conceptual diagram of the KEGG NETWORK database. (Minoru Kanehisa, et al., 2019)

Classification and Database Categories

The constituent databases under KEGG can be categorized as follows:

Classification Type	Database	Description
System Information	KEGG PATHWAY	KEGG Pathway Maps
System Information	KEGG BRITE	BRITE Functional Hierarchies
System Information	KEGG MODULE	KEGG Functional Unit Modules
Genomic Information	KEGG ORTHOLOGY	KEGG Ortholog Groups (KO)
Genomic Information	KEGG GENOME	Species with Complete Genomes in KEGG
Genomic Information	KEGG GENES	Catalog of Genes in Complete Genomes
Genomic Information	KEGG SSDB	KEGG Sequence Similarity Database
Chemical Information	KEGG COMPOUND	Metabolites and Other Small Molecules
Chemical Information	KEGG GLYCAN	Glycans
Chemical Information	KEGG REACTION	Biochemical Reactions
Chemical Information	KEGG ENZYME	Enzyme Nomenclature
Health Information	KEGG DISEASE	Diseases
Health Information	KEGG DRUG	Drugs
Health Information	KEGG ENVIRON	Health-Related Substances
Health Information	KEGG NETWORK	Disease-Related Network Elements
Health Information	KEGG MEDICUS	Health Information Resources
Health Information	JAPIC	Japan Pharmaceutical Information Center Database
Health Information	DailyMed	FDA Drug Labels (Link Only)

Specifics of the KEGG Pathway Database

The KEGG PATHWAY database is the most widely consulted part of KEGG and holds a large collection of biological pathway maps. These maps describe metabolic networks, disease mechanisms, and other biological processes. Each pathway has an identifier made of a short letter prefix followed by a five-digit number. The prefix indicates the pathway type:

Pathway Type	Description	Example
map	Reference pathways that summarize and represent well-established biological knowledge.	map00010 (Glycolysis / Gluconeogenesis)
org	Organism-specific pathways, prefixed with a KEGG organism code such as hsa, in which KO entries are replaced with the corresponding genes of that organism.	hsa00010 (Glycolysis / Gluconeogenesis in human)
ko	Reference pathways in which each node represents an ortholog group (KO entry).	ko00010 (Glycolysis / Gluconeogenesis)
ec	Reference pathways in which each node corresponds to an enzyme classification (EC number).	ec00010 (Glycolysis / Gluconeogenesis)
rn	Reference pathways in which each node represents a specific chemical reaction.	rn00010 (Glycolysis / Gluconeogenesis)
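As a small illustration of the identifier scheme in the table above, the sketch below splits a pathway identifier into its prefix and number and classifies it. It assumes every identifier is a two- to four-letter lowercase prefix followed by exactly five digits, and that any prefix other than map, ko, ec, or rn is an organism code. The function names are hypothetical.

```python
import re

# Prefixes that denote reference pathways, per the table above.
REFERENCE_PREFIXES = {"map", "ko", "ec", "rn"}

# Assumed identifier shape: lowercase prefix plus exactly five digits.
_PATHWAY_ID = re.compile(r"^([a-z]{2,4})(\d{5})$")


def split_pathway_id(pathway_id):
    """Return (prefix, number) for an identifier such as 'hsa00010'."""
    match = _PATHWAY_ID.match(pathway_id)
    if match is None:
        raise ValueError(f"not a recognized pathway identifier: {pathway_id!r}")
    return match.group(1), match.group(2)


def pathway_kind(pathway_id):
    """Classify an identifier as 'reference' or 'organism-specific'."""
    prefix, _ = split_pathway_id(pathway_id)
    return "reference" if prefix in REFERENCE_PREFIXES else "organism-specific"


def to_reference_id(pathway_id):
    """Map any pathway identifier to its 'map' reference counterpart."""
    _, number = split_pathway_id(pathway_id)
    return "map" + number


print(pathway_kind("hsa00010"), to_reference_id("hsa00010"))
```

Because the five-digit number is shared across types, converting between an organism-specific pathway and its reference map only requires swapping the prefix.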

Key Types of Data Available in the KEGG Database

The KEGG database is organized into several components, each offering a different view of molecular data:

PATHWAY Component

The PATHWAY component is a repository of graphical maps of molecular interactions, covering metabolic, signaling, and physiological processes. These maps let researchers trace how individual genes and compounds fit into larger biological processes.

BRITE Hierarchical Classification

BRITE organizes biological functions into hierarchical classifications. These hierarchies help manage large genomic datasets and clarify how molecular entities relate to one another across biological systems.

MODULE Subpathway Analysis

MODULE focuses on evolutionarily conserved biochemical reaction modules, which supports comparative genomics. Researchers can identify core reaction steps that recur across diverse organisms.

GENES Comprehensive Repository

An extensive catalog documenting genetic information from multiple organisms, the GENES section provides:

Detailed functional annotations
Associated molecular pathway connections
Comprehensive sequence data

This resource serves critical roles in:

Genome-wide association studies
Advanced genotyping investigations
Molecular characterization efforts

Together, these components give researchers a consistent framework for investigations at the genomic, systems, and molecular levels.

Guide to Downloading Data from the KEGG Database:

Downloading data from the KEGG Database is a systematic process tailored for researchers requiring structured datasets for in-depth biological analysis. Here is a concise guide to navigate this procedure:

Step 1: Access the KEGG Website

Begin at the official KEGG website (https://www.kegg.jp/). This website is the central access point for the databases and resources that KEGG offers.

Step 2: Select the Appropriate Database or Dataset

Upon reaching the KEGG homepage, identify and select the database or dataset of interest. Options range from KEGG PATHWAY and KEGG GENOME to KEGG DISEASE, among others. Navigation through specific categories can be accomplished by selecting the relevant links within the KEGG Databases section.

Step 3: Utilize KEGG’s Download Tools

KEGG offers several routes for retrieving data, and the right one depends on how much data is needed:

REST API: This method allows automated, scripted access to individual entries and lists. It is suited to targeted retrieval rather than to copying entire databases.
FTP Download: For bulk copies of complete databases, KEGG offers FTP (File Transfer Protocol) access, which requires a paid subscription.
Download Scripts: Scripts built on the REST API, whether written in-house or taken from third-party packages, simplify repeated retrieval of particular data types, such as pathway maps, genetic sequences, or chemical data.
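The REST route listed above can be sketched with Python's standard library alone. This is a minimal sketch, not an official client. It assumes the public endpoint https://rest.kegg.jp and the URL pattern of an operation name followed by slash-separated arguments. The helper names kegg_url and kegg_fetch are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

KEGG_REST_BASE = "https://rest.kegg.jp"

# Operation names accepted by this sketch.
OPERATIONS = {"info", "list", "find", "get", "conv", "link", "ddi"}


def kegg_url(operation, *arguments):
    """Build a REST URL such as https://rest.kegg.jp/list/pathway/hsa."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    if not arguments:
        raise ValueError("at least one argument is required")
    # Keep ':' and '+' literal, since KEGG identifiers and ID lists use them.
    path = "/".join(quote(argument, safe=":+") for argument in arguments)
    return f"{KEGG_REST_BASE}/{operation}/{path}"


def kegg_fetch(operation, *arguments, timeout=30):
    """Fetch the response body as text. Performs a network request."""
    with urlopen(kegg_url(operation, *arguments), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(kegg_url("list", "pathway", "hsa"))
```

With a network connection, calling kegg_fetch("list", "pathway", "hsa") would request the list of human pathways, and kegg_fetch("get", "hsa00010") would request a single pathway entry. Check KEGG's usage conditions before automating requests.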

Step 4: Select the Desired Data Format

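The formats available depend on the entry type and the retrieval method:

JSON: A versatile format that supports programmatic manipulation of data. KEGG offers it for BRITE hierarchy files.
KGML: An XML representation of a pathway map, suited to programmatic handling of pathway structure.
Flat Files: These text-based (.txt) records use labeled fields and allow for easy manual inspection. List-style queries return tab-delimited text instead.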
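To make the text formats above concrete, here is a hedged sketch of two small parsers. It assumes flat-file records start each field name in the first column, indent continuation lines, and end with a line of three slashes. It also assumes list-style output has one tab-separated record per line. Both samples are abbreviated, hand-written illustrations rather than complete KEGG entries.

```python
# Abbreviated, hand-written sample in the assumed flat-file layout.
SAMPLE_FLAT_FILE = "\n".join([
    "ENTRY       hsa00010                    Pathway",
    "NAME        Glycolysis / Gluconeogenesis - Homo sapiens (human)",
    "CLASS       Metabolism; Carbohydrate metabolism",
    "GENE        3101  HK3; hexokinase 3",
    "            3098  HK1; hexokinase 1",
    "///",
])

# Hand-written sample of tab-delimited list output.
SAMPLE_LIST_OUTPUT = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)\n"
)


def parse_flat_record(text):
    """Return {field name: [values]} for the first record in the text."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):
            break  # end-of-record marker
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: continuation of the current field.
            if current is not None:
                fields[current].append(line.strip())
            continue
        parts = line.split(None, 1)
        current = parts[0]
        fields.setdefault(current, [])
        if len(parts) > 1:
            fields[current].append(parts[1].strip())
    return fields


def parse_tab_table(text):
    """Return a list of (identifier, description) tuples."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            identifier, _, description = line.partition("\t")
            rows.append((identifier, description))
    return rows


record = parse_flat_record(SAMPLE_FLAT_FILE)
print(record["NAME"][0], len(record["GENE"]))
```

Keeping every field as a list of lines avoids guessing at field-specific structure. Callers can split individual lines further as needed.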

Step 5: Download and Extract the Data

After finalizing your data format, proceed to download. For FTP or script-based downloads, adhere to specified instructions for extracting and organizing the files appropriately. Ensure that your system has the requisite tools or software to efficiently manage large datasets, especially crucial for genome-wide or pathway map data.
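When many entries are needed through the REST interface, requests are usually split into small batches. The sketch below assumes a limit of ten entries per "get" request and joins identifiers with a plus sign. The one-second pause and the file naming are illustrative choices, not KEGG requirements, and the function names are hypothetical.

```python
import time
from pathlib import Path
from urllib.request import urlopen

MAX_ENTRIES_PER_GET = 10  # assumed per-request limit for the "get" operation


def batched(ids, size=MAX_ENTRIES_PER_GET):
    """Yield consecutive slices of at most `size` identifiers."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def get_urls(ids):
    """Build one 'get' URL per batch, joining identifiers with '+'."""
    return [
        "https://rest.kegg.jp/get/" + "+".join(batch)
        for batch in batched(list(ids))
    ]


def download_all(ids, out_dir, pause=1.0):
    """Save each batch to its own text file. Performs network requests."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(get_urls(ids)):
        with urlopen(url, timeout=30) as response:
            text = response.read().decode("utf-8")
        (out_path / f"batch_{index:04d}.txt").write_text(text, encoding="utf-8")
        time.sleep(pause)  # pause between requests to avoid hammering the server


example_ids = [f"hsa{number:05d}" for number in range(10, 33)]
print(len(get_urls(example_ids)))
```

Only the URL-building step runs here. download_all is defined but not called, so nothing is fetched until it is invoked explicitly.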

Step 6: Conduct Data Analysis

Post-download, leverage bioinformatics tools for comprehensive data analysis. This stage may involve examining the data for metabolic pathway dynamics, annotating gene functions, or facilitating disease modeling studies.

This guide is crafted to streamline the process of data acquisition from KEGG, ensuring an efficient pathway from data selection to analytical application in genomic or biomedical research.

(A) Pathway analysis based on the KEGG database. (B) Enrichment analysis based on SMPDB. (C) Metabolic network of the crucial metabolites and significant metabolic pathways in the KEGG general metabolic pathway map. (Zhuang, F., et al., 2022)

Applications of KEGG Data in Molecular Research

KEGG is a key resource for researchers studying complex biological systems in genomics, pharmacology, and systems biology. Its molecular interaction maps and annotated pathway information support a range of analytical approaches.

1. Pharmaceutical Target Identification

KEGG pathways play a pivotal role in pharmaceutical research, facilitating the systematic exploration of potential therapeutic interventions. Researchers leverage these intricate molecular network representations to:

Identify candidate molecular targets for therapeutic development
Analyze potential drug repurposing strategies
Comprehend intricate drug-target interaction mechanisms

A study by Chen et al. (2015) showed how Gene Ontology and KEGG pathway enrichment analysis could characterize classes of drug targets by their underlying biological functions, giving researchers a conceptual framework for drug discovery.
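Pathway enrichment analysis of the kind mentioned above is often framed as a one-sided hypergeometric test. The test asks how likely it is to see at least the observed number of pathway genes in a gene list of a given size drawn from the background. The sketch below shows that generic calculation using only the standard library. It is not necessarily the procedure used in the cited study, and the parameter names and example numbers are illustrative.

```python
from math import comb


def enrichment_p_value(hits, query_size, pathway_size, background_size):
    """Return P(X >= hits) under a hypergeometric model.

    hits            -- genes in the query list that belong to the pathway
    query_size      -- total genes in the query list
    pathway_size    -- total genes annotated to the pathway
    background_size -- total genes in the background set
    """
    upper = min(query_size, pathway_size)
    total = comb(background_size, query_size)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, query_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Illustrative numbers: 8 of 200 query genes fall in a 100-gene pathway,
# against a background of 20,000 genes.
print(enrichment_p_value(8, 200, 100, 20000))
```

In practice the test is repeated for every pathway, so the resulting p-values need a multiple-testing correction before interpretation.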

2. Disease Mechanism Elucidation

The database helps map molecular networks, which is key to understanding disease mechanisms. By integrating genetic variation data with signaling pathway information, researchers can:

Visualize genetic perturbations within molecular networks
Identify potential biomarkers
Understand disease progression at the molecular level

Kanehisa et al. (2019) introduced the KEGG NETWORK database, which enables sophisticated visualization of how genetic variations influence cellular signaling pathways.

3. Metabolomics and Genomic Integration

KEGG helps connect genomic data with metabolic processes. Researchers utilize the database to:

Interpret high-throughput experimental data
Map metabolic pathways across diverse biological systems
Correlate genetic information with metabolic functionalities

Kanehisa’s (2016) research highlighted the database’s utility in plant genomics, demonstrating its versatility across biological domains.

KEGG pathway analysis of proteomics data. (Li, Z., et al., 2020)

4. Omics Data Synthesis

Advanced bioinformatics tools now facilitate more comprehensive analyses by integrating KEGG data with multiple omics datasets. Innovative approaches, such as the "ggkegg" package introduced by Sato et al. (2023), enable:

Enhanced visualization of complex biological networks
Simultaneous analysis of transcriptomic and proteomic data
Streamlined pathway enrichment investigations

5. Oncological Research Applications

In cancer research, KEGG pathways provide crucial insights into tumorigenesis and disease progression. Researchers like Kim et al. (2018) have developed specialized systems, such as BRCA-Pathway, which:

Integrate genomic cancer databases
Visualize signaling network alterations
Enhance understanding of molecular mechanisms underlying cancer development

6. Computational Analysis Advancements

The emergence of specialized bioinformatics tools has significantly enhanced KEGG data analysis capabilities. Recent developments, exemplified by Pedersen et al. (2023), include:

Creation of dedicated analysis packages
Improved pathway visualization techniques
Simplified enrichment analysis protocols

KEGG goes beyond traditional databases, providing a resource that links genomic data, molecular interactions, and biological understanding. Its multifaceted applications continue to drive innovation across research domains, from pharmaceutical development to fundamental biological investigations.

KEGG Database: Current Statistics and Usage

As of December 2024, KEGG continues to grow and remains a cornerstone of bioinformatics.

A KEGG Global Metabolic Pathway generated with the KEGGscape app. (Nishida, K., et al., 2014)

Current Statistics:

Pathway Maps and Gene Catalogs:

KEGG hosts an extensive collection of 576 pathway maps, encompassing various metabolic, signaling, and biochemical pathways.
The database incorporates over 1.3 million references, offering a comprehensive dataset for diverse research initiatives.

Diversity of Organisms:

The database catalogs over 56 million gene entries from a wide range of organisms, which supports comparative genomics and multi-species research. It also contains 27,293 ortholog groups, which are central to identifying conserved gene functions across species.

Future Directions: What’s Next for KEGG

With the progression of biological sciences, the KEGG database is poised for several enhancements to support the research community more effectively.

Integration with Other Databases:
Future updates to KEGG are likely to include enhanced integration with complementary databases such as UniProt and Gene Ontology, enriching the accessibility and functionality of data for researchers.
Expansion into Metagenomics and Personalized Medicine:
As fields like metagenomics and personalized medicine gain prominence, KEGG is expected to expand its resources in these areas. This would involve providing more detailed genomic and functional data tailored to individual organisms and microbial communities.

If you want to learn about Gene Ontology, you can read the following article:

Comprehensive Guide to Gene Ontology (GO) Analysis and Its Applications in Genomics

Conclusion

KEGG remains an invaluable resource for research across genomics, systems biology, and drug discovery. Its collection of biological pathways, genes, and chemical compounds provides insights needed to understand complex biological systems.

To leverage KEGG’s powerful datasets, researchers are encouraged to download resources pertinent to their studies, whether it be analyzing metabolic pathways, exploring gene functions, or investigating potential drug targets. KEGG’s wide range of data can greatly advance scientific research.

For specialized bioinformatics services or genomic data analysis, consider exploring CD Genomics’ solutions. Our experts are equipped to assist with complex data interpretation and provide customized services tailored to your specific research objectives.

References

Chen, L., Chu, C., Lu, J., Kong, X., Huang, T., & Cai, Y.-D. (2015). Gene Ontology and KEGG Pathway Enrichment Analysis of a Drug Target-Based Classification System. PLoS ONE, 10(5), e0126492. https://doi.org/10.1371/journal.pone.0126492
Kanehisa, M., Sato, Y., Furumichi, M., Morishima, K., & Tanabe, M. (2019). New approach for understanding genome variations in KEGG. Nucleic Acids Research, 47(D1), D590–D595. https://doi.org/10.1093/nar/gky962
Kanehisa, M. (2016). KEGG Bioinformatics Resource for Plant Genomics and Metabolomics. In: Edwards, D. (ed.) Plant Bioinformatics. Methods in Molecular Biology, vol 1374. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3167-5_3
Sato, N., Uematsu, M., Fujimoto, K., Uematsu, S., & Imoto, S. (2023). ggkegg: analysis and visualization of KEGG data utilizing the grammar of graphics. Bioinformatics, 39(10), btad622. https://doi.org/10.1093/bioinformatics/btad622
Kim, I., Choi, S., & Kim, S. (2018). BRCA-Pathway: a structural integration and visualization system of TCGA breast cancer data on KEGG pathways. BMC Bioinformatics, 19(Suppl 1), 42. https://doi.org/10.1186/s12859-018-2016-6
Pedersen, T.L., & others (2023). ggkegg: analysis and visualization of KEGG data utilizing the tidygraph framework for network analysis in R. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac457
Zhuang, F., Bai, X., Shi, Y., Chang, L., Ai, W., Du, J., … & Hong, T. (2022). Metabolomic profiling identifies biomarkers and metabolic impacts of surgery for colorectal cancer. Frontiers in Surgery, 9, 913967. https://doi.org/10.3389/fsurg.2022.913967
Nishida, K., Ono, K., Kanaya, S., & Takahashi, K. (2014). KEGGscape: a Cytoscape app for pathway data integration. F1000Research, 3. https://doi.org/10.12688/f1000research.4524.1
Li, Z., Li, X., He, X., Jia, X., Zhang, X., Lu, B., … & Dong, Z. (2020). Proteomics reveal the inhibitory mechanism of levodopa against esophageal squamous cell carcinoma. Frontiers in Pharmacology, 11, 568459. https://doi.org/10.3389/fphar.2020.568459