Submission of sequence data to NCBI archives
Next-generation sequencing, PacBio SMRT sequencing, and Nanopore sequencing can generate large volumes of sequence data in a single run. Raw reads or assembled sequences need to be submitted to a public sequence repository (DDBJ/ENA/GenBank, the members of the INSDC). The overwhelming majority of journals require this, because the accession numbers of these sequence data should be presented in the published papers. The submission portal (https://submit.ncbi.nlm.nih.gov/) is a programmatic interface for users to submit sequence data and download others’ sequence data. In addition to raw sequence data, you can also submit computationally assembled sequences, genomes, functional genomics data, microarray data, clinical data, genome variations, and other data types, such as PacBio methylation data. Submissions to SRA, GEO, dbGaP, or GenBank all qualify as acceptable submissions. In this article, we will introduce how to submit sequence data to GenBank.
GenBank submission
GenBank (https://submit.ncbi.nlm.nih.gov/subs/genbank/) accepts thousands of new sequence submissions per month from researchers worldwide. Commonly submitted sequences include mRNA sequences with coding regions, ribosomal RNA gene clusters, fragments of genomic DNA, and complete viral or organelle genomes. You can submit a single sequence or sets of sequences. If part of the sequence encodes a protein, a coding sequence (CDS) feature and the resulting conceptual translation should be annotated. Each submitted sequence record is assigned an accession number, usually within two working days. Submitters and users can view each sequence individually or as part of a set of sequences grouped by biological relatedness. Each set is contained in Entrez PopSet (https://www.ncbi.nlm.nih.gov/popset/), allowing researchers to view the relationships within the set through an alignment.
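To make the idea of a conceptual translation concrete, here is a minimal Python sketch that translates an annotated CDS using the standard genetic code. The function name `translate_cds`, the example sequence, and the CDS coordinates are illustrative assumptions, not part of any NCBI tool, and the sketch ignores alternative genetic codes, partial features, and reverse-strand features.

```python
# Standard genetic code (NCBI translation table 1), built from the
# conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(sequence, start, end):
    """Translate the CDS at 1-based, inclusive positions start..end."""
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = []
    for pos in range(0, len(cds), 3):
        # Codons with ambiguous bases are reported as X.
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":  # a stop codon ends the translation
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical record: a short mRNA with a CDS at positions 4..15.
mrna = "GGCATGGCCAAGTAAGGC"
print(translate_cds(mrna, 4, 15))
```

Comparing a translation like this with the protein you expect is a quick way to catch a wrong reading frame or wrong CDS coordinates before submission.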
You can directly submit ribosomal RNA (rRNA), rRNA-ITS, or Influenza sequences to GenBank. Other sequence types should be submitted with one of the alternative tools. For unassembled raw sequence reads, you can submit them to the Sequence Read Archive (SRA).
BankIt (https://www.ncbi.nlm.nih.gov/WebSub/?tool=genbank), a web-based submission tool, accepts all standard GenBank submissions except: (i) sequences with an alignment (you can use Sequin), (ii) raw read data (you can use Submission Portal-SRA), (iii) transcriptome shotgun assembly data (you can use Submission Portal-TSA), (iv) genome data (you can use Submission Portal-Genomes), and (v) rRNA, rRNA-ITS, or Influenza sequences (you can use Submission Portal).
Tbl2asn (https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/) is a command-line program that combines input sequences and tables to produce files appropriate for GenBank submission. The input files include sequences in FASTA format, organism information, and feature annotation. Submissions made with Tbl2asn should be emailed to gb-sub@ncbi.nlm.nih.gov.
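The inputs described above can be sketched in Python as follows. This is a minimal illustration, assuming the bracketed source-modifier style for FASTA definition lines and the tab-delimited five-column feature table layout; the helper names `fasta_record` and `feature_table`, the sequence ID, and the organism and product values are hypothetical, and the NCBI documentation should be consulted for the full set of supported modifiers and qualifiers.

```python
def fasta_record(seq_id, sequence, modifiers, width=60):
    """Return a FASTA record with bracketed source modifiers on the definition line."""
    mods = " ".join(f"[{key}={value}]" for key, value in modifiers.items())
    lines = [f">{seq_id} {mods}".rstrip()]
    # Wrap the sequence at a fixed line width.
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def feature_table(seq_id, features):
    """Return a five-column, tab-delimited feature table for one sequence."""
    lines = [f">Feature {seq_id}"]
    for start, end, key, qualifiers in features:
        # Columns 1-3: start, stop, and feature key.
        lines.append(f"{start}\t{end}\t{key}")
        # Columns 4-5: qualifier name and value, on their own lines.
        for name, value in qualifiers.items():
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# Hypothetical example record.
print(fasta_record("Seq1", "GGCATGGCCAAGTAAGGC", {"organism": "Escherichia coli"}))
print(feature_table("Seq1", [(4, 15, "CDS", {"product": "hypothetical protein"})]))
```

In practice the two strings would be saved as separate files that share a base name (for example with `.fsa` and `.tbl` suffixes) so that the program can pair each sequence with its annotation.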
Sequin (https://www.ncbi.nlm.nih.gov/Sequin/) is a stand-alone application that guides users through the submission process. Sequin can be used to submit sequences or small complete genomes, and it also supports annotation and analysis of nucleotide sequences. If you would like graphical viewing and editing options such as alignment editing, Sequin is a good choice. Submissions made with Sequin should be emailed to gb-sub@ncbi.nlm.nih.gov.
When submitting multiple, related sequences, both Tbl2asn and Sequin can accept the output of popular sequence-alignment packages such as PHYLIP, NEXUS, and FASTA + GAP. The alignments contribute to sequence annotation in the set.
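Before handing an alignment to either tool, it can help to confirm that it is well formed. The sketch below assumes that a FASTA + GAP file is an ordinary FASTA file in which gaps are written as `-` and every aligned row has the same length; the function names and the example alignment are hypothetical, and this check is not something the submission tools require you to run.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence ID to sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the definition line as the ID.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def alignment_length(records):
    """Return the shared row length, or raise if rows differ in length."""
    lengths = {name: len(seq) for name, seq in records.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"aligned rows differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


# Hypothetical two-sequence alignment in FASTA + GAP style.
example = """>SeqA
ATG-CCAAG
>SeqB
ATGGCC-AG
"""
print(alignment_length(read_fasta(example)))
```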
After submission
After GenBank submission, the GenBank annotation staff will check the following issues:
(i) The sequence length and molecule type (single molecule type or a mix of mRNA and genomic DNA).
(ii) Biological validity.
(iii) Absence of vector contamination.
(iv) Publication status. If the sequence is published, a PubMed ID can be added to the record so that the sequence and the publication are linked.
(v) Formatting and spelling.
If any problems are found, the annotator will contact the submitter by email for correction.
CD Genomics has a team of bioinformatics professionals who deal with quality control of raw reads, sequence alignment, genome assembly, genome mining, and comparative genomic studies. If you have any questions about data processing, please feel free to contact us.