How-to Guide for GenBank Sequence Data Submission, TreeBase and Dryad

By Peter Unmack

Many journals now require that your data be deposited in one to three online databases: GenBank, Dryad and TreeBase. The following provides brief descriptions of Dryad and TreeBase submissions and detailed instructions for how to submit DNA sequences to GenBank. If you find any errors or have suggestions for improvements, please email me.

Dryad submission

Dryad submissions are extremely straightforward, quick and easy. Dryad should be used to submit the sequence dataset(s) that you analyzed in your paper. It is especially important to submit datasets if you have aligned your sequences in any way (GenBank does not preserve any aspect of your alignment), if you used phased data, or if you collapsed your sequences to haplotypes (e.g., it is nice to submit a file with all of your individual sequences as well as one with only the haplotypes). If you created an Arlequin file, or anything else that takes a bit of time and effort to create, then this should also be uploaded to Dryad.

The easiest way to submit to Dryad is to wait until your paper is accepted, then ask the journal to populate a submission for you (not all journals do this, but many can). That way you don't have to enter any of the bibliographic information. There are three stages of submission. Page 1 is the bibliographic information. In Page 2 you upload a datafile, provide a short description and, for more complicated datasets, provide a readme file. In Page 3 you can upload another dataset, which sends you back to a new copy of Page 2. Once finished you submit your files, which will be reviewed and approved within a couple of days. Once approved you will get a DOI that you should reference in the final version of your paper. I usually add the DOI in the final proof phase.

Some Dryad partner journals have chosen to allow reviewers access to the data during the article's peer review process. In such cases the author could be asked to deposit their data in Dryad at the time they submit their article for review. If their article is not accepted by the journal, their data files can remain in Dryad for a year, and could be associated with an article submitted to or accepted by another journal, saving them the need to re-upload. Diagrams of both Dryad's basic and review workflows are at http://wiki.datadryad.org/Submission_Integration. These workflows pertain to journals that have implemented manuscript submission with Dryad.

If you have questions about any aspects of Dryad then check their FAQs for depositors at http://www.datadryad.org/pages/faq#depositing and/or send questions about Dryad deposition to help@datadryad.org.

TreeBase submission tips

More and more journals require you to submit any trees from your article to TreeBase. This is more complicated than Dryad but still relatively easy, although the online instructions are a bit thin. There are some instruction videos that help explain the process, but it takes a bit of time to watch them all. Once you have done a couple of submissions and become familiar with the process, submission gets much quicker.

To format your data they suggest using Mesquite, and the process is relatively easy. You simply upload your nexus file (it is easiest to use a nexus formatted data file rather than phylip format, as the results from the import get messed up if the sequence names in your phylip file are longer than 10 characters). Then upload your tree file (nexus or newick format). Note that the taxon names in the two files must match.

The painful part is renaming your OTUs, as these should consist of the genus and species name. I usually add on additional details so that the data will match what is in the published tree (e.g., haplotype number, individual number, etc.). The key point when uploading is that if you already have the names formatted as genus species, separate the two with an underscore. Anything after the species name should be separated by something else (I use a period). But when you add taxon names in Mesquite, separate them with a space. If you use an underscore, Mesquite will put all taxon names in quotes, which will mess up the taxon name lookup within TreeBase (if this happens, just open the nexus file and search and replace all quotes).

If you have multiple files, such as different genes for the same group of taxa, then be sure to give your data matrix and trees distinct names in Mesquite, as that will make your submission easier. The main quirk within TreeBase is that you must submit/save each time you finish entering data in a particular section or else it does not get saved.
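The naming convention above (underscore between genus and species, a period before anything extra) and the quote clean-up can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the quote removal is the same blunt search and replace described above, so it strips every quote character in the file.

```python
def treebase_label(genus, species, *extras):
    """Build an OTU label: genus_species, then any extra details after periods."""
    label = f"{genus}_{species}"
    for extra in extras:
        label += f".{extra}"
    return label


def strip_quotes(nexus_text):
    """Remove all single and double quotes from the text of a nexus file."""
    return nexus_text.replace("'", "").replace('"', "")


# Hypothetical example: a species name plus a haplotype number.
print(treebase_label("Melanotaenia", "australis", "hap3"))
```

Check the result in a text editor before uploading, as a nexus file may legitimately contain quotes outside the taxon names.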

GenBank submission

All of your sequence data gets submitted to GenBank. Some people will submit every individual they sequence, others will submit haplotypes. Note that if you phase your data, you should submit unphased data to GenBank.

It is important to note that there are many ways of doing what I describe below, both in terms of preparing fasta files as well as things within BankIt. BankIt is a somewhat cryptic tool. I have no idea what at least 90% of the "stuff" in there does or means, even after reading the help files. Hopefully you find my guide to be helpful and that it makes the submission process less painful.

Fasta file preparation

These instructions were originally written for using Sequin for data submission, but that service was stopped in January 2021 and data prepared in Sequin are no longer accepted. So there may be a quirk or two; if you find something I've missed please let me know. Sequences are now submitted via the web interface called BankIt.

The idea here is to pre-enter as much information as possible in order to do as little as possible in the online system BankIt. Most of the instructions below are fairly detailed (which makes the instructions long), but once you have used this method a couple of times you will not need to refer to most of these details.

If you have multiple non-adjacent genes (i.e., not continuous stretches of sequence data), then create a separate fasta file for each one. If they are continuous then create a single fasta file.

The OTU or sequence names you use can be tricky in BankIt in some situations. If you have multiple individuals within a species it is often better to use the isolate code rather than the species name, as BankIt gets confused and may think they all have the same label, i.e., Melanotaenia_australis_I and Melanotaenia_australis_II will confuse BankIt and produce an error which messes up your submission.

How you label your initial file will depend on your circumstances and how much information you wish to include with your sequences. While most journals require that sequences are deposited in GenBank, unfortunately this does not mean that they must be clearly referable to what is in your paper. To be useful, your sequence name (OTU code) should match whatever is used in your tree, haplotype network, etc., as published. In some cases this may be tricky (e.g., if multiple OTUs in your tree have the same label), but most sequence names should be easily related to what is in your paper.

I minimally try and include the sequence name, my own DNA code for that specific sample (which is for my own benefit), locality data (from what I provide in the paper) and any museum specimen codes. I provide those details irrespective of whether it is a phylogenetic dataset based on species names or phylogeographic dataset based on haplotype numbers.

These steps are all made easier if you rename the sequences in your datafile prior to making your tree; that way the names will already be the same, or very close to it (plus if you have to rerun your analysis then you don't have to re-label all of your OTUs). In the process of renaming OTUs I always save a version which has both the original label (for which I use the DNA sample code) and the final label. This makes the GenBank submission much easier. This can easily be done for haplotype datasets if you use MacClade (which is a bit defunct now as it no longer runs on newer Macs): when collapsing a dataset to unique haplotypes it has an option to include the original OTU name with the new haplotype name.
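If MacClade is not available, the same idea (collapse to unique haplotypes while keeping the original OTU names) can be sketched in Python. This is a minimal sketch, not a replacement for a proper tool: it treats sequences as identical only if the strings match exactly, so ambiguity codes, missing data and gaps are not handled, and the "Hap" prefix is my own choice.

```python
def collapse_to_haplotypes(records, prefix="Hap"):
    """Group identical sequences.

    records: list of (name, sequence) tuples.
    Returns a list of (haplotype_label, [original names], sequence),
    numbered in order of first appearance.
    """
    groups = {}
    for name, seq in records:
        # Upper-case so that 'acgt' and 'ACGT' count as the same haplotype.
        groups.setdefault(seq.upper(), []).append(name)
    return [
        (f"{prefix}{i}", names, seq)
        for i, (seq, names) in enumerate(groups.items(), start=1)
    ]


# Hypothetical sample codes and sequences.
samples = [("PU01", "ACGT"), ("PU02", "ACGA"), ("PU03", "acgt")]
for label, names, seq in collapse_to_haplotypes(samples):
    print(label, " ".join(names), seq)
```

Keeping the list of original names next to each haplotype gives you both labels in one place, which is what makes the later GenBank steps easier.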

I usually either first order sequences based on the species name or locality depending on the dataset (phylogeny vs. phylogeography). I find this is easy to do in the program BioEdit which can sort by sequence name or via dragging them or cutting and pasting sequences, plus you can easily paste over sequence names if you rename stuff from another source (like a species list in Excel). My experience is all based on BioEdit, but there is a chance that other programs may behave slightly differently when saving and opening fasta files that have extra information in the sequence name (which we add in before submitting to GenBank).

Open or paste your fasta file into Word. Search and replace ">" with ">^t" (^t is Word's code for a tab). Select all (control-a), copy (control-c) and paste into Excel (control-v). There are a couple of ways to manage the data in Excel: either use filters (but then you can't paste into multiple rows when filtered) or use sorting. These instructions assume sorting was used. If sorting, insert a column before the first column and number it from 1 to n. Select all, then sort by column B, then column A. This will separate your sequence names from the sequences and keep the sequence names in the order you originally had them in. You will need to cut and paste your sequence names from the bottom of the spreadsheet to the top. If you don't care about the order of the sequence names, then simply sort by column C, then column A, and all the names will be at the top.

I usually add the following definitions to the first row and then drag them (from their bottom right corner) to fill in the other rows. Obviously the organism, isolate, country and specimen voucher need to be changed appropriately to match each sample. This all involves manipulation in Excel using the concatenate function and copying and pasting stuff around. Note that in the country field the country must be followed by a colon if you include additional locality data.

[organism=Amniataba affinis]

[molecule=DNA]

[moltype=genomic]

[location=mitochondrion] or [location=genomic] (if you have nuclear DNA. It is different for each genome type)

[Isolate=Aaffi.Fly.1] (that is my DNA code for that individual)

[Country=Papua New Guinea: Fly R.] (locality data, note that you must include the : after the country name to separate it from the remaining extra details, the field must start with the name of the country.)

[specimen voucher=KU:I:2943:29881] The proper format for this is [specimen voucher=institution code:collection code:specimen_id]. If you have multiple vouchers just repeat the specimen voucher field (i.e., [specimen voucher=KU:I:2943:29881] [specimen voucher=KU:IT:T1682]) (the first is the formalin preserved voucher, the second represents the tissue sample number). Institution names and codes can be found at http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/institution_codes.txt. This will allow GenBank to directly link to the voucher specimen in online versions of the museum catalog when available.

For a more phylogenetic dataset I use something like this for the country field.

[Country=Australia: SA, Bray Drain, Robe Naracoorte Rd, Pop. 1]

The full range of modifier options can be found at https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html#modifiers
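For anyone who would rather script this than use Excel, the fields above can be strung together with a small Python function. This is a sketch under assumptions: the function and parameter names are my own, and the modifier tags are simply copied as written in the list above, so check them against the NCBI modifier page before relying on them.

```python
def bankit_defline(seq_id, organism, isolate, country,
                   location="mitochondrion", vouchers=()):
    """Build one fasta definition line with bracketed source modifiers."""
    parts = [
        f">{seq_id}",
        f"[organism={organism}]",
        "[molecule=DNA]",
        "[moltype=genomic]",
        f"[location={location}]",
        f"[Isolate={isolate}]",
        # The country must come first, followed by a colon and any locality details.
        f"[Country={country}]",
    ]
    # Repeat the specimen voucher field once per voucher.
    parts += [f"[specimen voucher={v}]" for v in vouchers]
    return " ".join(parts)


print(bankit_defline(
    "Aaffi.Fly.1",
    "Amniataba affinis",
    "Aaffi.Fly.1",
    "Papua New Guinea: Fly R.",
    vouchers=["KU:I:2943:29881"],
))
```

The sequence itself goes on the line(s) after each definition line, as in any fasta file.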

These are all manipulated in excel and then added after each sequence name. I usually create an excel sheet with details from my locality table and use the concatenate function to create the entries. Here is an example of how I make the organism field (for a phylogenetic study) using concatenate. The upper box shows the function formula, while cell E1 shows what that looks like. You can then copy and paste special (values) to the sheet with the sequences.

Thus your file should start to look like this, with columns D, H and I filled in from the excel sheet above.

The final version should look like this (note: the specimen voucher fields are not in the proper format).

Once you have the fields filled in, you can then re-sort the Excel file by column A, which puts everything back in the correct order. Select all the cells that have data in them (hit control-end, then shift-control-home to select them all, then shift-right arrow [one click] to remove column A from the selection). Hit control-c to copy it and then go to Word, start a new document, and do a paste special, text only (see image below). Search and replace "^t>" with ">", then search and replace "^t" with " ", then search and replace two spaces with a single space until no more get replaced (see image below as to what it should look like now). Now save as plain text and give it the extension .fas (you may have to manually rename the file extension after closing it). Open that file in BioEdit or your favorite program to double check that the labels and sequences look good. Now your file is ready to import into BankIt.

It will look like this when first pasted into Word (note it's helpful to have the show characters option clicked so you can see tabs and spaces, hit control-shift-* to toggle that option)

Once you have searched and replaced it should look like this in Word.
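The whole Word and Excel round trip can also be done with a short script. The sketch below is one possible alternative, not the method used above: it assumes you have the modifier text for each sequence in a Python dictionary keyed by the sequence name (the names and modifier strings shown are made up), and it collapses repeated spaces in the same way as the final search and replace.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from fasta-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def annotate_fasta(text, modifiers):
    """Append each sequence's modifier string to its definition line."""
    lines = []
    for name, seq in parse_fasta(text):
        header = f">{name} {modifiers.get(name, '')}"
        # Collapse any runs of whitespace to single spaces.
        lines.append(" ".join(header.split()))
        lines.append(seq)
    return "\n".join(lines) + "\n"


fasta = ">Aaffi.Fly.1\nACGTAC\nGGTT\n"
mods = {"Aaffi.Fly.1": "[organism=Amniataba affinis]  [molecule=DNA]"}
print(annotate_fasta(fasta, mods), end="")
```

Write the returned text to a plain text file with the .fas extension and check it in BioEdit as described above.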

BankIt

Start at https://www.ncbi.nlm.nih.gov/WebSub/ and login.

In the first screen select "Sequence data not listed above (through BankIt)" as this will be the option that most people need, then click on start.

In the second screen click on start BankIt submission.

The next screen shows a series of tabs with each step starting with Contact. All your info should be there if you have done a previous submission, if not add it.

In the Reference tab add the sequence authors and the reference information if you have them.

Under Sequencing Technology I've only ever entered the option Sanger dideoxy sequencing. That's the only box I click on for this page.

The next screen is the Nucleotide tab. Enter the relevant details. I usually release the sequences Immediately After Processing. In Sequence(s) and Definition Line(s), under Molecule Type: I select Other Genetic: DNA for mitochondrial DNA, select genomic DNA for nuclear genes. I leave the other values as is. For Nucleotide Sequence Format I always choose Alignment. Note that for this option all sequences must be the same length in the alignment, if not then either fix it or use the other option. Note if you need to propagate features you must use Alignment as the option here.
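Since the Alignment option requires every sequence to be the same length, it is worth checking before you upload. A minimal Python sketch (the function name is my own; it expects a list of name and sequence pairs that you have already read from your file):

```python
def unequal_lengths(records):
    """Return (name, length) for every sequence whose length differs
    from that of the first sequence in the list."""
    if not records:
        return []
    expected = len(records[0][1])
    return [(name, len(seq)) for name, seq in records if len(seq) != expected]


# Hypothetical aligned records; the third is one column short.
aligned = [("s1", "ACG-T"), ("s2", "ACGTT"), ("s3", "ACGT")]
print(unequal_lengths(aligned))
```

An empty list means all sequences are the same length as the first one.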

If you have any errors you should try to fix them, examples might be mis-spelling of species names. If the names are correct, but not in their database, or they are undescribed species you can click on continue to keep going.

On the Set/Batch tab I usually choose Phylogenetic study or Population study.

In the Submission Category tab it will nearly always be Original.

There should be nothing to add under the Source Modifiers tab as that is all in the fasta file you uploaded.

It will then go to the Feature Propagation Option tab and ask if you want to use feature propagate. For a single intronless gene say no. For more complicated situations, where you have multiple genes in a fragment or introns present, select yes and see the instructions below.

Under the Features tab select Add features by completing input forms. That will pop up more options, I always use Coding Region (CDS) / Gene / mRNA. That will pop out an additional option, select providing intervals then click add. That will bring up a new page called Features (Detail). You have to know something about your gene here to fill this out. I add the name of the gene in the protein name and gene name fields lower down. Click on Accept, it will then ask you to specify the genetic code. Once you get through that screen you'll see the Features (Overview) screen. For simple genes with no introns you don't need to do anything there.

The next screen is the Review and Correct tab where you can click finish submission and you are done (if everything was correctly entered, GenBank staff will let you know if it isn't!).

If I have more sequences to add then I go back to the Nucleotide tab and start again from that point. It will give your submission a new number when you replace the fasta file. A box will pop up warning you that you will erase information you have added; click ok.

More complicated situations

For sequences that span multiple genes, or that have exons and introns you need to define the characteristics of the alignment.

The first thing to do is to figure out what your regions are! I usually grab a previously curated sequence from GenBank and use that as a guide. There are a couple of things that many pcrers may be unaware of though relative to exon/intron boundaries.

Exon/intron boundaries usually have a specific sequence: the last two bases of the exon are often AG and the first two bases of the intron are GT. The intron ends with an AG, and the next exon often starts with a G. This is not universal though. Note that when trying to figure out the codon positions in each exon, the exon/intron boundary is not related to codon position, that is, the boundary can be at codon position 1, 2 or 3. So the next exon may not start on a whole codon but on a partial one, with the two parts being joined when the introns are spliced out after transcription. To further complicate things, not all genes are translated from the start of the first codon!

The following two examples cover adjacent genes followed by exon/introns.

Adjacent genes

So under the Feature Propagation Option tab you selected yes to use feature propagate, then selected continue. For adjacent or overlapping genes with a continuous alignment, first select a sequence to add the features to (I usually choose the first one, assuming they are all complete sequences). New options will pop up: select Add features by completing input forms, select Coding Region (CDS) / Gene / mRNA and select Add CDS by providing intervals, then click on add. That brings you to a new screen. Under Nucleotide Interval Spans: select Specific Spans - specify nucleotide numbers within your sequence. (Use this if your sequences contain introns). [ignore the bit about introns--you need to use this option for overlapping/adjacent genes too!] Put in the values for the start and end of the gene and add the name of the gene into both boxes (for Protein Name and Gene Name). Assuming none of the other options are relevant to you, click on Accept; it will then ask you to specify the genetic code. Then repeat the steps above for each gene. Once you have added each region click accept; it will pop up a box checking that you have added all features, click ok. That will pop you to the Features tab; scroll down (it might be a long way) until you find the continue button and click it. Then it will take you to the final Review and Correct tab, assuming there are no errors found (good luck with that!).

One error I have crop up sometimes is there will be a premature stop codon. As far as I can tell the only way around this is to go to that individual sequence and modify it and put the last base of the gene at the end of the premature stop codon. Once you click on continue as per the last step above it will pop up the error. If you can't find the stop codon you have to go back to the Feature Propagation Option tab, scroll down and view the translation for that protein from the problematic sequence and look for the * (which indicates a stop codon). You can then select that specific sequence, modify the record, click continue and you are then at the final Review and Correct tab.
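If you want to locate the stop codon before going through BankIt's translation view, a short Python sketch like the following can help. It is only an illustration: the function name is my own, it knows just two sets of stop codons (the standard code and the vertebrate mitochondrial code), and it assumes you give it the position in the ungapped sequence where the reading frame starts.

```python
STOP_CODONS = {
    "standard": {"TAA", "TAG", "TGA"},
    "vertebrate_mito": {"TAA", "TAG", "AGA", "AGG"},
}


def first_stop(seq, start=1, code="vertebrate_mito"):
    """Return the 1-based position of the last base of the first in-frame
    stop codon, or None if there is none. Gaps are removed first."""
    stops = STOP_CODONS[code]
    s = seq.upper().replace("-", "")
    for i in range(start - 1, len(s) - 2, 3):
        if s[i:i + 3] in stops:
            return i + 3
    return None


# Made-up sequence: ATG AAA TAA GGG
print(first_stop("ATGAAATAAGGG"))
```

The value returned is the end of the stop codon, which is the position to use as the last base of the gene for that sequence.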

A gene with exons and introns

For a region of a gene with introns and exons first select a sequence to add the features to (I usually choose the first one assuming they are all complete sequences). New options will pop up, select Add features by completing input forms, for exons select Coding Region (CDS) / Gene / mRNA and select Add CDS by providing intervals, then click on add. That brings you to a new screen. Under Nucleotide Interval Spans: select Specific Spans - specify nucleotide numbers within your sequence. (Use this if your sequences contain introns). Put in the values for the start and end of the gene [NOTE: it doesn't use the values of your alignment, thus you need to take out all gaps to get the right numbers, but when it propagates it does so using the alignment values (go figure...)], add the name of the gene into both boxes (for Protein Name and Gene Name), also under Does gene interval extend beyond the intervals provided for the coding region? click on yes and give the total span of the gene region including introns and exons, then click on accept (assuming none of the other options are relevant to you). Then repeat the steps above, but select other (instead of Coding Region (CDS)) then intron. Once you have added each region click accept, it will pop up a box checking that you have added all features, click ok. That will pop you to the Features tab, scroll down (might be a long way) until you find the continue button and click it. Then it will take you to the final Review and Correct tab assuming there are no errors found (good luck with that!).
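Because the interval boxes want positions in the ungapped sequence rather than alignment columns, converting between the two by hand is error prone. A minimal Python sketch of the conversion (the function name is my own, and it assumes "-" is the only gap character):

```python
def ungapped_position(aligned_seq, column, gap="-"):
    """Convert a 1-based alignment column to the 1-based position in the
    same sequence with gaps removed. Returns None if that column is a gap."""
    if aligned_seq[column - 1] == gap:
        return None
    # Count the non-gap characters up to and including this column.
    return sum(1 for base in aligned_seq[:column] if base != gap)


# Made-up aligned sequence: column 5 is the third real base.
print(ungapped_position("AC--GT", 5))
```

Use the aligned version of the one sequence you selected to add the features to, since gaps differ between sequences.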
