A Guide to Sequence Data Submission

By Peter Unmack

Many journals now require that your data be deposited in one to three online databases: GenBank, Dryad and TreeBase. The following provides brief descriptions of Dryad and TreeBase submissions and detailed instructions for GenBank. If you find any errors or have suggestions for improvements, please email me.

Dryad submission

Dryad submissions are extremely straightforward, quick and easy. Use Dryad to submit the sequence dataset(s) that you analyzed in your paper. It is especially important to submit datasets if you have aligned your sequences in any way (GenBank does not preserve any aspect of your alignment), if you used phased data, or if you collapsed your sequences to haplotypes (e.g., it is nice to submit a file with all of your individual sequences as well as one with only the haplotypes). If you created an Arlequin file, or anything else that takes a bit of time and effort to create, then this should also be uploaded to Dryad.

The easiest way to submit to Dryad is to wait until your paper is accepted, then ask the journal to populate a submission for you (not all journals do this, but many can). That way you don't have to enter any of the bibliographic information. There are three stages of submission. Page 1 is the bibliographic information. On Page 2 you upload a data file, provide a short description and, for more complicated datasets, provide a readme file. On Page 3 you can upload another dataset, which sends you back to a new copy of Page 2. Once finished, you submit your files, which will be reviewed and approved within a couple of days. Once approved you will get a DOI that you should reference in the final version of your paper. I usually add the DOI at the final proof stage.

Some Dryad partner journals have chosen to allow reviewers access to the data during the article's peer review process. In such cases the author could be asked to deposit their data in Dryad at the time they submit their article for review. If their article is not accepted by the journal, their data files can remain in Dryad for a year, and could be associated with an article submitted to or accepted by another journal, saving them the need to re-upload. Diagrams of both Dryad's basic and review workflows are at http://wiki.datadryad.org/Submission_Integration. These workflows pertain to journals that have implemented manuscript submission with Dryad.

If you have questions about any aspects of Dryad then check their FAQs for depositors at http://www.datadryad.org/pages/faq#depositing and/or send questions about Dryad deposition to help@datadryad.org.

TreeBase submission tips

More and more journals require you to submit any trees from your article to TreeBase. This is more complicated than Dryad but still relatively easy, although the online instructions are a bit thin. They have some instruction videos that are helpful in explaining the process, but it takes a bit of time to watch them all. Once you have done a couple of submissions and become familiar with the process, submission gets much quicker.

To format your data they suggest using Mesquite, and the process is relatively easy. You simply upload your nexus file (it is easiest to use a nexus formatted data file rather than phylip format, as the results from the import get messed up if you use sequence names longer than 10 characters in your phylip file). Then upload your tree file (nexus or newick format). Note that taxon names must match between the two.

The painful part is renaming your OTUs, as these should consist of the genus and species name. I usually add on additional details so that the data will match what is in the published tree (e.g., haplotype number, individual number, etc.). The key point when uploading is that if you already have names formatted as genus species, separate them with an underscore. Anything after the species name should be separated by something else (I use a period). But when you add taxon names in Mesquite, separate them with a space. If you use an underscore, Mesquite will put all taxon names in quotes, which will mess up the taxon name lookup within TreeBase (if this happens, just open the nexus file and search and replace all quotes).

If you have multiple files, such as different genes for the same group of taxa, be sure to give your data matrix and trees distinct names in Mesquite, as that will make your submission easier. The main quirk within TreeBase is that you must submit/save each time you finish entering data in a particular section, or else it does not get saved.
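The naming rules above can also be applied with a short script rather than by hand. The following is only a rough Python sketch, not part of the TreeBase or Mesquite tools: the function names are made up for illustration, and it assumes that the quotes Mesquite adds are single quotes.

```python
def treebase_label(genus, species, *extras):
    """Build an OTU label as Genus_species, with any extra details
    (haplotype number, individual number, ...) joined on with periods."""
    label = f"{genus}_{species}"
    for extra in extras:
        label += "." + str(extra)
    return label


def strip_quotes(nexus_text):
    """Remove every single quote from the text of a nexus file
    (assumption: these are the quotes Mesquite wrapped around taxon names)."""
    return nexus_text.replace("'", "")


def names_too_long_for_phylip(names, limit=10):
    """Return the sequence names that exceed the 10 character phylip limit."""
    return [name for name in names if len(name) > limit]


if __name__ == "__main__":
    print(treebase_label("Melanotaenia", "australis", "hap3", 12))
    print(strip_quotes("'Melanotaenia_australis' ACGT"))
    print(names_too_long_for_phylip(["Maust1", "Melanotaenia_australis_I"]))
```

Removing every quote is as blunt as the search and replace described above, so check the result before uploading.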

GenBank submission

All of your sequence data gets submitted to GenBank. Some people will submit every individual they sequence, others will submit haplotypes. Note that if you phase your data, you should submit unphased data to GenBank.

It is important to note that there are many ways of doing what I describe below, both in terms of preparing fasta files as well as things within Sequin. Sequin is a rather cryptic tool. I have no idea what at least 90% of the "stuff" in there does or means, even after reading the help files. Hopefully you find my guide to be helpful and that it makes the submission process less painful.

Fasta file preparation.

The idea here is to enter as much information as possible in order to do as little as possible in sequin. For a single gene submission, all I do in Sequin is set the genetic code and add the bibliographic information. Everything else is pre-entered as per the instructions below. Most of the instructions below are fairly detailed (which makes the instructions long), but once you have used this method a couple of times you will not need to refer to most of these details.

If you have multiple non-adjacent genes (i.e., not continuous stretches of sequence data), then create a separate fasta file for each one. If they are continuous then create a single fasta file.

The OTU names you use can be tricky in Sequin in some situations. If you have multiple individuals within a species it is often better to use the isolate code rather than the species name, as Sequin gets confused and may think they all have the same label. For example, Melanotaenia_australis_I and Melanotaenia_australis_II will confuse Sequin and produce an error which messes up your submission.

How you label your initial file will depend on your circumstances and how much information you wish to include with your sequences. While most journals require that sequences are deposited in GenBank, unfortunately this does not mean that they must be clearly referable to what is in your paper. To be useful, your sequence name (OTU code) should match whatever is used in your published tree, haplotype network, etc. In some cases this may be tricky (e.g., if multiple OTUs in your tree have the same label), but most sequence names should be easily related to what is in your paper.

I minimally try and include the sequence name, my own DNA code for that specific sample (which is for my own benefit), locality data (from what I provide in the paper) and any museum specimen codes. I provide those details irrespective of whether it is a phylogenetic dataset based on species names or phylogeographic dataset based on haplotype numbers.

These steps are all made easier if you rename your datafile prior to making your tree, that way the names will already be the same, or very close to it (plus if you have to rerun your analysis then you don't have to re-label all of your OTUs). In the process of renaming OTUs I always save a version which has both the original label (for which I use the DNA sample code) and the final label. This makes the GenBank submission much easier. This can easily be done for haplotype datasets if you use MacClade which when collapsing a dataset to unique haplotypes has an option to include the original OTU name with the new haplotype name.
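If you do not have MacClade, collapsing to haplotypes while keeping track of the original OTU names can be sketched in a few lines of Python. This is an illustrative sketch only, not MacClade's method: it treats two sequences as the same haplotype only when they are exactly identical as text (ignoring case), so missing data, ambiguity codes and gaps are not handled, and the "Hap1, Hap2, ..." labels are an arbitrary choice.

```python
def collapse_to_haplotypes(records):
    """Group identical sequences into haplotypes.

    records: list of (original_name, sequence) tuples.
    Returns a list of (haplotype_label, [original_names], sequence),
    in order of first appearance.
    """
    groups = {}  # sequence -> list of original names (insertion ordered)
    for name, sequence in records:
        groups.setdefault(sequence.upper(), []).append(name)

    haplotypes = []
    for number, (sequence, names) in enumerate(groups.items(), start=1):
        haplotypes.append((f"Hap{number}", names, sequence))
    return haplotypes


if __name__ == "__main__":
    data = [("Aaffi.Fly.1", "ACGTAC"), ("Aaffi.Fly.2", "acgtac"), ("Aaffi.Fly.3", "ACGTAA")]
    for label, names, sequence in collapse_to_haplotypes(data):
        print(label, ",".join(names), sequence)
```

Keeping the list of original names next to each haplotype label gives you the "both labels" version of the file described above.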

I usually either first order sequences based on the species name or locality depending on the dataset (phylogeny vs. phylogeography). I find this is easy to do in the program BioEdit which can sort by sequence name or via dragging them or cutting and pasting sequences, plus you can easily paste over sequence names if you rename stuff from another source (like a species list in Excel). My experience is all based on BioEdit, but there is a chance that other programs may behave slightly differently when saving and opening fasta files that have extra information in the sequence name (which we add in before submitting to GenBank).

Open or paste your fasta file into Word. Search and replace ">" with ">^t". Select all (control-a), copy (control-c) and paste into Excel (control-v), then insert a column before the first column and number it from 1 to n. Select all, then sort by column B, then column A. This will separate your sequence names from the sequences and keep the sequence names in the order you originally had them in. You will need to cut and paste your sequence names from the bottom of the spreadsheet to the top. If you don't care about the order of the sequence names, then simply sort by column C, then column A and all the names will be at the top.
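The same separation of names from sequences can be done without Word and Excel. Here is a minimal Python sketch (the function name is hypothetical) that reads fasta text into a list of name and sequence pairs in their original order, which you could then write to a spreadsheet or edit directly.

```python
def read_fasta(text):
    """Parse fasta text into a list of (name, sequence) tuples.

    Sequence lines belonging to one record are joined together,
    and the original order of the records is kept.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">Aaffi.Fly.1\nACGT\nACGT\n>Aaffi.Fly.2\nACGA\n"
    for number, (name, sequence) in enumerate(read_fasta(example), start=1):
        print(number, name, sequence, sep="\t")
```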

I usually add the following definitions to the first row and then drag them (from their bottom right corner) to fill in the other rows. Obviously the organism, isolate, country and specimen voucher need to be changed to match each sample. This all involves manipulation in Excel using the concatenate function and copying and pasting stuff around. Note that in the country field the country must be followed by a colon if you include additional locality data.

[organism=Amniataba affinis]

[molecule=DNA]

[moltype=genomic]

[location=mitochondrion] or [location=genomic] (use the latter if you have nuclear DNA; the value is different for each genome type)

[Isolate=Aaffi.Fly.1] (that is my DNA code for that individual)

[Country=Papua New Guinea: Fly R.] (locality data, note that you must include the : after the country name to separate it from the remaining extra details.)

[specimen voucher=KU:I:2943:29881] The proper format for this is [specimen voucher=institution code:collection code:specimen_id]. If you have multiple vouchers just repeat the specimen voucher field (i.e., [specimen voucher=KU:I:2943:29881] [specimen voucher=KU:IT:T1682]) (the first is the formalin preserved voucher, the second represents the tissue sample number). Institution names and codes can be found at http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/institution_codes.txt. This will allow GenBank to directly link to the voucher specimen in online versions of the museum catalog when available.

For a more phylogenetic dataset I use something like this for the country field.

[Country=Australia: SA, Bray Drain, Robe Naracoorte Rd, Pop. 1]

The full range of modifier options is found at http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html
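As an alternative to building these bracketed fields by hand, a script can assemble them for each sample. The sketch below is only an illustration: the function names are made up, and the modifier names and example values are simply the ones listed above. Fields are passed as a list of pairs so that a modifier such as specimen voucher can be repeated.

```python
def country_field(country, locality=""):
    """Return the country value, adding the colon that must separate
    the country name from any extra locality details."""
    return f"{country}: {locality}" if locality else country


def definition_line(sequence_name, fields):
    """Build a fasta definition line: >name [key=value] [key=value] ...

    fields: list of (modifier, value) pairs; pairs with an empty value are skipped.
    """
    modifiers = " ".join(f"[{key}={value}]" for key, value in fields if value)
    return f">{sequence_name} {modifiers}".rstrip()


if __name__ == "__main__":
    fields = [
        ("organism", "Amniataba affinis"),
        ("molecule", "DNA"),
        ("moltype", "genomic"),
        ("location", "mitochondrion"),
        ("Isolate", "Aaffi.Fly.1"),
        ("Country", country_field("Papua New Guinea", "Fly R.")),
        ("specimen voucher", "KU:I:2943:29881"),
        ("specimen voucher", "KU:IT:T1682"),
    ]
    print(definition_line("Aaffi.Fly.1", fields))
```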

These are all manipulated in excel and then added after each sequence name. I usually create an excel sheet with details from my locality table and use the concatenate function to create the entries. Here is an example of how I make the organism field (for a phylogenetic study) using concatenate. The upper box shows the function formula, while cell E1 shows what that looks like. You can then copy and paste special (values) to the sheet with the sequences.

Thus your file should start to look like this, with columns D, H and I filled in from the excel sheet above.

The final version should look like this (note: the specimen voucher fields are not in the proper format).

Once you have the fields filled in, delete the blank cell before the sequence name, then re-sort the Excel file by column A, which puts everything back in the correct order. Save the Excel file in plain text format, then open it in Word. Search and replace "^t" with " ", then search and replace two spaces with a single space until no more get replaced. Now save as plain text and give it the extension .fas (you may have to manually rename the file extension after closing it). Open that file in BioEdit or your favorite program to double check that the labels and sequences look good. Now your file is ready to import into Sequin.
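The tab and double-space clean-up can also be scripted. This Python sketch (hypothetical function names) assumes the spreadsheet was saved as tab-delimited text with the numbering column already removed; it turns each run of tabs and spaces into a single space and removes any space left between ">" and the sequence name.

```python
import re


def clean_line(line):
    """Collapse runs of tabs/spaces to one space and tidy the '>' prefix."""
    line = re.sub(r"[ \t]+", " ", line).strip()
    if line.startswith("> "):
        line = ">" + line[2:]
    return line


def clean_text(text):
    """Clean every non-empty line of tab-delimited text exported from a spreadsheet."""
    cleaned = [clean_line(line) for line in text.splitlines()]
    return "\n".join(line for line in cleaned if line) + "\n"


if __name__ == "__main__":
    exported = ">\tAaffi.Fly.1\t\t[organism=Amniataba affinis]\t[molecule=DNA]\nACGTACGT\t\t\n"
    print(clean_text(exported), end="")
```

Write the result to a file ending in .fas and open it in BioEdit to check it, exactly as for the manual route.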

Sequin

First of all, check whether you have the newest version of the program. When you open Sequin, the first screen tells you the version you have. Go to http://www.ncbi.nlm.nih.gov/Sequin/index.html to see if that is still the current version.

The first few steps in Sequin are straightforward. Note that the authors and affiliation should be those who conducted the sequencing, not typically all of the authors of the publication. Once you have entered all of the data in the four tabs, go back to the first tab (submission) and export the contents to the Sequin program directory (File, export submitter info). This will save the contents from all four tabs to a text file, which saves you having to retype everything when you mess something up and have to go back to the start, or if you have multiple submissions for the same paper. Just be sure to be in the submission tab when you import the data. You can also export each individual tab, which can be handy when doing new submissions which contain the same data for any single tab (like affiliation or contact). It also helps to copy your fasta file to the Sequin directory to save you having to traverse your drive to find it. Just make sure that when you install a new copy of Sequin you save those files beforehand or install Sequin in a new directory.

After affiliation it will ask for what submission type, select the bottom option, "use the normal submission dialog."

In the next window for submission type I usually select phylogenetic study. For sequence data format I select alignment (FASTA+GAP…, etc.).

In the next screen hit "import nucleotide fasta." If this gives any errors then you may be better off fixing them in the original fasta file and starting over, as I've seen problems later in the process even when Sequin gives you the option to correct the error within Sequin (although I think you need to revalidate the file within Sequin after you correct the errors, which may be why I had problems, as I didn't do that). In the next tab (sequencing method) I select the top option, "sanger dideoxy sequencing." Do not enter anything under Assembly program as that is primarily for next generation sequencing submissions.

In the next screen, if you have mitochondrial data, go to the organism tab and click on the add organism, locations, and genetic codes button. This pops up another window; click on the genetic code button (above the white boxes) and change the genetic code to the appropriate mitochondrial code.

If you have a simple submission where your sequences consist of a single coding region, go to the annotation tab, click on CDS, click on incomplete at the 5' or 3' buttons if they are incomplete, add the protein name of your gene (I usually grab it from an existing GenBank entry if I am unsure) and gene symbol (cytb, S7, RAG1, etc.) and click on the box to prefix species name to title.

If you have a more complicated submission with multiple coding regions or introns/exons then under the annotation tab select none and click on the box to prefix species name to title. Sequin will pop up a warning box, but click “continue to record viewer.”

It will now switch screens and let you do some other stuff.

In the new screen, click on ALL SEQUENCES at the top (by Target Sequence)

It is a good idea to save a backup copy at this stage so that if you mess something up you can reopen that file, rather than start from the beginning. If you use save or save as it will create a file that you can reopen. If you choose export it cannot be opened in sequin.

To add journal information go Annotate, publications, publication descriptor. Select the top option (PHYLOGENETIC SET), hit Edit Old and add or change the relevant details. Note that this form is pre-filled with the details you entered in the first step of your submission. The authors will be different in most cases, so you will need to add the correct authors for the paper (the authors listed from before will be those responsible for creating the sequences). Again, it is a good idea to export these forms once all the fields are completed. If you export from the title tab, it will save the information from each tab as a single entity.

If you only have a single exon present then you are now ready to submit. If your sequences are more complicated, then skip the next three paragraphs and continue under the heading "more complicated situations."

Hit validate to check for errors. Most errors can be ignored, but it is a good idea to check whether any of them are something that you can fix.

Make sure that ALL SEQUENCES is selected, go file, prepare submission and save the file. If you have created an alignment, or have leading or trailing Ns, then do not remove it when asked! This file can be reopened later for further editing if needed. Once you remove the alignment though it is gone, you have to re-import it from scratch (or reopen a previously saved sequin file if you were smart enough to save a copy and not overwrite it).

Email the .sqn file(s) to gb-sub@ncbi.nlm.nih.gov and give a brief explanation of your data. If the species are not present in the GenBank database then you must include the appropriate lineage information too (usually I just get that for a similar taxon that is on GenBank already and just modify it). A sample email is provided at the end of these instructions.

More complicated situations

For sequences that span multiple non-adjacent genes, or that have exons and introns you need to define the characteristics of the alignment.

The first thing to do is to figure out what your regions are! I usually grab a previously curated sequence from GenBank and use that as a guide. There are a couple of things that many pcrers may be unaware of though relative to exon/intron boundaries.

Exon/intron boundaries usually have a specific sequence: the last two bases of the exon will be AG and the first two bases of the intron will be GT. The intron will end with an AG, and the next exon will start with a GT. This is not universal though. Note that when trying to figure out the codon positions in each exon, the exon/intron boundary is not related to codon position; that is, the boundary can be at codon position 1, 2 or 3. So the next exon may not start on a whole codon but on a partial one, with the two parts getting joined when the introns are spliced out after transcription. To further complicate things, not all genes are translated from the start of the first codon!
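Once you have candidate boundaries, a quick script can check them against the pattern described above and work out where in a codon the next exon begins. This is a hedged Python sketch with made-up function names and a made-up example sequence; coordinates are 1-based positions on the ungapped sequence, and a failed check only flags a boundary for a second look, since the pattern is not universal.

```python
def intron_starts_gt_ends_ag(sequence, intron_start, intron_end):
    """True if the intron (1-based, inclusive coordinates) begins with GT
    and ends with AG."""
    intron = sequence[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def codon_position_of_next_exon(exon_length, first_base_position=1):
    """Codon position (1, 2 or 3) of the first base of the following exon,
    given the codon position of the first base of this exon and its length."""
    return (first_base_position - 1 + exon_length) % 3 + 1


if __name__ == "__main__":
    exon1, intron1, exon2 = "ATGCAGA", "GTAAGTCCTTAG", "GTCTGA"
    sequence = exon1 + intron1 + exon2
    start = len(exon1) + 1
    end = len(exon1) + len(intron1)
    print(intron_starts_gt_ends_ag(sequence, start, end))
    # exon 1 is 7 bases long and starts on codon position 1,
    # so exon 2 starts on the second base of a codon.
    print(codon_position_of_next_exon(len(exon1)))
```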

There are multiple ways to define the characteristics of the alignment; what I present below is the simplest. You can alternatively provide alignment details in the submission process as follows, but that approach seems overly complicated to me:

Prepare and import annotation in a tabular format, known as FEATURE TABLE (FT). You can select the codon start in FT. FT allows you to add any complex annotation to a large number of sequences. The format of the feature table is explained on this page: http://www.insdc.org/documents/feature_table.html
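For anyone who does want to try the feature table route, the sketch below writes a small table from Python. It is a rough illustration only and rests on assumptions: it follows the five-column, tab-delimited layout as I understand it from NCBI's feature table documentation (a ">Feature" line with the sequence ID, then start, stop and feature key, then qualifier lines indented by three tabs, with "<" and ">" marking incomplete ends). The function name, sequence ID and coordinates are hypothetical, so check the output against the current NCBI documentation before importing it.

```python
def feature_table(sequence_id, features):
    """Return feature table text for one sequence.

    features: list of dicts with
      "key":        feature key, e.g. "gene" or "CDS"
      "intervals":  list of (start, stop) strings, e.g. [("<1", "250"), ("400", ">900")]
      "qualifiers": list of (qualifier, value) pairs
    """
    lines = [f">Feature {sequence_id}"]
    for feature in features:
        for index, (start, stop) in enumerate(feature["intervals"]):
            if index == 0:
                lines.append(f"{start}\t{stop}\t{feature['key']}")
            else:
                lines.append(f"{start}\t{stop}")  # further intervals of the same feature
        for qualifier, value in feature.get("qualifiers", []):
            lines.append(f"\t\t\t{qualifier}\t{value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    table = feature_table("Seq1", [
        {"key": "gene", "intervals": [("<1", ">900")], "qualifiers": [("gene", "RAG1")]},
        {"key": "CDS", "intervals": [("<1", "250"), ("400", ">900")],
         "qualifiers": [("product", "recombination activating protein 1"), ("codon_start", "2")]},
    ])
    print(table, end="")
```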

The simple method…

Once you know your region boundaries you are ready to proceed. How you do this depends on whether you have overlapping genes, exons and introns, adjacent genes, etc. The following two examples cover adjacent genes followed by exons/introns.

Adjacent genes

For adjacent genes with an identical alignment go Annotate, coding regions and transcripts, cds. Under the coding region tab select the protein tab and add the full name of the gene. Under the location tab add the coordinates for that specific gene and add any details about whether the gene is incomplete at either the 5' or 3' end; at the bottom of the window select update mRNA span on accept and retranslate on accept. If the sequences in the alignment start and end on the same base with no gaps then leave sequence coordinates selected. If there are gaps then make sure alignment coordinates is selected. Repeat this step for each gene in your sequences. You will see that under the first sequence Sequin has added a new line for each gene you added (it is suffixed with a number preceded by an underscore). If you have a tRNA, then go Annotate, coding regions and transcripts, mRNA, under the mRNA tab select tRNA, under Properties, General select Gene New, add the name under Gene Symbol, add the location coordinates (make sure to choose alignment coordinates if you have any gaps), then select Accept.
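The difference between sequence coordinates and alignment coordinates is easy to get wrong when gaps are present. As a small illustration (a Python sketch with a hypothetical function name, assuming "-" is the gap character), this converts a position in the ungapped sequence into the matching column of the aligned sequence.

```python
def sequence_to_alignment(aligned_sequence, position, gap_characters="-"):
    """Return the 1-based alignment column that holds the given
    1-based position of the ungapped sequence."""
    residues_seen = 0
    for column, character in enumerate(aligned_sequence, start=1):
        if character not in gap_characters:
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError("position is beyond the end of the sequence")


if __name__ == "__main__":
    aligned = "AC--GTAC"
    # The third base (G) sits in alignment column 5 because of the two gaps.
    print(sequence_to_alignment(aligned, 3))
```

With no gaps the two coordinate systems are identical, which is why sequence coordinates can be left selected in that case.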

Now you need to propagate these features to the rest of your sequences. Your first sequence should be selected as the target sequence. Select the new genes and CDS regions that you just defined by clicking in the window anywhere within that section of text. Hold the shift key to select multiple ones. A thick vertical black line will appear to the left. Go edit, feature propagate, click on selected feature and hit accept. To check that this propagated properly choose ALL SEQUENCES, then scroll down and the new CDS section should be included with every sequence.

Scroll down to the section called "ready to submit?"

A gene with exons and introns

When you have your entire sequence within a gene go Annotate, genes and named regions. Under the gene tab add the gene name to locus. Under the location tab select alignment coordinates and hit accept. Go Annotate, coding regions and transcripts, cds. Set the reading frame under the coding region tab (this defines which base within the codon the sequence starts on). Under the protein subtab add the name of the gene. Under the properties tab click to the right of the word gene to set it to whatever you named the gene earlier (not sure if I need to do this or not). Then under the location tab be sure to select alignment coordinates, as most datasets with introns will have gaps in them. Next set up the alignment coordinates for only the coding regions. For each region add the SeqID and Alignment. Add any details about whether the coding regions are incomplete at either the 5' or 3' end; at the bottom of the window select update mRNA span on accept and retranslate on accept. Once complete, hit accept.

Now you need to define each exon and intron region. Go Annotate, coding regions and transcripts, exon (or intron, depending on which your sequences start with). Under the exon tab add “exon 1” (or whatever number it is) to the number field. Under the properties tab click to the right of the word gene to set it to whatever you named the gene earlier (not sure if I need to do this or not). Under the location tab, make sure alignment coordinates is selected. Add the coordinates for that specific exon and add any details about whether the exon is incomplete at either the 5' or 3' end; at the bottom of the window select update mRNA span on accept and retranslate on accept. Repeat this step for each exon/intron in your sequences (I usually add them in order, i.e., exon 1, intron 1, exon 2, etc.).

Now you need to propagate these features to the rest of your sequences. Your first sequence should be selected as the target sequence. Select the new gene, CDS region and each intron and exon that you just defined by clicking in the window. Hold the shift key to select both. A thick vertical black line will appear to the left. Go edit, feature propagate, click on selected feature and hit accept. To check that this propagated properly choose ALL SEQUENCES, then scroll down and the new sections should be included with every sequence.

Ready to submit?

Now you should be ready to submit. Hit validate to check for errors. In my experience there are always warnings about BadStrucCommMissingField; just ignore them. From the Sequin validation error window you can double click on any issue it identifies and it should take you to the correct place to fix the error. In my experience it is usually something I have overlooked, like Ns at the start of a sequence that span an intron/exon boundary or something like that. Make sure though that you always select alignment coordinates should you change anything within that tab.

Make sure that ALL SEQUENCES is selected (if you have individual sequences selected it only exports those specific ones), go file, prepare submission and save the file. If you have created an alignment, or have leading or trailing Ns, then do not remove it when asked! This file can be reopened later for further editing if needed. Once you remove the alignment though it is gone, you have to re-import it from scratch (or reopen a previously saved sequin file if you were smart enough to save a copy and not overwrite it).

Do not use the export genbank command as that file is not suitable for submission, nor can you reopen it for modification in sequin.

Email the .sqn file(s) to gb-sub@ncbi.nlm.nih.gov and give a brief explanation of your data. If the species are not present in the GenBank database then you must include the appropriate lineage information too (usually I just get that for a similar taxon that is on GenBank already and just modify it). Here are two sample emails.

Email 1

G'day there

Please find attached a bunch of sequences for submission to GenBank. These are for a paper that is accepted and the sequences can be released once processed.

One of the species listed includes a subspecies, Craterocephalus stercusmuscarum fulvus

Several cytochrome b sequences (in fluv.cytb.sqn) have a premature stop codon. This has been verified as being accurate. It is not uncommon for cytb sequences to have premature stop codons in any of the last few codon positions.

The cytb sequences that lack the premature stop codon are incomplete at the 3' end.

ATPase 8 lacks a stop codon, it gets corrected later. You'll see that records for this gene state: /note="TAA stop codon is completed by the addition of 3' A, e.g., http://www.ncbi.nlm.nih.gov/nuccore/AP004420

It also looks like I forgot to append the species name to the title for the fluv.atp.sqn submission too, so if you are able to correct that for me it would be greatly appreciated.

Thanks

Email 2

G'day there

Please find attached a bunch of sequences for submission to GenBank. These are for a paper that is accepted and the sequences can be released once processed.

One cytochrome b sequence (in tera.cytb.sqn) has a premature stop codon. This has been verified as being accurate. It is not uncommon for cytb sequences to have premature stop codons in any of the last few codon positions. Note that cytb does not usually have a complete stop codon, it gets added during transcription (or translation, I forget the exact details).

The file tera.rag1.sqn consists of partial exon 1, complete exon 2 and partial exon 3. Intron 1 and 2 are complete for most species. I provided the alignment details. Any Ns in the introns are of unknown length, but any Ns in the exons are of known length.

The file tera.rag2.sqn contains only exon sequence, that gene has no introns.

It seems like there is some duplication in the alignment details for tera.rag1.sqn, if you are able to correct those for me it would be greatly appreciated.

Thanks
