Data file manipulation

By Peter Unmack

How do you take your sequence data and format it for analysis?

There are many different ways to go back and forth between different file formats. Here I describe my scheme. If you need to concatenate multiple data files, see the section at the end first. Most programs that I have examined for importing and exporting different file formats seem to introduce artifacts or problems that annoy me. The only one I’ve seen that comes close to being nice is the new version of PAUP. Irrespective of that, PAUP is also great to use for editing data files as it keeps all the data on a single line (whereas Word wraps it) and it handles copying and pasting and searching and replacing quite nicely.

It is good to develop a standard naming system for files. The list below is typical of what I have, although I usually end up with many more. I start with the taxon or group of interest, then the gene, and then the content of the file. You will potentially end up with a great many files for the different analyses that you do, and it is critical to be able to identify which is which. Inevitably you will find errors and have to go back and update or change files as well, so add something to the name to make that clear (e.g., final, fixed.july.7, etc.). Simply sorting by date may not work if you are actively working with several of the files.

birdshead.cb.final.meg            birdshead = group of interest, cb = gene abbreviation
birdshead.cb.meg                  MEGA format
birdshead.cb.fas                  fasta format
birdshead.cb.phy                  phylip format
birdshead.cb.nex                  nexus format
birdshead.cb.mt.nex               Modeltest generation file
birdshead.cb.model.scores         output from PAUP for obtaining model scores
birdshead.cb.model.scores.out     output from Modeltest containing the model scores
birdshead.cb.ml.nex               ML analysis file
birdshead.cb.ml.tree.nex          ML tree file
birdshead.cb.mlb.nex              ML bootstrap analysis file
birdshead.cb.mlb.tree.nex         ML bootstrap tree file
birdshead.cb.mp.nex               MP analysis file
birdshead.cb.mpb.nex              MP bootstrap file
birdshead.cb.mpb.tree.nex         MP bootstrap tree file
birdshead.cb.each.spp.group.meg   MEGA file with species groupings
birdshead.s7.meg                  S7 = gene abbreviation
birdshead.comb.fas                combined file with both genes
birdshead.comb.phy                combined file with both genes

Start with fasta format

My preference is to start with a fasta format file and convert that to various other formats.

I use BioEdit for managing my sequence data. Which program you use shouldn’t matter as long as it creates fasta files in the same format as presented here. BioEdit saves fasta formatted files with the sequence name on the first line and the entire sequence on the second line (with no hard returns until the end of the sequence). You must have the data in this format for these instructions to work!

Open the fasta file in BioEdit and check that the sequences are all aligned and ok. Save the file if you make any changes. Be sure that BioEdit is saving it in fasta format too (e.g., use Save As rather than Save the first time you save it). BioEdit does not automatically change file extensions; you have to do that manually when you save the file.
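
If you want to confirm that a file really is in this two-line layout before going further, a small script can do the check. The following is only a minimal sketch in Python (not part of BioEdit or of the workflow described here); the function name check_two_line_fasta is made up for illustration, and it assumes the whole file has already been read into a string.

```python
def check_two_line_fasta(text):
    """Check fasta text that should have one name line followed by exactly
    one sequence line per record. Returns (names, problems)."""
    lines = [ln.rstrip("\r") for ln in text.split("\n")]
    # Ignore empty lines at the very end of the file only.
    while lines and lines[-1] == "":
        lines.pop()

    names, lengths, problems = [], [], []
    if len(lines) % 2 != 0:
        problems.append("odd number of lines; a sequence is probably wrapped")

    # Walk the file two lines at a time: name line, then sequence line.
    for i in range(0, len(lines) - 1, 2):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith(">"):
            problems.append(f"line {i + 1}: expected a name starting with '>'")
            continue
        if seq == "" or seq.startswith(">"):
            problems.append(f"line {i + 2}: expected a sequence")
            continue
        names.append(header[1:].strip())
        lengths.append(len(seq))

    # Aligned sequences should all have the same length.
    if len(set(lengths)) > 1:
        problems.append("sequences differ in length; check the alignment")
    return names, problems
```

To use it, read the file (for example with open("birdshead.cb.fas").read()) and pass the text in; an empty problems list means the layout matches what the instructions here expect.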

Create MEGA file

Right click the fasta file and open it in Microsoft Word (or any editing software that will save a txt file). To create a MEGA file from the fasta file, search and replace in Word (control-h) for > and replace it with #. Make a mental note of how many replacements were made, as this is the number of sequences in the file (the NTaxa value in the header below).

Copy and paste this header (or one from one of your previous MEGA files) at the top of the file, and change the values (such as the title, NTaxa and NSites) to suit your situation.

#MEGA
!Title add a title if you like;
!Format
DataType=Nucleotide
NTaxa=408 NSites=1140
Identical=. Missing=N Indel=-
CodeTable=Vertebrate_mitochondrial;
!Domain=Data Property=Coding;

Save the file as a txt file with the extension meg. Make sure that your version of Windows is set to show file extensions ( http://windows.microsoft.com/en-US/windows-vista/Show-or-hide-file-name-extensions ) so that you can see if the extension is correct, as the software will often put another extension on the file that you cannot see. Close the file in Word, double click it and it should open in MEGA. MEGA will prompt you if there are any errors so you can fix them.
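
The same search and replace can also be scripted. This is a rough Python sketch of the steps above, not a tool used in this workflow: the function name fasta_to_mega and its parameters are hypothetical, the header is copied from the example above, and it assumes two-line fasta text with aligned sequences so that NSites can be taken from the first sequence.

```python
def fasta_to_mega(fasta_text, title="add a title if you like",
                  code_table="Vertebrate_mitochondrial"):
    """Turn two-line fasta text into MEGA-format text."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]

    # The number of '>' lines is the number of replacements, i.e. NTaxa.
    ntaxa = sum(1 for ln in lines if ln.startswith(">"))
    seqs = [ln for ln in lines if not ln.startswith(">")]
    # Assumption: sequences are aligned, so the first length is NSites.
    nsites = len(seqs[0]) if seqs else 0

    header = [
        "#MEGA",
        f"!Title {title};",
        "!Format",
        "DataType=Nucleotide",
        f"NTaxa={ntaxa} NSites={nsites}",
        "Identical=. Missing=N Indel=-",
        f"CodeTable={code_table};",
        "!Domain=Data Property=Coding;",
        "",
    ]
    # Replace the leading '>' of each name line with '#'.
    body = ["#" + ln[1:] if ln.startswith(">") else ln for ln in lines]
    return "\n".join(header + body) + "\n"
```

Write the returned text to a file ending in .meg and open it in MEGA as described above; MEGA remains the final check on whether the header values are right for your data.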

Create phylip file

I use scripts written by Simon Berger available from https://github.com/sim82/ruby_tools/tree/master/lib that use ruby. Ruby can be obtained from https://www.ruby-lang.org/en/. For Windows the installer is found at https://rubyinstaller.org/. I install ruby in the directory C:\ruby (from memory you should avoid installing it in directories with spaces in the name). In the ruby directory I create a shortcut to a command prompt that opens in this directory. Right click in the ruby directory, new, shortcut, enter cmd, hit next, hit finish. Right click on the new shortcut, select properties, change start in to c:\ruby, hit ok and you are done.

Make a copy of the fasta file and paste it into the ruby directory (C:\ruby). Double click on the file listed as cmd.exe in your ruby directory which will open a command prompt in the ruby directory. I use two scripts Simon wrote that convert to and from fasta to phylip. The scripts are called fasta_to_phy.rb and phy_to_fasta.rb.

Type ruby fasta_to_phy.rb infile.fas outfile.phy (change infile and outfile to your file names). Note that you can use tab completion to save typing: type ruby fa and hit tab to complete the script name, add a space, type the first couple of letters of your data file and hit tab to complete the input file name, then add a space and do the same again for the output file name, backspacing to change fas to phy.

Move the outfile.phy file to your working directory of analysis data files.

Note that this creates a relaxed phylip format. Strict phylip format only allows ten characters for the sequence name (which is why almost none of the other converters I have ever seen work as they only export the strict format). Essentially the script puts the sequence name and the sequence all on one line with the correct spacing between the sample name and the sequence so that they are all aligned.
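
To make the relaxed phylip layout concrete, here is a minimal Python sketch that produces the same kind of output. It is not Simon Berger's script and has not been compared against it; the function name fasta_to_relaxed_phylip is hypothetical, and replacing spaces in names with underscores is an assumption made here because phylip uses whitespace to separate the name from the sequence.

```python
def fasta_to_relaxed_phylip(fasta_text):
    """Convert two-line fasta text to relaxed (long-name) phylip text."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected one name line and one sequence line each")

    records = []
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">") or seq.startswith(">"):
            raise ValueError("expected alternating name and sequence lines")
        # Assumption: whitespace inside a name becomes an underscore.
        name = "_".join(header[1:].split())
        records.append((name, seq))
    if not records:
        raise ValueError("no sequences found")

    nsites = len(records[0][1])
    if any(len(seq) != nsites for _, seq in records):
        raise ValueError("sequences are not all the same length")

    # Pad every name to the longest name plus two spaces so that all
    # sequences start in the same column.
    width = max(len(name) for name, _ in records) + 2
    out = [f"{len(records)} {nsites}"]
    out += [name.ljust(width) + seq for name, seq in records]
    return "\n".join(out) + "\n"
```

The first output line holds the number of taxa and the number of sites, and each following line holds one full name and its whole sequence, which is the layout described above.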

Create nexus file

Right click and open the phylip file just created as described above in Word.

The first number in the file is the number of taxa, the second is the number of base pairs.

Paste the following header at the top of the file.

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=38 NCHAR=1141;
FORMAT DATATYPE=DNA MISSING=N GAP=-;
MATRIX

Replace the numbers with the values in the phylip file (from the original first line of the file). Then delete the original first line. Go to the end of the file and add ;end; then choose File, Save As and change the file extension to .nex.
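
These edits are mechanical enough to script. The sketch below is one possible Python version, assuming a relaxed phylip file whose first line is the number of taxa followed by the number of sites; the function name relaxed_phylip_to_nexus is made up for illustration, and the header is the one shown above.

```python
def relaxed_phylip_to_nexus(phylip_text):
    """Wrap relaxed phylip text in a simple nexus DATA block."""
    lines = [ln.rstrip() for ln in phylip_text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty phylip file")

    # The original first line holds the number of taxa and of sites.
    ntax, nchar = lines[0].split()[:2]

    header = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        "FORMAT DATATYPE=DNA MISSING=N GAP=-;",
        "MATRIX",
    ]
    # Drop the original first line, keep the name/sequence rows unchanged,
    # and close the matrix and the block at the end of the file.
    return "\n".join(header + lines[1:] + [";end;"]) + "\n"
```

Save the result with the extension .nex; the name and sequence rows are passed through untouched, so any problem in the phylip file will carry over.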

How do I concatenate multiple genes into a single data file?

You first have to combine the two fasta files, which creates an interleaved fasta file, and then convert that to a non-interleaved file. The first trick is to ensure that all sequences are in the same order in both files and have identical names. I usually have a list in Excel with the correct labels I want to use in my final tree (if you can figure out the names you want in your final tree first then you will save yourself considerable time relabeling them when you have to rerun your final tree). You can either arrange each file by title manually (if you want a specific order), or you can sort each file alphabetically by title in BioEdit (Sequence > Sort > By Title), which you can also do for your list in Excel. If you use consistent standard sequence names then this will be much easier to do. Either delete sequences that are not present in both datasets, or add a sequence consisting of missing data for the sequences that are missing. I then select all the sequence names in BioEdit (Edit, Select All Sequences, or control-shift-a), copy them to the clipboard (Edit, Copy sequence titles) and paste them into Excel. Do this for the second fasta file, pasting into the adjacent column.
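
The side-by-side comparison of names can also be done with a few lines of code instead of Excel. This is only an illustrative Python sketch; read_names and compare_names are hypothetical helper names, and each argument is assumed to be the full text of one fasta file.

```python
def read_names(fasta_text):
    """Return the sequence names (without '>') in file order."""
    return [ln[1:].strip() for ln in fasta_text.splitlines()
            if ln.startswith(">")]


def compare_names(fasta_a, fasta_b):
    """Report names found in only one file and whether the order matches.

    Returns (only_in_a, only_in_b, same_order)."""
    names_a, names_b = read_names(fasta_a), read_names(fasta_b)
    only_in_a = sorted(set(names_a) - set(names_b))
    only_in_b = sorted(set(names_b) - set(names_a))
    # The files are ready to combine only if the names match one for one.
    same_order = names_a == names_b
    return only_in_a, only_in_b, same_order
```

Names listed as present in only one file are the ones to delete or to fill with a missing-data sequence, as described above, before the two files are combined.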

Once you are sure everything is in the correct order, paste the new sequence names from Excel over the sequence names in both fasta files in BioEdit (Edit, Paste Over Titles). You must be very precise when you do this though! I always keep a copy of the file with both sequence names (the original and the new one), which you can create with Edit, Paste Onto Titles, which appends them together. Just add a unique character between the names to make it easy to remove either name should you wish to later (you could copy the names into Word, search and replace the unique character with a tab (a tab is denoted by ^t), paste into Excel, select the column of names you need and paste it back into BioEdit).

Save each fasta file (to a new file if you don't wish to overwrite your original files). Open each file in Word and then paste one of the files at the end of the other file. Leave a blank line between them (this makes it easier to know where one gene ends and another starts). Save it as a txt file (change the extension to fas).

Open this file in DataConvert. This is a small program produced by David McClellan’s lab. I couldn’t find it last time I searched online, but I have a copy here you can download. Open DataConvert, click on fasta and be sure that the option of entire sequence on one line is selected. Click on open file and select your fasta file. Make sure the file type in the bottom left is set to fasta, then hit write files. This will generate a fasta file with the sequences combined into a single entry per sequence name. Open this file in Word and search for tabs and replace them with nothing (a tab is denoted by ^t) and save it. Follow the instructions above to then convert this to MEGA, phylip and nexus. Despite many options being available in DataConvert for various formats, they do not work properly for many applications, as it uses tabs rather than spaces to separate sequence names from sequences.
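
If DataConvert is not available, the joining step itself is simple to sketch in code. The Python below is a stand-in for illustration, not a description of how DataConvert works; the function names are hypothetical, and it assumes both inputs are two-line fasta text with identical names in the same order.

```python
def read_two_line_fasta(text):
    """Return a list of (name, sequence) pairs from two-line fasta text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return [(h[1:].strip(), s) for h, s in zip(lines[0::2], lines[1::2])]


def concatenate_genes(fasta_a, fasta_b):
    """Join the sequences of two genes into one entry per sequence name."""
    rec_a = read_two_line_fasta(fasta_a)
    rec_b = read_two_line_fasta(fasta_b)

    # Refuse to combine unless names and order agree exactly, since a
    # mismatch would silently join sequences from different samples.
    if [n for n, _ in rec_a] != [n for n, _ in rec_b]:
        raise ValueError("sequence names or order differ between files")

    return "".join(f">{name}\n{seq_a}{seq_b}\n"
                   for (name, seq_a), (_, seq_b) in zip(rec_a, rec_b))
```

The output is again two-line fasta, with the first gene followed directly by the second on the sequence line, so it can go through the same MEGA, phylip and nexus steps as a single-gene file.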
