How to make a haplotype table and dataset?

By Peter Unmack

What follows is neither the quickest nor the simplest way of doing this, and it involves a lot of manual editing (which makes it prone to errors). I have never found a simple method that I like for doing this. MacClade comes close, but it only runs on a Mac, so I don't have easy access to it. The few programs I have tried this in (except MacClade) rename all of your sequences, making it difficult to know what they were originally. In MacClade, though, you have to select the option that preserves the original sequence names.

I usually print an NJ tree from MEGA with all individuals included. One way to reduce the dataset to haplotypes is to open the fasta version of the file in BioEdit and simply go through and remove any samples that are identical (based on the NJ tree). A simpler way is to run the file in RAxML, which will automatically create a reduced alignment with only unique samples (which can be converted back to fasta format using ruby). Note that different programs likely treat missing data and indels differently. RAxML is conservative (which is a better way to do it). Note that if a sequence has an N in it, it will be treated as being different, so you will still have to delete some of those individuals manually: run a tree, check whether any samples look identical in the tree, then look more closely at those sequences to see whether the differences are due to Ns or indels. Samples with indel differences should be kept in your dataset, since they are a different haplotype rather than just a sequencing ambiguity. Many programs treat missing data and gaps the same way when constructing trees. I don't know how other programs that you might use to create haplotype tables treat those characters, so be careful!
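The N-versus-indel rule described above can be sketched in a few lines of Python. This is only an illustration, not what RAxML or any other program does internally: the function names `compatible` and `collapse` and the sample names are made up, and the grouping is a simple greedy pass that compares each sequence to the first member of each existing group. Ns (and ?) are treated as missing data that match anything, while gaps are treated as real differences.

```python
def compatible(a, b, missing="N?"):
    """True if two aligned sequences differ only at missing-data sites.

    Gaps ('-') are NOT in `missing`, so an indel counts as a real difference.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return all(x == y or x in missing or y in missing
               for x, y in zip(a.upper(), b.upper()))


def collapse(seqs):
    """Greedily group sequences (dict of name -> aligned sequence).

    Returns a list of lists of names; each inner list is one haplotype.
    Each new sequence is compared to the first sequence of each group.
    """
    groups = []  # each entry: (representative sequence, [sample names])
    for name, seq in seqs.items():
        for rep, names in groups:
            if compatible(rep, seq):
                names.append(name)
                break
        else:
            groups.append((seq, [name]))
    return [names for _, names in groups]


# Hypothetical toy alignment
alignment = {
    "fishA": "ACGTACGT",
    "fishB": "ACGTACGT",  # identical to fishA
    "fishC": "ACGTNCGT",  # differs from fishA only by an N
    "fishD": "ACGT-CGT",  # indel difference: kept as its own haplotype
    "fishE": "ACGTACGA",  # real substitution
}
haplotypes = collapse(alignment)
for number, names in enumerate(haplotypes, start=1):
    print(number, names)
```

Because a sequence with many Ns can be compatible with more than one haplotype, the result of a greedy pass like this can depend on the order of the samples, so any sample grouped only because of Ns still deserves a manual look.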

I usually only rename my sequences as haplotypes during the final step. First I create the final tree(s), rotate branches, make any format changes and bring the graphic file of the tree into PowerPoint. When I convert sample names to haplotype numbers I start at the top of the tree and go down in order (it is easier for the reader to find a specific haplotype if they are in order). I usually make three versions of the tree graphic: one with only the original labels, one with both the original label and the new label, and one with only the new label. I do this because it can be a real pain to figure out which sample a specific haplotype is, and I seem to have to figure this out multiple times for every dataset I generate. I also re-label sample names in the original haplotype data file in BioEdit so that they have both the original sequence name and the haplotype number, separated by a unique special character (this helps me when I submit sequences to GenBank, so I can match the haplotype to the specific individual that sequence is from). Once the haplotype numbers are established I make a table in Excel and simply count the number of individuals present for each haplotype. Be sure to double check all numbers. I always check that the number of samples I sequenced, counted independently, matches the total number of individuals listed across the haplotypes for that population. Likewise, it is good to check that the number of individuals with a certain haplotype matches what is in your tree too. Because this is all done manually you will definitely make mistakes, I guarantee it! So be sure to check things in multiple ways and/or multiple times.
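The counting and cross-checking described above could also be scripted, which takes out some of the manual error. The following is a minimal Python sketch rather than a finished tool: the sample names, population names, haplotype numbers and the "|" separator are all invented for illustration, and the `sequenced` totals stand in for whatever independent record you keep of how many individuals were sequenced per population.

```python
from collections import Counter

# Hypothetical records: (original sample name, population, haplotype number)
samples = [
    ("fishA", "CreekOne", 1),
    ("fishB", "CreekOne", 1),
    ("fishC", "CreekOne", 2),
    ("fishD", "CreekTwo", 2),
    ("fishE", "CreekTwo", 3),
]

# Individuals sequenced per population, taken from an independent record
sequenced = {"CreekOne": 3, "CreekTwo": 2}

# Combined labels: original name plus haplotype number ("|" is an assumption)
labels = {name: f"{name}|H{hap}" for name, _, hap in samples}

# Count individuals for every (population, haplotype) pair
counts = Counter((pop, hap) for _, pop, hap in samples)
populations = sorted({pop for _, pop, _ in samples})
haplotypes = sorted({hap for _, _, hap in samples})

# One row per population, one column per haplotype
table = {pop: [counts[(pop, hap)] for hap in haplotypes] for pop in populations}

print("population\t" + "\t".join(f"H{h}" for h in haplotypes) + "\ttotal")
for pop in populations:
    row = table[pop]
    print(pop + "\t" + "\t".join(map(str, row)) + f"\t{sum(row)}")

# Cross-check: row totals must equal the independently recorded sample sizes
mismatches = {
    pop: (sum(table[pop]), sequenced.get(pop))
    for pop in populations
    if sum(table[pop]) != sequenced.get(pop)
}
if mismatches:
    print("Check these populations (table total, sequenced):", mismatches)
```

The tab-separated output pastes directly into Excel, and an empty `mismatches` means every population total agrees with the independent count. It does not replace checking the counts against the tree.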
