How to phase data with DnaSP

By Peter Unmack

Phasing nuclear sequence data can be a real headache. Fortunately, the program PHASE is built into DnaSP, which makes things somewhat easier. Of course, as with any software, you can still generate output from inappropriate input! PHASE assumes that your sample contains all of the alleles found in a population, so the more individuals you sample per population, the more accurate your phased alleles will be.

PHASE will try to assign bases to any missing data, so you should either change Ns to gaps or remove the columns that contain Ns before running it. Only the PHASE option will work when a position has more than two variants in the dataset (fastPHASE and HAPAR will not). DnaSP will also truncate OTU names at 17 characters in the files it outputs.
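The clean-up described above can be done in a text editor, but a short script is less error-prone for large alignments. The following is only a sketch in Python, assuming the input is an aligned FASTA file; the function names (read_fasta, ns_to_gaps, drop_n_columns, truncation_clashes) are made up for illustration and are not part of DnaSP.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def ns_to_gaps(records):
    """Option 1: replace every N with a gap (sequences are upper-cased)."""
    return [(name, seq.upper().replace("N", "-")) for name, seq in records]


def drop_n_columns(records):
    """Option 2: remove every alignment column that contains an N."""
    seqs = [seq.upper() for _, seq in records]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences are not aligned to the same length")
    keep = [i for i in range(length) if all(s[i] != "N" for s in seqs)]
    return [(name, "".join(s[i] for i in keep))
            for (name, _), s in zip(records, seqs)]


def truncation_clashes(records, limit=17):
    """Names that become identical once cut to `limit` characters."""
    counts = Counter(name[:limit] for name, _ in records)
    return sorted(n for n, c in counts.items() if c > 1)
```

Running truncation_clashes on your names before loading the file shows which OTUs would become indistinguishable after the 17-character cut, so you can rename them first.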

Open DnaSP

File > Open Unphase/Genotype Data File

That will pop up a quick analysis of your data that looks like this.

Input Data File: C:\...\tricho.rag1.cons4.fas

Number of individuals: 179

Number of sequences: 358

Number of sites: 941

Invariable (monomorphic) sites: 771

Number of polymorphic (segregating) sites: 63

Number of polymorphic (segregating) sites with more than two variants: 1

Number of positions with gaps: 18

Number of positions with missing data: 89

fastPHASE and HAPAR does not accept data files with more than 2 variants per position.

First position with more than two variants: 765

Note that position 765 needs to be "fixed" before fastPHASE or HAPAR can run, because it has more than two variants. PHASE itself will accept the file as it is.
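If you want to find such positions yourself before loading the file, the check is easy to script. This is a rough sketch in Python, assuming aligned sequences in which heterozygous sites are written as two-base IUPAC ambiguity codes; the function name multiallelic_sites is hypothetical, and DnaSP may count variants differently.

```python
# Two-base IUPAC ambiguity codes, expanded to the bases they stand for.
IUPAC_TWO_BASE = {"R": "AG", "Y": "CT", "S": "CG",
                  "W": "AT", "K": "GT", "M": "AC"}


def multiallelic_sites(seqs):
    """Return 1-based positions at which more than two bases occur."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences are not aligned to the same length")
    sites = []
    for i in range(length):
        bases = set()
        for s in seqs:
            c = s[i].upper()
            if c in "ACGT":
                bases.add(c)
            elif c in IUPAC_TWO_BASE:
                bases.update(IUPAC_TWO_BASE[c])
            # gaps, Ns and any other symbols are ignored
        if len(bases) > 2:
            sites.append(i + 1)
    return sites
```

Gaps and Ns are deliberately ignored here, so a column is only reported when three or four actual bases are present.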

When I run the file, I usually use PHASE with the default settings, although I am not really sure whether those are the best choice.

DnaSP will create a directory called DnaSPhase in the same directory as the file you are analyzing, and all of the output is placed there. It creates several files, one of which holds the main output with all the relevant details. That file usually appears to have the same name as your original input file. The second-to-last section of that file starts with "Haplotype estimates for each individual". This is the section you need in order to figure out what the phased alleles are. The output looks like this for an individual that had two heterozygous sites:

0 #AM9753.1.cons.p2

= = = = = = = = = = = = = = = = = (A) = = = = = = A = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

= = = = = = = = = = = = = = = = = (G) = = = = = = G = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

The parentheses indicate that PHASE was less certain about which base goes with which allele at that site. The section following this one provides the probability values for each base call.
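With many individuals it can help to pull the heterozygous sites out of these rows automatically. The following is a minimal sketch in Python that assumes the layout shown above (two whitespace-separated rows per individual, "=" where no base is printed, parentheses around less certain calls). The function name parse_haplotype_pair is hypothetical, and the index it returns is the column within the row, not necessarily the position in your alignment.

```python
def parse_haplotype_pair(row_1, row_2):
    """List the sites where a base is printed for an individual.

    Returns tuples of (column, base_in_row_1, base_in_row_2, uncertain),
    where `column` is the 1-based token index within the row and
    `uncertain` is True if either base was printed in parentheses.
    """
    tokens_1, tokens_2 = row_1.split(), row_2.split()
    if len(tokens_1) != len(tokens_2):
        raise ValueError("the two rows have different numbers of columns")
    sites = []
    for column, (a, b) in enumerate(zip(tokens_1, tokens_2), start=1):
        if a == "=" and b == "=":
            continue
        uncertain = a.startswith("(") or b.startswith("(")
        sites.append((column, a.strip("()"), b.strip("()"), uncertain))
    return sites
```

Any site flagged as uncertain is worth checking against the probability values in the following section of the output before you trust the call.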

Before you close the program, be sure to export your data via save/export data as …! That will output a new version of your data file, with each individual represented by two allele sequences. DnaSP also adds a comment to the first OTU, which I remove. If you forget to export the data file, you'll have to go back to your original data and manually change the heterozygous positions based on the PHASE output (or run it again). If you do this, be extremely careful to keep track of which base goes with which allele!
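If you do end up editing alleles by hand, a quick consistency check can catch typos. This is only a sketch in Python, assuming the original sequence codes heterozygous sites with two-base IUPAC ambiguity codes; the function name allele_mismatches is hypothetical.

```python
# Bases implied by each genotype symbol (letters in alphabetical order).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG",
         "W": "AT", "K": "GT", "M": "AC"}


def allele_mismatches(genotype, allele_1, allele_2):
    """Return 1-based positions where the alleles contradict the genotype."""
    if not (len(genotype) == len(allele_1) == len(allele_2)):
        raise ValueError("sequences differ in length")
    bad = []
    triples = zip(genotype.upper(), allele_1.upper(), allele_2.upper())
    for pos, (g, a, b) in enumerate(triples, start=1):
        expected = IUPAC.get(g)
        if expected is None:
            continue  # gaps, Ns and other symbols are not checked
        if "".join(sorted({a, b})) != expected:
            bad.append(pos)
    return bad
```

This only confirms that the right pair of bases is present at each site. It cannot tell you whether the bases ended up on the correct allele, so the phase itself still has to be checked against the PHASE output.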