Sequence data exploration and analysis using MEGA

By Peter Unmack

The goal here is to give you some exposure to raw sequence data, its assembly, and some basic aspects of how to analyze it. We will also look at how patterns in the data are reflected in the resulting trees. This exercise is written for version 6 of MEGA, but the steps involved are almost identical in earlier versions.

First download Mega 6, install it, then download the data files (seqs.zip) and unzip them.

Start the program Mega 6.

Within Mega select Align > edit/view sequencer files (Trace).

Go to the directory that has the chromatograms you downloaded, select all of them (select one, then hit control-a; there should be eight .ab1 files in total) and hit open.

This is the raw data that we obtain from sequencing PCR reactions. Scroll across and observe how the peaks and colors change.

What do you notice about sequence Porochilus.argenteus.Mach.1? What do the n’s mean (at bases 58 and 59)? Look at the chromatogram.
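Once a sequence is out of Mega as plain text, you can list these positions yourself. Here is a minimal Python sketch, assuming the base calls are available as a simple string. The function name and the example read are made up for illustration and are not the Porochilus data.

```python
def ambiguous_positions(seq):
    """Return 1-based positions of base calls that are not A, C, G or T."""
    return [i for i, base in enumerate(seq.upper(), start=1)
            if base not in "ACGT"]

# Hypothetical read, invented for this example
example_read = "ACGTNNACGT"
print(ambiguous_positions(example_read))  # [5, 6]
```

Compare any positions it reports against the chromatogram peaks at the same bases.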

Hit the fifth button from the left on the toolbar to add the sequence to the alignment explorer. Then close the chromatogram file. Repeat this for all eight chromatograms.

Now the alignment explorer window will be open with all eight sequences included. Scroll across to the right to see all of the data. The next task is to delete all data after base 800. To find what base you are up to, just click on one of the letters and the base number will be shown in the bottom left-hand corner. You can also type in 800 and hit return and it will take you to that base. Select all of the bases after that base (from 801 to the right): click on the boxes above the bases and, holding the mouse down, scroll right all the way to the end (wiggle the mouse to make it move quicker), then hit delete. Why are we deleting everything after 800 base pairs?
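The trimming step amounts to keeping bases 1 to 800 of every sequence. A small Python sketch of the same operation, assuming the sequences are held in a dictionary of name to string; the names and the short toy sequences are hypothetical, and the cutoff is set to 8 here only so the result is easy to read:

```python
def trim_alignment(seqs, last_base):
    """Keep bases 1..last_base (1-based, inclusive) of every sequence."""
    return {name: seq[:last_base] for name, seq in seqs.items()}

# Toy data, invented for this example
toy = {"seq_A": "ACGTACGTAC", "seq_B": "ACGTACG"}
trimmed = trim_alignment(toy, 8)
print(trimmed)  # {'seq_A': 'ACGTACGT', 'seq_B': 'ACGTACG'}
```

With the real data you would pass 800 as the cutoff. Sequences shorter than the cutoff are left unchanged.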

Go Data > export alignment > mega format

Name your data catfish.meg. In the next box you can call it whatever you want (it doesn’t matter), then answer yes (this is protein-coding sequence). Now close the alignment editor; it will ask you about saving the current alignment session to a file, so answer no. Now open the file in Mega (Go File > open a file/session).

To see the sequence data click on the icon TA to open the explorer window. Normally we check the amino acid coding for errors. Click the button near the middle of the toolbar that says UUC > Phe. Scroll across and look for any * or major mismatches. What does the ? mean?
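To see what the translation check is doing, here is a minimal Python sketch that translates a sequence codon by codon. It makes several assumptions that do not come from the exercise: it uses the standard genetic code (mitochondrial genes use a different table, so stops reported here may not match what Mega shows), it reads in frame 1, and it simply prints ? for any codon it cannot look up. The function name and the example sequence are made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate in frame 1; codons containing non-ACGT bases become '?'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "?")
                   for i in range(0, usable, 3))

# Hypothetical sequence: Met, Phe, an untranslatable codon, then a stop
print(translate("ATGTTCNNATAA"))  # MF?*
```

A * in the middle of a protein-coding gene usually points to a sequencing or alignment error, which is why it is worth scanning for.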

Click the UUC > Phe button again so that you can see the raw sequence data, then close the sequence data explorer.

In Mega, click on phylogeny > construct/test neighbor joining tree.

In the lower yellow box change gaps / missing data from complete deletion to pairwise deletion, under substitution model select Kimura 2-parameter model, and hit compute. (Click in the yellow box, it changes to an arrow; click the arrow and change the selection.) Mega will remember your selections so that you don’t need to reset them each time (most of the time).
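The Kimura 2-parameter model corrects the observed differences between two sequences by treating transitions (A/G and C/T changes) and transversions separately. Pairwise deletion means a site with a gap or missing base is skipped only for the pair of sequences being compared. Below is a minimal Python sketch of that calculation for one pair, using the standard formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function name is made up here, and this is an illustration rather than a copy of Mega's own code.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance with pairwise deletion of non-ACGT sites."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # pairwise deletion: skip this site for this pair only
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)

# Toy pair, invented for this example: one transition in nine comparable sites
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAN"), 4))
```

Neighbor joining then builds the tree from the matrix of these distances for every pair of sequences.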

Now you will see another window, make it a bit larger so you can see the whole thing. This is a quick and dirty output of your dataset using NJ. You need to tell the program which OTU is the outgroup, select the third button from the top on the left side (has a little green arrow pointing at a tree) then click on the branch leading to Plotosus lineatus (you have to do this for every tree you generate).

Now close this (it will always ask if you wish to save it; say no) and go back to phylogeny > construct/test neighbor joining tree. Run it with 1000 bootstrap replicates. To change the number of bootstrap replications, go to the phylogeny test, click in the upper yellow box, select bootstrap method, then set the no. of bootstrap replications to 1000. Hit compute.
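Each bootstrap replicate is a new dataset made by sampling the columns (sites) of the alignment with replacement until it is the same length as the original. A tree is built from every replicate, and the bootstrap value on a branch is the percentage of replicate trees that contain that group. Here is a minimal Python sketch of the resampling step only, assuming the aligned sequences are all the same length; the names, toy sequences and fixed random seed are made up for illustration:

```python
import random

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(seqs.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in seqs.items()}

# Toy alignment, invented for this example
toy = {"seq_A": "ACGTAC", "seq_B": "ACGTTC", "seq_C": "ATGTTC"}
rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(1000)]
print(len(replicates), replicates[0])
```

Because whole columns are sampled, each site stays intact across all taxa; only the set of sites that goes into each tree changes.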

The tree window will appear. Set the outgroup, then go to view > options > branch and hide values lower than 50 (click the box, then manually change the value to 50). You have to do this for every bootstrap tree.

Go to image > copy to clipboard

Open Word, start a new document, hit return at least three times and paste the image into the second line (it is easier to see the lines if you have it set to show all characters).

Close the tree window (say no) and run it again with phylogeny > construct/test maximum parsimony tree, set to do 100 bootstrap replications. Change gaps / missing data to use all sites and hit compute. Set the outgroup and hide bootstrap values below 50. Copy and paste into Word.

How do the two methods differ in the topology and bootstrap values?

Close the data file using the close data icon, then open the file plotoside2.meg.

Repeat the steps above and run NJ and MP for both 100 and 1000 reps each. Save the trees into Word (make a note in the document as to which analysis each tree came from and how many replicates you ran).

Do you get the same topology with more taxa? How do bootstrap values compare?

What does it all mean in terms of catfish relationships and systematics? Are all of the species monophyletic? Are all of the genera monophyletic? Which ones are not?
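A group is monophyletic on a rooted tree when its members are exactly the tips descending from a single node. If you want to check this systematically rather than by eye, here is a minimal Python sketch. It assumes the rooted tree has been typed in by hand as nested tuples, and the taxon names are hypothetical placeholders, not the catfish in the dataset.

```python
def clades(tree):
    """Return a list of frozensets, one per node, holding the tips below it."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result = []
    tips = set()
    for child in tree:
        child_clades = clades(child)
        result.extend(child_clades)
        tips |= child_clades[-1]  # last entry is the child's full tip set
    result.append(frozenset(tips))
    return result

def is_monophyletic(tree, group):
    """True if the group's tips are exactly the tips below some node."""
    return frozenset(group) in clades(tree)

# Hypothetical rooted tree: genus A forms one clade, genus B is split
toy_tree = ((("A_sp1", "A_sp2"), "B_sp1"), ("B_sp2", "Outgroup"))
print(is_monophyletic(toy_tree, ["A_sp1", "A_sp2"]))  # True
print(is_monophyletic(toy_tree, ["B_sp1", "B_sp2"]))  # False
```

The answer depends on where the tree is rooted, which is one reason the outgroup has to be set before reading the tree.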

Print out a parsimony tree for the second dataset.

Open the explorer window by clicking on the icon TA. Go to highlight > variable sites. All of the sites that vary will be highlighted in yellow. What do you notice about the patterns of variation relative to coding position (note that things are grouped into threes across the top line)? How does this vary in frequency?
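You can also tally the variable sites by codon position rather than judging by eye. Here is a minimal Python sketch, assuming the alignment starts at the first base of a codon and the sequences are uppercase strings of equal length. The toy sequences are invented and say nothing about what the catfish data will show.

```python
def variable_sites_by_codon_position(seqs):
    """Count variable alignment columns at codon positions 1, 2 and 3."""
    counts = {1: 0, 2: 0, 3: 0}
    for i, column in enumerate(zip(*seqs)):
        bases = {b for b in column if b in "ACGT"}  # ignore gaps and N
        if len(bases) > 1:
            counts[i % 3 + 1] += 1
    return counts

# Toy alignment of two codons, invented for this example
toy = ["ATGCCA",
       "ATGCCG",
       "ATACTG"]
print(variable_sites_by_codon_position(toy))  # {1: 0, 2: 1, 3: 2}
```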

See if you can determine how some of those variable positions are used to construct the tree. For instance, base 8 has a unique change in four OTUs from a T to a C. Where does that change occur in the tree? The four OTUs are all in the same lineage, thus this represents a single unique change in the ancestor of those four OTUs.

Look at base 18, some individuals change from a T to a C (or is it a C to a T?). How does this change define various lineages?

Now look at base 27, most individuals are C, some are T, and one each is G or A. How does this define different lineages?

What about base 30? Does that tell us anything?

Look at a few more examples to see how the data vary, and how the method of analysis places those characters on the tree.
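One way to see how parsimony places a character on a tree is to count the fewest changes a single site needs on a given topology. Below is a minimal Python sketch of Fitch's algorithm, a standard way of doing that count. It assumes a rooted, strictly bifurcating tree written as nested tuples. The tip names and base states are invented for illustration, and nothing here is claimed to match Mega's internal code.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one change is needed on this part of the tree
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical tree with four ingroup tips and an outgroup
toy_tree = ((("t1", "t2"), ("t3", "t4")), "out")

# Site where one lineage (t1 + t2) shares a derived C: a single change
site_a = {"t1": "C", "t2": "C", "t3": "T", "t4": "T", "out": "T"}
# Site where the C is scattered across lineages: two changes are needed
site_b = {"t1": "C", "t2": "T", "t3": "C", "t4": "T", "out": "T"}

print(fitch(toy_tree, site_a)[1])  # 1
print(fitch(toy_tree, site_b)[1])  # 2
```

A site like the first one supports the grouping of t1 and t2, while a site like the second conflicts with this tree and would fit a different topology with fewer changes.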
