Phylogenetic trees depict relationships between species or individuals, reconstructed from molecular data. Two commonly used methods for reconstructing these relationships are maximum likelihood (ML) and maximum parsimony (MP). Maximum likelihood evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. The topology with the highest likelihood is chosen. Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data.
Maximum Likelihood
Maximum likelihood methods have several advantages over other methods: they may have lower variance (they are the least affected by sampling error), tend to be robust to violations of the assumptions in the evolutionary model, are statistically well founded, can statistically evaluate different tree topologies, and use all of the sequence information. There are also some disadvantages: they are very computationally intensive (slow), and the result depends on the model of evolution (Opperdoes, 1997a).
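As an illustration of what "the probability that the model and the hypothesized history give rise to the data" means, the following minimal sketch computes the log likelihood of a fixed tree with Felsenstein's pruning recursion. It assumes the simple Jukes-Cantor model rather than any model used by the programs below, and the tree representation (a leaf is an aligned sequence string, an internal node is a list of (child, branch length) pairs) and all function names are hypothetical choices made for this example.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, site):
    """Likelihood of the data below `node`, for each possible base at `node`."""
    if isinstance(node, str):  # leaf: the node is its aligned sequence
        c = node[site]
        # Characters other than A, C, G, T are treated as missing data.
        return [1.0 if (c == b or c not in BASES) else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:  # internal node: list of (child, branch length)
        below = partial(child, site)
        for x, bx in enumerate(BASES):
            result[x] *= sum(jc69_prob(bx, by, t) * below[y]
                             for y, by in enumerate(BASES))
    return result

def log_likelihood(tree, n_sites):
    """Sum of per-site log likelihoods, with equal base frequencies at the root."""
    total = 0.0
    for site in range(n_sites):
        total += math.log(sum(0.25 * v for v in partial(tree, site)))
    return total

# Hypothetical four-taxon example; the sequences and branch lengths are made up.
tree = [([("ACGT", 0.1), ("ACGA", 0.1)], 0.05), ("ATGA", 0.2), ("ATGT", 0.2)]
print(log_likelihood(tree, 4))
```

A tree search repeats this calculation for many topologies and branch lengths, which is why the method is slow.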
There are four maximum likelihood programs available on iNquiry.
Program          Data
Tree-Puzzle      DNA or protein sequence
PAML (CODEML)    DNA or protein sequence
DNAML (PHYLIP)   DNA sequence
fastDNAml        DNA sequence
Tree-Puzzle
Tree-Puzzle is a program for maximum likelihood analysis of DNA or protein sequence data. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns an estimate of support to each internal branch. It also computes pairwise maximum likelihood distances as well as branch lengths for user-specified trees.
Input: Sequence input is requested as an alignment file in PHYLIP interleaved format. The user must choose the type of sequence, DNA or protein.
Options: The user may choose the model of substitution to be applied; HKY (Hasegawa et al. 1985) is the default for DNA and Dayhoff (Dayhoff et al. 1978) is the default for protein sequence. The user may enter the transition/transversion ratio and nucleotide frequencies; if these are left blank, the program will estimate them from the data set. There are options for the model of rate heterogeneity; the default is a uniform rate. The last two groups of options are for a user-specified tree and for the output. In the output options the user may designate a sequence as the outgroup by giving its position in the alignment file (for example, the first sequence is 1 and the fourth sequence is 4).
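To make the two data-derived parameters concrete, here is a small sketch that tabulates base frequencies and counts transitions and transversions in an alignment. These are naive empirical counts only; how Tree-Puzzle actually estimates the parameters is not described here, and the function names and example sequences are hypothetical.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def base_frequencies(seqs):
    """Proportion of each base over all sequences, ignoring gaps and ambiguity codes."""
    counts = {b: 0 for b in "ACGT"}
    for seq in seqs:
        for c in seq.upper():
            if c in counts:
                counts[c] += 1
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}

def count_ts_tv(seqs):
    """Count observed transitions and transversions over all sequence pairs."""
    ts = tv = 0
    for a, b in combinations(seqs, 2):
        for x, y in zip(a.upper(), b.upper()):
            if x == y or x not in "ACGT" or y not in "ACGT":
                continue
            if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
                ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
            else:
                tv += 1  # purine <-> pyrimidine
    return ts, tv

# Hypothetical three-sequence alignment.
alignment = ["AACG", "GACT", "AATG"]
print(base_frequencies(alignment))
print(count_ts_tv(alignment))
```

Raw counts of this kind do not correct for multiple substitutions at the same site, so they are only a rough guide to the value a model-based estimate would give.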
Output: When used with the default options, Tree-Puzzle gives a summary of the sequence data input, maximum likelihood distances, a quartet puzzling tree, and any other trees that occurred more than 5% of the time in the 1000 (default) puzzling steps.
MAXIMUM LIKELIHOOD BRANCH LENGTHS ON QUARTET PUZZLING TREE (NO CLOCK)

Branch lengths are computed using the selected model of
substitution and rate heterogeneity.

           :----3 AF157877
      :----6
      :    :----------4 AF157953
 :-----7
 :     :----------------------5 GVO389531
 :
 :---2 AF157941
 :
 :--1 AF157928

          branch  length     S.E.   branch  length     S.E.
AF157928       1  0.01919  0.00460       6  0.02588  0.00694
AF157941       2  0.02246  0.00491       7  0.03991  0.00797
AF157877       3  0.03741  0.00725
AF157953       4  0.10455  0.01119     8 iterations until convergence
GVO389531      5  0.23022  0.01835     log L: -3347.60

Quartet puzzling tree with maximum likelihood branch lengths
(in CLUSTAL W notation):

(AF157928:0.01919,((AF157877:0.03741,AF157953:0.10455)51:0.02588,
GVO389531:0.23022)100:0.03991,AF157941:0.02246);
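The parenthesized tree at the end of this output can be read by other programs. As a rough sketch, the Python below pulls the tip branch lengths and the internal support values out of the string shown above using regular expressions. It is not a general Newick parser: it assumes tip names start with a letter and that lengths are plain decimals, and the function name is hypothetical.

```python
import re

def read_tree_labels(newick):
    """Return ({tip name: branch length}, [(support, branch length), ...])."""
    text = "".join(newick.split())  # drop line breaks and spaces
    tips = {name: float(length)
            for name, length in re.findall(r"([A-Za-z_]\w*):(\d+\.?\d*)", text)}
    internal = [(int(support), float(length))
                for support, length in re.findall(r"\)(\d+):(\d+\.?\d*)", text)]
    return tips, internal

# The quartet puzzling tree from the sample output above.
newick = """(AF157928:0.01919,((AF157877:0.03741,AF157953:0.10455)51:0.02588,
GVO389531:0.23022)100:0.03991,AF157941:0.02246);"""

tips, internal = read_tree_labels(newick)
print(tips)
print(internal)
```

In this tree the numbers 51 and 100 after the closing parentheses are the support values for the two internal branches.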
PAML
The PAML package on iNquiry provides the program codeml, which performs maximum likelihood analysis of DNA or protein sequence. Two old PAML programs, codonml and aaml, were combined to create codeml.
Input: DNA or protein sequence may be pasted in directly or a file may be specified. The sequence data must begin with the number of sequences and the number of characters, followed by each sequence name and then its sequence (see the example input for ProtPars). The user may also input a tree structure file.
Options: There are general options for running the program and options specific to DNA and to protein. The common options cover the output file names, the type of sequence, the tree, and other parameters for estimating trees. It is very important to specify the tree option: the default in the pull-down list is 0 (user-specified tree), so a user who is not supplying a tree must choose a different option. Codon sequence options apply to DNA sequence data and include the model, codon frequency, genetic code, and kappa and omega values. It is also very important to specify the genetic code to be used (the default of 0, the universal code, does not work for mammalian mitochondrial DNA sequence). Amino acid sequence options are the model, alpha, and the matrix. If an empirical model is chosen from the pull-down menu, the user must specify a matrix file.
Output: PAML writes three output files. The rst file gives codon sites with position differences and star trees; the mlc file gives site patterns, sequence differences, codon usage in sequences, a distance matrix, and the best tree.
best tree: (((1, 2), 4), 3, 5); lnL: -2853.476553
DNAml
DNAml is part of the PHYLIP package. fastDNAml performs the same functions using less memory.
fastDNAml
fastDNAml infers unrooted maximum likelihood trees from aligned DNA sequence. It is faster than DNAml and can save its progress toward finding a tree, so a run can be restarted from a checkpoint.
Input: Aligned DNA sequence.
Options: The user may specify the base frequencies or check the box for the program to derive them from the sequence data. The user may specify an outgroup (by the order of the sequences, as in Tree-Puzzle) and the transition/transversion ratio. If the interleaved box is left checked, the program will convert the sequence from FASTA format to PHYLIP interleaved format. There are options for bootstrapping the tree(s) found by the program, and for the display of the output and the rearrangement of trees. The last two options are for user-specified weights and trees.
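For readers preparing input by hand, the FASTA to PHYLIP interleaved conversion mentioned above can be sketched as follows. This is an illustration of the file layout only, not the code iNquiry runs; the function name, the block width of 60, and the example sequences are assumptions.

```python
def fasta_to_phylip_interleaved(fasta_text, width=60):
    """Convert aligned FASTA text to PHYLIP interleaved text."""
    names, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "unnamed")
            seqs.append("")
        elif not seqs:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            seqs[-1] += line
    if not seqs:
        raise ValueError("no sequences found")
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned to the same length")
    n_sites = lengths.pop()

    out = [f"{len(seqs)} {n_sites}"]
    for start in range(0, n_sites, width):
        if start:
            out.append("")  # blank line between interleaved blocks
        for name, seq in zip(names, seqs):
            # Names appear only in the first block, padded or cut to ten characters.
            prefix = name[:10].ljust(10) if start == 0 else ""
            out.append(prefix + seq[start:start + width])
    return "\n".join(out) + "\n"

# Hypothetical two-sequence alignment, written in blocks of four sites.
print(fasta_to_phylip_interleaved(">a\nACGT\nAC\n>b\nACGA\nAT\n", width=4))
```

Because names are cut to ten characters, two long FASTA names that share their first ten characters would collide and should be renamed first.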
Output: The first output file is a tree and the second is a summary of the results. The summary gives the aligned sequences with any variable positions, which are called distinct data patterns. The program finds an unrooted tree and gives branch length values and approximate confidence limits.
Maximum Parsimony
Maximum parsimony methods search all possible tree topologies for the optimal (minimal) tree. Advantages of maximum parsimony are that it is based on shared, derived characters and is therefore a cladistic method, that it tries to provide information on the ancestral sequences, and that it evaluates different trees. Disadvantages are that it does not use all of the sequence information (only informative sites), does not correct for multiple mutations (no model of evolution), does not provide information on branch lengths, and is sensitive to codon bias (Opperdoes, 1997b). For more information on parsimony see Felsenstein (2004). There are two maximum parsimony programs for sequence data available on iNquiry; both are from the PHYLIP package.
Program     Data
PROTPARS    Protein sequence
DNAPARS     DNA sequence
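The criterion of counting evolutionary steps can be illustrated with Fitch's algorithm, which finds the minimum number of changes a given tree requires at each site. The sketch below assumes a binary tree written as nested pairs of sequence names and unweighted, unordered characters. It is not the algorithm as implemented in PROTPARS or DNAPARS, and the names and data are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum steps) for one site."""
    if isinstance(tree, str):  # leaf: look up its observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # the children can agree, so no extra step is needed
        return shared, left_steps + right_steps
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_steps + right_steps + 1

def parsimony_score(tree, alignment):
    """Total number of steps the tree requires over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total

# Hypothetical alignment and two candidate topologies.
alignment = {"a": "AAG", "b": "AAG", "c": "CAT", "d": "CAT"}
print(parsimony_score((("a", "b"), ("c", "d")), alignment))
print(parsimony_score((("a", "c"), ("b", "d")), alignment))
```

The first topology groups the identical sequences and needs fewer steps than the second, so it is the more parsimonious of the two. The middle site is constant and contributes nothing to either score, which is the sense in which only informative sites are used.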
PROTPARS
This program applies a novel method for inferring an unrooted phylogeny from protein sequences. The user should consult the program's manual for the assumptions of the method.
Input: Aligned protein sequence, where the first line contains the number of species and the number of amino acid positions, followed by the species data. Each sequence starts on a new line and has a ten-character species name, immediately followed by the species data in one-letter code.
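The input layout just described can be checked before submission with a short reader such as the one below. It is a minimal sketch that assumes the sequential form of the format, in which a sequence may continue on following lines; the function name and the example species names and sequences are hypothetical.

```python
def parse_phylip_sequential(text):
    """Read PHYLIP sequential text into a {species name: sequence} dictionary."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(x) for x in lines[0].split()[:2])
    records = {}
    pos = 1
    for _ in range(n_species):
        if pos >= len(lines):
            raise ValueError("fewer sequences than the header promises")
        name = lines[pos][:10].strip()            # first ten characters
        seq = "".join(lines[pos][10:].split())    # data follows immediately
        pos += 1
        while len(seq) < n_sites and pos < len(lines):
            seq += "".join(lines[pos].split())    # continuation lines
            pos += 1
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} positions, found {len(seq)}")
        records[name] = seq
    return records

# Hypothetical two-species protein alignment of six positions.
example = "2 6\nCYB_SPM   MTNIRK\nCYB_GLV   MTN\nLRK\n"
print(parse_phylip_sequential(example))
```

A mismatch between the header counts and the data is a common cause of input errors, so the sketch raises an error rather than guessing.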
Options: There is an option for using threshold parsimony and specifying the threshold value, and an option for specifying the genetic code to be used. There are also options for randomizing and bootstrapping, as well as input for a user-specified tree. The user may choose the output options and specify an outgroup by giving the position of the sequence in the input (the first sequence is 1, etc.).
Output: The program gives the most parsimonious tree (or trees).
One most parsimonious tree found:
  +-----------05CYB_GLV
  4
  !  +--------04CYB_MAM
  +--3
     !  +-----03CYB_SPT
     +--2
        !  +--02CYB_SPT
        +--1
           +--01CYB_SPM
remember: (although rooted by outgroup) this is an unrooted tree!
requires a total of 64.000
DNAPARS
This program searches bifurcating and multifurcating trees for the most parsimonious trees. It saves a number of trees tied for best and rearranges all of the saved trees.
Input: Aligned DNA sequence, where the first line contains the number of species and the number of nucleotide positions, followed by the species data. Each sequence starts on a new line and has a ten-character species name, immediately followed by the species data.
Options: There is an option for using threshold parsimony and specifying the threshold value. There are also options for randomizing and bootstrapping, as well as input for a user-specified tree. The user may choose the weight and output options and specify an outgroup by giving the position of the sequence in the input (the first sequence is 1, etc.).
Output: The program gives the most parsimonious tree (or trees) and distances.
References
All information contained in this document was obtained from the manual of the respective program or from Nei, M. and Kumar, S. 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Inc., New York, unless cited otherwise.
Felsenstein, J. 2004. Statistical properties of parsimony, pp. 97-122 and A digression on history and philosophy, pp. 123-146, in Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, Massachusetts.
Opperdoes, F. 1997a. Maximum Likelihood. Retrieved 20 April 2004. http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
Opperdoes, F. 1997b. Maximum Parsimony Analysis. Retrieved 20 April 2004. http://www.icp.ucl.ac.be/~opperd/private/parsimony.html
