返回首页

Pairwise and multiple sequence alignment

时间:2004-11-23 03:51来源: 作者:unkonwn 点击:
  Pairwise and multiple sequence alignment Introduction

 

Related (homologous) DNA and protein sequences can be optimally aligned by computer, applying suitable algorithms. Simple ones introduce gaps where necessary to optimise the alignment, and only score identical residues in the same place in the two or more sequences.

Scoring homology

For DNA comparisons, identity scores 1, and non-identity scores 0. Because transition mutations are more likely than transversions, an intermediate score might be given if purine is opposite purine or pyrimidine is opposite pyrimidine.

For protein sequences, the simple scoring of 1 for a match and 0 for a mismatch can be elaborated much further. Non-identical amino acids can be related by physico-chemical properties size, charge, acidic or basic, hydrophilic or hydrophobic, and comparison of real sequences shows that conservative replacement (replacement by a similar amino acid) is more likely to occur/persist than non-conservative; thus it can be given an intermediate value in calculting similarity. A second possibility is to compare amino acids in a given position by the number of mutational changes required to achieve the change of one to the other (one, two or three BPS). An examination of the genetic code shows that there are in many cases codon similarities between physicochemically similar amino acids, so the two approaches are not entirely independent. Weighting scoring for similarities between amino acids has led to the development of a number of scoring matrices, MDM78 and the BLOSUM and PAM series. These are used in alignment programs both to optimise alignment and to determine the final similarity score.

Global or local homology

Sequence similarities can be global (throughout an entire gene and/or its encoded protein, or local (a conserved domain in otherwise dissimilar genes and/or proteins. Needleman and Wunsch (1970) devised an algorithm for global comparison of two protein sequences, which is conceptually similar to a dot-plot, identifying the best path through a 2-D comparison of two sequences, scoring similarity with a suitable scoring matrix, and deducting penalties for gaps in either sequence necessary for optimal alignment.

For optimal detection of local homology, the first solution was that of Smith and Waterman (1981). However, as we are concerned here with global alignment, we′ll leave that for now.

Multiple alignment

Multiple sequence alignment is a prerequisite to determining levels of homology, and hence relatedness, between members of a series of globally related sequences. Those sequences can be genes or proteins with the same function in different species (homologs), related but diverged genes or proteins resulting from gene duplication in the same species (orthologs), or functionally diverged genes or proteins from different species (also orthologs).

There are a variety of different approaches to multiple sequence alignment, some just mechanical calculations, and others attempting to reconstruct the actual evolutionary path from the common ancestor to the current derivatives. The most popular representative of the former is the neighbour-joining algorithm at the heart of CLUSTAL. The archetypal example of the latter evolutionary approach is the "maximum parsimony" of the PHYLIP suite of Felsenstein. "Maximum parsimony" is an attempt to compute the minimum number of mutations required to evolve from a single ancestral sequence to achieve those variant homologs in the dataset under analysis. It is extremely CPU-intensive and invariably comes up with alternative solutions. "Neighbour joining" is a method in which all pairwise comparisons are made within the sequence set, the most similar sequences are grouped and optimally aligned, and then groups are progressively added into the alignment. CLUSTAL is the prime example of this type of program. It is fast and efficient, and although there is no direct "biological" basis as there is in parsimony methods, its results are generally sensible.

CLUSTAL has been through a series of versions, the most recent of which is CLUSTAL X. However, CLUSTAL W is still found on some computer systems, as not all will display the full features of the latest version. CLUSTAL X is running on the PC cluster, and may be downloaded from Strasburg for installation on your own computer. Clustal X offers come choice of alignment parameters (gap penalties etc.), alternative scoring matrices for both protein and DNA sequences, and a range of output format options for the aligned file (Clustal, GCG/MSF, GDE, NBRF/PIR, Phylip).

How to use Clustal X for multiple alignment Clustal can be found on ISS cluster computers in the Genetics software via the Departmental software icon. On the M Res cluster it is located on the C: drive in Biology Software/Clustalx/.

Sequence files mentioned below are at this location and should be copied to your home directory for use.

Clustal reads in a data file containing individual sequences in one of several standard formats, and the data files provided are in NBRF/PIR format. The first file is globins.ord containing a selection of hemoglobin proteins and myoglobins. The second is someglobins.ord which contains a subset of the sequences in globins.ord , and the third is humanglob.ord containing only human hemoglobins and myoglobin.

Start up Clustal X, go to the File menu, and Load sequences. To do this, select the globins.ord file. The window will display the unaligned sequences, coloured according to their physicochemical type. To the left, individual sequences are identified by code. Below is a graph showing the level of identity at each position. Look at the Alignment menu to see options for changing alignment parameters, output formats etc.. Ignore these for now and select Do complete alignment. Accept the default alignment file name, globins.aln. Watch the indication of progress shown in the bottom left of the Clustal window. At its completion, the optimally aligned sequences are displayed. The vertical stripes of colour show alignments of identical or similar residues, the stars above indicate identity of all sequences at a position, and the graph below indicates the similarity score at each position. Scroll across the alignment and observe the way that gaps have been inserted in some subsets to achieve optimal alignment. There are few of these as the hemoglobin-myoglobin family is quite highly conserved.

Repeat with someglobins.ord, and save the default .aln alignment file. Then look in the Alignment menu at Output format options. The results of saving in the different output formats can be examined by viewing the .aln, .ph etc. files in Notepad. Then look at in the Alignment menu at Alignment parameters, Multiple alignment parameters and the consequences on the alignment at the options to change gap penalties and scoring matrices.

Finally align humanglob.ord, and save the alignment in the default .aln format.

As all three alignments have the three human sequences in common, compare the major features of the .aln alignments to see if the major features are conserved (are gaps all in common, do all three alignments show the most highly conserved regions in the same places?). Calculating "phylogenetic trees" (really cladograms) with Clustal X Multiple alignments are used to identify the differential levels of conservation of different parts of homologous or orthologous sequences, insertions or deletions, and other features. They are also the basis for calculation of the levels of similarity between sequences, and therefore their evolutionary distances apart. This is then the basis for the computation of evolutionary trees (strictly speaking, "cladograms").

Clustal X has a fast and efficient tree calculation function, accessed via the Trees menu. Load Clustal X, and load one of the .aln alignment output files you created above. Go to the Trees menu and Draw .nj tree ("nj" stands for "neighbour joining"). Accept the default .ph output file and do and save the tree. Look at this file in notepad and observe the way in which the tree is represented in summary format, with brackets indicating the grouping of nodes (sequences) and numbers indicating branch lengths. It is possible (but not necessary as we shall see later) to manually draw tree from the data in this file

(
(
(
(
(
(
(
(
(
PYR5_DICDI:0.25529,
NGRUMPA:0.27184)
:0.01798,
PYR5_DROME:0.27845)
:0.01037,
(
(
S46440:0.09167,
NTU22260N:0.08411)
:0.11295,
(
(
PYR5_BOVIN:0.06227,
PYR5_MOUSE:0.05224)
:0.00196,
PYR5_HUMAN:0.05147)
:0.14300)
:0.03529)
:0.01606,
(
(
(
(
(
DCOP_TRIRE:0.05587,
DCOP_TRIHA:0.05260)
:0.05237,
(
(
JC4103:0.00000,
DCOP1ACRSP:0.00000)
:0.00842,
JC4104:0.00798)
:0.10264)
:0.01563,
DCOP_CEPAC:0.12611)
:0.06705,
(
DCOP_NEUCR:0.03103,
DCOP_SORMA:0.02719)
:0.16920)
:0.08808,
(
(
(
(
DCOP_ASPNG:0.04464,
AOPYRG:0.03839)
:0.01994,
AFPYRG:0.05948)
:0.02222,
DCOP_PENCH:0.08276)
:0.03406,
DCOP_ASPNI:0.13230)
:0.12488)
:0.01190)
:0.01193,
S41014:0.37098)
:0.01771,
(
(
(
DCOP_SCHCO:0.17570,
DCOP_USTMA:0.18694)
:0.02268,
(
(
(
DCOP_RHIRA:0.00000,
DCOP_MUCCI:0.00000)
:0.06460,
DCOP_PHYBL:0.04484)
:0.02105,
DCOP_RHINI:0.07706)
:0.09509)
:0.01007,
DCOP_SCHPO:0.19014)
:0.03503)
:0.04420,
DCOP_ENDMA:0.17028)
:0.01150,
DCOP_YARLI:0.18332)
:0.03980,
(
(
(

(

(

(

AB006207:0.08928,

DCOP_CANTR:0.06428)

:0.01160,

DCOP_CANMA:0.06042)

:0.00512,

DCOP_CANAL:0.09607)

:0.01706,

(

(

 

DCOP_PICST:0.07731, DCOP_CANBO:0.10995)

 

:0.01267, DCOP_PICOH:0.07952)

:0.00553)

:0.00610,

(

DCOP_HANPO:0.00000,

DCOP_PICAN:0.00000)

:0.11371)

:0.00429,

 

DCOP_CANPA:0.12992) :0.00416,

(

(

(

DCOP_HANAN:0.00000,

PAURA3:0.00000)

:0.06414,

DCOP_HANFA:0.05284)

:0.04980,

(

(

DCOP_KLULA:0.05977,

DCOP_KLUMA:0.06383)

:0.01836,

(

DCOP_YEAST:0.10828,

DCOP_CANGA:0.07285)

:0.00120)

 

:0.04310) :0.01624);

 

Go back into Clustal X with the same .aln alignment file, into the Trees menu, into the Output options menu, and select Clustal format output (.nj file extension). Create this output file and load it into notepad. This output file is much longer, and gives details of all steps in the tree calculation process starting with the scores for all pairwise comparisons and ending with computed branching order and branch lengths. Again the data provide all that is necessary to draw the tree manually.

Cycle 1 = SEQ: 45 ( 0.00000) joins SEQ: 46 ( 0.00000)

Cycle 2 = SEQ: 44 ( 0.00800) joins Node: 45 ( 0.00844)

Cycle 3 = SEQ: 47 ( 0.03123) joins SEQ: 48 ( 0.02760)

Cycle 4 = SEQ: 41 ( 0.05632) joins SEQ: 42 ( 0.05272)

Cycle 5 = SEQ: 37 ( 0.04117) joins SEQ: 38 ( 0.03547)

Cycle 6 = SEQ: 14 ( 0.00000) joins SEQ: 15 ( 0.00000)

Cycle 7 = Node: 37 ( 0.02006) joins SEQ: 39 ( 0.05475)

Cycle 8 = SEQ: 8 ( 0.04930) joins SEQ: 9 ( 0.05070)

 

Cycle 9 = SEQ: 7 ( 0.06344) joins Node: 8 ( 0.00195)

Cycle 10 = Node: 41 ( 0.04957) joins Node: 44 ( 0.10188)

Cycle 11 = SEQ: 36 ( 0.08045) joins Node: 37 ( 0.02291)

Cycle 12 = Node: 41 ( 0.01669) joins SEQ: 43 ( 0.12586)

Cycle 13 = SEQ: 5 ( 0.09028) joins SEQ: 6 ( 0.08227)

Cycle 14 = Node: 36 ( 0.03460) joins SEQ: 40 ( 0.13176)

Cycle 15 = SEQ: 13 ( 0.04179) joins Node: 14 ( 0.06205)

Cycle 16 = Node: 13 ( 0.01937) joins SEQ: 16 ( 0.07294)

Cycle 17 = Node: 41 ( 0.06619) joins Node: 47 ( 0.16627)

Cycle 18 = SEQ: 19 ( 0.00000) joins SEQ: 27 ( 0.00000)

Cycle 19 = SEQ: 17 ( 0.00000) joins SEQ: 26 ( 0.00000)

Cycle 20 = Node: 5 ( 0.11491) joins Node: 7 ( 0.14222)

Cycle 21 = SEQ: 10 ( 0.16349) joins SEQ: 11 ( 0.18238)

Cycle 22 = SEQ: 1 ( 0.26303) joins SEQ: 3 ( 0.27185)

Cycle 23 = Node: 1 ( 0.01156) joins SEQ: 4 ( 0.27740)

Cycle 24 = Node: 10 ( 0.02746) joins Node: 13 ( 0.09652)

Cycle 25 = Node: 1 ( 0.01095) joins Node: 5 ( 0.03542)

Cycle 26 = Node: 36 ( 0.12477) joins Node: 41 ( 0.08820)

Cycle 27 = Node: 10 ( 0.00883) joins SEQ: 12 ( 0.18851)

Cycle 28 = Node: 1 ( 0.01392) joins SEQ: 2 ( 0.36383)

Cycle 29 = Node: 1 ( 0.01238) joins Node: 36 ( 0.02250)

Cycle 30 = Node: 1 ( 0.01801) joins Node: 10 ( 0.03471)

Cycle 31 = SEQ: 18 ( 0.05203) joins Node: 19 ( 0.06161)

Cycle 32 = SEQ: 20 ( 0.05263) joins SEQ: 21 ( 0.05763)

Cycle 33 = Node: 1 ( 0.04521) joins SEQ: 31 ( 0.16974)

Cycle 34 = Node: 20 ( 0.01932) joins SEQ: 22 ( 0.10616)

Cycle 35 = Node: 20 ( 0.00114) joins SEQ: 23 ( 0.07205)

Cycle 36 = Node: 1 ( 0.01227) joins SEQ: 35 ( 0.18286)

Cycle 37 = SEQ: 28 ( 0.09114) joins SEQ: 30 ( 0.06476)

Cycle 38 = Node: 1 ( 0.03697) joins Node: 20 ( 0.04391)

Cycle 39 = SEQ: 25 ( 0.06779) joins SEQ: 33 ( 0.11403)

Cycle 40 = Node: 28 ( 0.01046) joins SEQ: 29 ( 0.06153)

Cycle 41 = Node: 28 ( 0.00503) joins SEQ: 32 ( 0.09729)

Cycle 42 = Node: 1 ( 0.00664) joins Node: 17 ( 0.11223)

Cycle 43 = SEQ: 24 ( 0.08059) joins Node: 25 ( 0.01120)

Cycle 44 = Node: 18 ( 0.06399) joins SEQ: 34 ( 0.12351)

Cycle 45 = Node: 1 ( 0.01070) joins Node: 18 ( 0.00538)

 

Cycle 46 (Last cycle, trichotomy): Node: 1 ( 0.00245) joins

Node: 24 ( 0.00547) joins

Node: 28 ( 0.01481)

Another option in the Trees menu is the option of including or excluding gaps. A gap requires no more than a two-hit event to create it, but it can be given too much weight in a pairwise comparison. Hence this option.

A problem with reconstructing evolutionary trees is that it is (or was) an "one off" experiment. There is no way of repeating it to experimentally verify if it is repeatable. Various methods have been devised to attempt to validate trees, and a favoured one used in Clustal X is "bootstrap resampling". This resamples the dataset a number of times (each with a different subset) to see how robust the predicted tree is (what proportion of the time the commonest tree is predicted). To be really hard-nosed, one resamples 1000 times and looks for the same prediction 950 times or more (the statistician′s 95% confidence limit). However, lower levels can be taken as a strong likelihood.

To bootstrap resample, with the same aligned sequences, go to the Trees menu, select the Clustal format output option, and then Bootstrap nj tree. This outputs a tree with the .njb file extension. The file is just the same as the .nj format, but at the end it has the bootstrap results, e.g.

********************************************..** 994

*******************************************...** 1000

**********************************************.. 1000

****************************************..****** 1000

************************************..********** 898

*************..********************************* 1000

************************************...********* 945

*******..*************************************** 397

******...*************************************** 1000

****************************************..*...** 816

***********************************....********* 996

****************************************......** 1000

****..****************************************** 1000

***********************************.....******** 1000

************...********************************* 797

************....******************************** 1000

****************************************........ 1000

******************.*******.********************* 1000

****************.********.********************** 1000

****.....*************************************** 983

*********..************************************* 932

.*.********************************************* 550

.*..******************************************** 445

*********..*....******************************** 546

.*.......*************************************** 301

***********************************............. 628

*********.......******************************** 994

.........*************************************** 422

.........**************************............. 866

................*******************............. 999

*****************..*******.********************* 1000

*******************..*************************** 883

................**************.****............. 667

*******************...************************** 415

*******************....************************* 999

................**************.***.............. 997

***************************.*.****************** 721

................***....*******.***.............. 396

************************.*******.*************** 427

***************************...****************** 408 ***************************...*.**************** 446 .................**....**.****.***.............. 197

***********************..*******.*************** 185

*****************..*******.******.************** 138 .......................**..***.**............... 72 111111111111111111111112211333132111111111111111

The columns of stars and dots in the table represent the sequences in the dataset, 1 to n from left to right. Each row represents the separation of the sequences into two groups (clades), the stars and the dots. The branch tahat separates the star clade from the dot clade occurs in the resampled trees the number of times indicated at the right end of each line out of the total number of resamplings. Thus the validity of any predicted branch can be quantified.

Displaying and editing trees with Treeview Treeview is a program which converts tree files of several forms (including the .ph output from Clustal) into a graphical format, drawing the tree in one of three formats, permitting editing (swapping branches, changing font styles and sizes, etc.). It thus eliminates manual drawing of trees and produces publication-quality output. Treeview is accessible in ISS clusters from Genetics Departmental Software. In the M Res cluster it is on the C: drive in Biology SoftwareTreeview

Although it is alrady installed on the cluster, there is a web site from which Treeview may be downloaded onto your own computer.

Start up Treeview, and use it to draw trees for the three hemoglobin datasets that you aligned. For each, use the options to clearly display the trees, and print them out.

The key to sequence codes is:

First Part of code:

MYG = myoglobin

HBA = hemoglobin α

HBAT = hemoglobin θ

HBAZ = hemoglobin ζ

HBB = hemoglobin β

HBG = hemoglobin γ

HBD = hemoglobin δ HBE = hemoglobin ε

 

Second part of code:

HUMAN = human - primate mammal

BOVIN = cattle - placental mammal

MACxx = kangaroo - marsupial mammal

CHICK = chicken - bird

ALLxx = alligator - reptile

ENLA = toad - amphibian

THUxx = tuna - bony fish HETPO = shark - cartilaginous fish

Then look at the tree you have drawn for someglobins. Because this contains some orthologs and some homologs, the evolutionary relationships are unexpected to say the least. It shows three main groups:

cow, shark and toad human and bony fish shark, chick and kangaroo This is a most unlikely evolutionary tree. Explain these anomalous results.

You will find that it is helpful to look at the tree from globins and see where the someglobins sequences fit. This illustrates the problems which arise with a mixed subset of homologs and orthologs, but which become clear when all homologs and orthologs are included.

Finally, look at the humanglob tree to look at a set of entirely human orthologs, the results of a series of gene duplications. This shows duplications in both the α and β lines. On the basis of the tree, describe the various gene duplucations which have occurred to generate the current human gene set.


顶一下
(0)
0%
踩一下
(0)
0%
------分隔线----------------------------
最新评论 查看所有评论
发表评论 查看所有评论
请自觉遵守互联网相关的政策法规,严禁发布色情、暴力、反动的言论。
评价:
表情:
用户名: 密码: 验证码:
发布者资料
admin 查看详细资料 发送留言 加为好友 用户等级: 注册时间:2008-12-30 06:12 最后登录:2008-12-31 04:12