66 Cards in this Set

  • Front
  • Back

Alignment Algorithms

1. Global Alignment

2. Local Alignment

Methods of Alignment

1. Dot Plot

2. Dynamic Programming

3. Word Method

Dot Plot Matrix Method

A basic sequence alignment method.

A graphical way of comparing two sequences in a two-dimensional matrix.

The two sequences are written along the vertical and horizontal axes of the matrix.

A dot is placed in the matrix wherever the residues match; otherwise the position is left blank.

When the sequences share regions of similarity, the dots form diagonal lines. Interruptions between the diagonals mark regions of insertion or deletion.

Parallel diagonals represent repetitive regions of the sequences.

Problems with Dot Plots

1. Comparing large sequences produces a high noise level.

Solving the Noise Problem of Dot Plots

A sliding window of fixed length is used.

The window scans across the two sequences and compares all possible matches.

The size of the window can be adjusted.

Sensitivity is lost if the window is too long.
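The windowed dot plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names `dot_plot` and `render` are hypothetical, and the rule that every position in the window must match (unless `min_matches` is lowered) is an assumption chosen for simplicity.

```python
def dot_plot(seq_a, seq_b, window=1, min_matches=None):
    """Return the set of (i, j) cells that receive a dot.

    A dot is placed at (i, j) when the windows starting at seq_a[i] and
    seq_b[j] agree in at least `min_matches` positions.
    """
    if min_matches is None:
        min_matches = window  # assumption: require a perfect match in the window
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a down the side, seq_b across the top."""
    lines = ["  " + " ".join(seq_b)]
    for i, residue in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GATTACA", "GATCACA"
    print(render(a, b, dot_plot(a, b, window=1)))  # noisy: every single match
    print(render(a, b, dot_plot(a, b, window=3)))  # filtered by a 3-residue window
```

With `window=1` every chance match is drawn; with `window=3` only the two exactly matching three-residue stretches remain, which shows both the noise reduction and the loss of sensitivity around the single mismatch.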

Alignment of a Sequence with Itself

Why: to identify regions of internal repeat elements.

The matching residues form a perfect main diagonal.

If repeats are present, short diagonals appear above and below the main diagonal.

What can be found via dot plots using the self-complementarity of DNA sequences?

Inverted repeats.

The method can therefore be applied in genomics.

Another Problem of Dot Plots

The method lacks statistical rigor in assessing an alignment.

It is also restricted to pairwise alignment.

Programs for Dot Plots

Dotmatcher and Dottup

Dotmatcher

Aligns two input sequences in FASTA format and displays their dot plot.

Dottup

Aligns sequences based on the word method. Diagonal lines are drawn only where exact matches of words of a specified length are found.

Dynamic Programming

It is similar to dot plots; however, it incorporates scoring schemes and matrices into the alignment and into the assessment of the alignment. It searches for the alignment with the highest score, thus providing the best option.

Gaps

Represent deletions or insertions

Affine Gap Penalties

Differential gap penalties: a gap opening penalty and a gap extension penalty

The total gap penalty is a linear function of what?

The gap length
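As a small worked example of the affine scheme, the total penalty can be written as the opening penalty plus the extension penalty times the number of additional gap positions. This is one common convention (some tools charge the extension penalty for the first position as well), and the function name and default values below are illustrative assumptions.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Total penalty for one gap of `length` positions.

    Assumed convention: the first gap position costs `gap_open`,
    every further position costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, affine_gap_penalty(k))
```

Because the penalty grows linearly with the gap length, one long gap is penalized less than several short gaps of the same total length, each of which would pay the opening penalty again.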

Needleman-Wunsch Algorithm

Global alignment using dynamic programming
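A minimal sketch of global alignment by dynamic programming follows. It assumes simple match/mismatch scores and a linear gap penalty rather than a substitution matrix with affine gaps; the function name and the score values are illustrative choices, not taken from the cards.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill: each cell takes the best of diagonal (substitution), up, or left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATACA"))
```

The traceback always runs from the last cell to the first, which is what makes the alignment global: every residue of both sequences appears in the result.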

Dynamic Programming for Local Alignment

Smith-Waterman Algorithm

Positive scores are assigned to matches and negative scores to mismatches and gaps; any cell whose score would fall below zero is set to zero.
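A minimal sketch of local alignment, in the usual formulation where matches score positively, mismatches and gaps score negatively, and no cell of the matrix is allowed to drop below zero. The function name, the score values, and the linear gap penalty are illustrative assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere in the matrix.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the highest-scoring cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTGATTACAGGG", "CCGATTACACC"))
```

Starting the traceback at the maximum cell and stopping at zero is what restricts the result to the best-matching region instead of the full length of both sequences.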

Scoring Matrices

A scoring matrix, or substitution matrix, is used for the analysis of residue substitutions.

Observation in Substitution Matrices

1. Transitions (substitutions of a purine by a purine or of a pyrimidine by a pyrimidine) occur more frequently than transversions (substitutions between a purine and a pyrimidine).
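The distinction between the two substitution classes can be made concrete with a short helper. The function name is hypothetical, and the sketch handles only the four DNA bases A, C, G and T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(base_from, base_to):
    """Classify a DNA point substitution as 'transition' or 'transversion'."""
    pair = {base_from.upper(), base_to.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class on both sides means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


if __name__ == "__main__":
    print(classify_substitution("A", "G"))
    print(classify_substitution("A", "C"))
```

A nucleotide scoring matrix that reflects the observation above would give transitions a less negative score than transversions.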

Complexity of Scoring Matrices for Amino Acids

They are more complex, because amino acids are scored based on their physicochemical properties.

Amino Acid Scoring Matrices

20 × 20 matrices

There are two types:

a. one is based on the interchangeability of the genetic code or on amino acid properties

b. the other is derived from empirical studies of amino acid substitutions

The empirical amino acid scoring matrices are?

PAM and BLOSUM

They are derived from actual alignments of highly similar sequences.

How can a scoring system be developed?

By giving a high score to a more likely substitution and a low score to a rare substitution.

A positive score means?

The frequency of the substitution is higher than one would expect by random chance.

A score of zero?

The frequency is equal to that expected by random chance.

A negative score?

The frequency of the substitution is lower than one would expect by random chance.

Log-Odds Ratio

The logarithm of the observed mutation frequency divided by the substitution frequency one would expect by random chance.
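The sign rule for scores follows directly from the logarithm, as this small sketch shows. The scaling (twice the base-2 logarithm, rounded to an integer) is an assumption made for illustration, and the frequencies in the usage example are made-up numbers, not values from any real matrix.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Scaled, rounded log-odds score for one residue pair.

    Assumption: score = scale * log2(observed / expected), rounded.
    """
    return round(scale * math.log2(observed_freq / expected_freq))


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(log_odds_score(0.02, 0.01))   # observed more often than chance: positive
    print(log_odds_score(0.01, 0.01))   # observed as often as chance: zero
    print(log_odds_score(0.005, 0.01))  # observed less often than chance: negative
```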

PAM Matrices

Point Accepted Mutation Matrices


Point mutations that are accepted by natural selection

One PAM unit means?

1% of the amino acid positions have changed, i.e. one mutation per 100 residues.

PAM80 is produced by?

Multiplying PAM1 by itself 80 times.
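Extrapolating to a higher PAM number amounts to raising the PAM1 substitution probability matrix to the corresponding power. The sketch below uses a toy two-state matrix in place of the real 20 × 20 matrix; `TOY_PAM1` and its values are purely hypothetical, and the helper names are illustrative.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, exponent):
    """Multiply matrix `m` by itself `exponent` times (exponent >= 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Hypothetical two-residue stand-in for PAM1: each row gives the probability
# that a residue stays the same (0.99) or is replaced (0.01) per PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    toy_pam80 = mat_power(TOY_PAM1, 80)
    for row in toy_pam80:
        print([round(p, 4) for p in row])
```

Each row of the result still sums to one, but the probability of remaining unchanged drops well below the PAM1 value, which is why higher PAM numbers suit more divergent sequences.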

How is a PAM1 substitution table constructed?

A group of closely related sequences whose mutation frequencies correspond to one PAM unit is chosen.

BLOSUM Matrices

Blocks amino acid substitution matrices

The number in the name of a BLOSUM matrix is the percentage identity of the sequences selected for the construction of that matrix.

How were the blocks constructed?

Based on more than 2,000 conserved amino acid patterns representing 500 groups of protein sequences.

Blocks are ungapped alignments of fewer than 60 residues in length.

The frequencies of amino acid substitution among the residues in these blocks are calculated to produce the table.



Comparison between PAM and BLOSUM

1. PAM matrices, except PAM1, are derived from an evolutionary model, whereas BLOSUM matrices consist entirely of direct observations.

BLOSUM matrices may therefore have less evolutionary meaning than PAM matrices.

That is why PAM is used more often for constructing phylogenetic trees.

However, because of the mathematical extrapolation used for PAM matrices, they may be less realistic for more divergent sequences.

2. BLOSUM matrices are derived from local sequence alignments of conserved sequence blocks,

whereas PAM1 is based on the global alignment of full-length sequences composed of both conserved and variable regions.


P value

It is given to indicate the probability that the original alignment is due to random chance.

A value of less than 10^-100 indicates an exact match between the two sequences. A value between 10^-100 and 10^-50 indicates that the sequences are nearly identical. A value between 10^-5 and 10^-1 indicates distant homologs.

Heuristic Algorithms

BLAST and FASTA


50-100 times faster than dynamic programming


heuristic word methods

BLAST

Steps:

1. The query sequence is broken into words (three residues for protein sequences and eleven residues for DNA sequences).

2. The words are scanned for matches against the database sequences.

3. This includes words that match in only one or two letters.

4. The score of each match is calculated based on BLOSUM62.

5. Each match is extended in both directions until the score of the alignment drops below a threshold score.

6. The high-scoring segment pairs (HSPs) above the threshold score are determined.

In the original BLAST, the HSPs are presented in the final report and are called maximum scoring segment pairs.

In the newer, improved version, however, a gapped alignment is presented.
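The seed-and-extend idea behind these steps can be sketched as follows. This is a strong simplification and not the actual BLAST implementation: it seeds only on exact word matches, scores with fixed match/mismatch values instead of BLOSUM62, and stops extending once the score falls `drop_off` below the best score seen. All function names and parameter values are illustrative assumptions.

```python
from collections import defaultdict


def index_words(seq, size):
    """Map every word of length `size` to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - size + 1):
        table[seq[pos:pos + size]].append(pos)
    return table


def extend_hit(query, subject, q_pos, s_pos, size, match=1, mismatch=-2, drop_off=3):
    """Extend a word hit without gaps; return (best_score, q_start, q_end)."""
    score = best = size * match
    q_start, q_end = q_pos, q_pos + size  # q_end is exclusive
    i, j = q_pos + size, s_pos + size
    while i < len(query) and j < len(subject) and best - score < drop_off:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end = score, i
    score = best  # continue leftwards from the best right-hand extension
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0 and best - score < drop_off:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        i -= 1
        j -= 1
    return best, q_start, q_end


def find_hsps(query, subject, size=3, threshold=5):
    """Return the set of (score, query_segment) pairs scoring >= threshold."""
    table = index_words(subject, size)
    hsps = set()
    for q_pos in range(len(query) - size + 1):
        for s_pos in table.get(query[q_pos:q_pos + size], []):
            score, q_start, q_end = extend_hit(query, subject, q_pos, s_pos, size)
            if score >= threshold:
                hsps.add((score, query[q_start:q_end]))
    return hsps


if __name__ == "__main__":
    print(find_hsps("CCGATTACAGG", "TTGATTACATT"))
```

Several different seed words on the same diagonal extend to the same segment, so collecting the results in a set reports that segment only once.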

Statistical Significance in BLAST

It is presented as the E-value (expectation value), which is the number of alignments with an equal or better score that are expected to occur by random chance in a database search.

E = m × n × P

m: total number of residues in the database

n: number of residues in the query sequence

P: probability that the HSP is the result of random chance

E.g. aligning a query sequence of 100 residues against a database of 10^12 residues gives a P-value of 1 × 10^-20 for the ungapped HSP region in one of the database matches. The E-value is thus 10^12 × 100 × 10^-20 = 10^-6.

The lower the E-value, the more significant the match.
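The arithmetic of the example can be checked with a one-line function. The function and parameter names are illustrative; the formula is the product of database size, query length and P-value given on the card.

```python
def e_value(db_residues, query_residues, p_value):
    """E = m * n * P, with m the database size and n the query length."""
    return db_residues * query_residues * p_value


if __name__ == "__main__":
    # Values from the worked example: m = 10^12, n = 100, P = 10^-20.
    print(e_value(1e12, 100, 1e-20))
```

The same P-value therefore yields a larger, less significant E-value when the database or the query grows, because more comparisons give chance more opportunities.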

FASTA

FASTA uses a hashing strategy to find matches for short stretches of identical residues of length k, known as ktuples or ktups.

A ktup is 2 residues for a protein sequence and 6 residues for DNA: in other words, shorter than the words in BLAST.

Steps in FASTA

1. Identify ktups shared between the two sequences using the hashing strategy.

This works by constructing a table that shows the position of each ktup in the two sequences.

The positional difference for each word is obtained by subtracting its position in the first sequence from its position in the second sequence; this difference is called the offset.

2. Ktups with identical offset values belong to a contiguous identical sequence region that corresponds to a diagonal stretch in a two-dimensional matrix.

3. The ten highest-density diagonals are picked and are then scored via a substitution matrix.

4. Neighbouring high-scoring segments are joined together to form a single alignment. Gap penalties are incorporated when the resulting gapped alignment is scored again.

5. The alignment is then refined using the Smith-Waterman algorithm, and an E-score is computed for statistical evaluation.
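The first two steps, the ktup lookup table and the offsets, can be sketched with a dictionary as the hash table. This illustrates only the seeding idea, not the FASTA program itself; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktup_positions(seq, k):
    """Hash table mapping each ktup to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def diagonal_counts(seq1, seq2, k=2):
    """Count shared ktups per offset (position in seq2 minus position in seq1)."""
    table1 = ktup_positions(seq1, k)
    counts = Counter()
    for pos2 in range(len(seq2) - k + 1):
        for pos1 in table1.get(seq2[pos2:pos2 + k], []):
            counts[pos2 - pos1] += 1
    return counts


if __name__ == "__main__":
    counts = diagonal_counts("GATTACA", "CCGATTACA", k=2)
    # The offsets with the most shared ktups are the densest diagonals.
    print(counts.most_common(10))
```

All ktups of the shared region fall on the same offset, so the densest diagonal can be read straight off the counter without filling in a full matrix.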

Comparison of FASTA with BLAST

1. The seeding step

BLAST uses a substitution matrix to find matching words, while FASTA identifies identical matching words using the hashing method.

2. FASTA by default scans smaller window sizes, which gives it more sensitivity.

3. FASTA is slower than BLAST.

4. FASTA gives only one final alignment, while BLAST presents multiple best-scoring alignments.

Multiple Sequence Alignment

It allows the identification of conserved regions and motifs in a whole sequence family, and it is essential for phylogenetic analysis of sequence families and for prediction of protein secondary and tertiary structure.

Concept of MSA

The sequences are arranged in such a way that the maximum number of residues are matched up according to a particular scoring function.

The scoring function for MSA is based on what?

The sum of pairs (SP) score.

The SP score is the sum of the scores of all possible pairs of sequences in the multiple alignment, based on a specific scoring matrix. Each pair is scored as a pairwise alignment, taking matches, mismatches and gap costs into account.
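A minimal sketch of the sum-of-pairs score for an already aligned set of sequences. It assumes fixed match, mismatch and gap scores in place of a real scoring matrix, and it assumes that a pair of two gap characters scores zero; the function names and values are illustrative.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters ('-' denotes a gap)."""
    if x == "-" and y == "-":
        return 0  # assumption: a gap aligned with a gap is not penalized
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Sum the pair scores over every column and every pair of sequences."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += pair_score(x, y)
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["GAT", "GAT", "G-T"]))
```

In the usage example the two fully conserved columns contribute 3 each and the gapped column contributes 1 - 2 - 2 = -3, for a total of 3.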

Approaches of MSA

1. Heuristic algorithms

- Progressive alignment, iterative alignment and block-based alignment.

Progressive Alignment Method

A multistep process


It first conducts pairwise alignments based on the Needleman-Wunsch algorithm and then records the similarity scores.

How are additional sequences aligned via the PA method?

The two sequences that have already been aligned are converted to a consensus sequence with gap positions. This consensus is treated as a single sequence in the next step.

The best-known PA program

CLUSTAL

CLUSTALW: the W version provides a simple text-based interface.

Advantages of CLUSTAL

1. It does not use only one substitution matrix; instead it applies different scoring matrices when aligning the sequences.

The choice of matrix depends on the evolutionary distance measured from the guide tree.

E.g. for closely related sequences, CLUSTAL uses the BLOSUM62 or PAM120 matrix, but for more divergent sequences BLOSUM45 or PAM250 is preferred.

2. It uses adjustable gap penalties, which allow more deletions and insertions outside regions of conservation but fewer in conserved regions.

Disadvantages of PA Method

1. It is not suitable for the multiple alignment of sequences of different lengths, because it is a global alignment based method.

2. As a result of the use of affine gap penalties, long gaps are not allowed.

3. An optimal result at the end of the alignment cannot be guaranteed, because errors made in the initial alignments cannot be corrected once those alignments are done. Errors therefore build up with successive alignments.

Improvement on CLUSTAL

T-Coffee

It performs both local and global alignments.

Because an optimal alignment is chosen at the initial stage, T-Coffee avoids or minimizes errors in the early stages.

However, it is slower than CLUSTAL because of its high computational cost.

Evaluation of Alignments Obtained from PA

Editing: this involves introducing or removing gaps to maximize biologically meaningful matches, thus avoiding misaligned portions.

Program: BioEdit

Computational Approaches to Protein Three-Dimensional Structural Modeling and Prediction

1. Homology modeling 2. Threading 3. Ab initio prediction

The first two are knowledge-based: they model structures based on existing protein structural information in databases.

Homology Modeling Overview

Builds an atomic model based on an experimentally determined structure that is closely related at the sequence level.

Threading Overview

Identifies proteins that are structurally similar, with or without detectable sequence similarities.

Ab Initio Prediction Overview

Predicts and models structures based on the physicochemical principles that govern protein folding, without the use of structural templates.

Homology Modeling

also known as comparative modeling


Principle: if two sequences share a high sequence similarity, they are most likely to share the same 3D structure.

Steps in Homology Modeling

1. Template selection: identification of homologous sequences in a database that will be used for modeling

2. Alignment of the target and template sequences

3. Building a framework structure for the target protein consisting of main chain atoms

4. Addition and optimization of side chain atoms and loops

5. Energy optimization

6. Evaluation of the overall quality of the model

Template selection

It involves searching the Protein Data Bank for homologous proteins with determined structures.

This search can be performed using heuristic methods, e.g. BLAST or FASTA.

Rule of thumb: a database protein must have at least 30% sequence identity with the query sequence in order to be accepted as a template.

Sequence Alignment

Once the structure with the highest identity has been located, the full-length sequences of the target protein and the template need to be aligned using refined algorithms to obtain an optimal alignment.

This matters because an incorrect alignment leads to incorrect designation of residues and therefore to incorrect models.

T-Coffee is a suitable algorithm.

Loop Modeling

In the sequence alignment used for modeling, there are often regions of insertions and deletions, i.e. gaps in the alignment. Closing the gaps requires loop modeling, which is very difficult and also a major cause of errors.

Side chain Modeling

Once the model of the main chain atoms is built, the positions of the side chain atoms have to be determined.

Side chain geometry modeling is important in determining protein-ligand interactions at active sites.

A side chain can be built by searching every possible conformation at every torsional angle of the side chain and selecting the one with the lowest interaction energy with neighbouring atoms.

Most side chain modeling programs use the concept of rotamers, which are favored side chain torsional angles extracted from known protein crystal structures. Only the possible rotamers with the lowest energy are selected.

Model Refinement

1. Energy minimization

2. Molecular dynamics simulation

The simulation can be done in a vacuum or in a solvent.

Why are protein structures more conserved than protein sequences ?

Because there is only a small number of protein folds available in comparison with the numerous possible protein sequences.

As a result, some proteins share the same fold even in the absence of sequence similarity.

Threading or structural fold recognition

Predicts the structural fold of an unknown protein sequence by fitting the sequence into a structural database and selecting the best-fitting fold.

Methods: 1. Pairwise energy method

2. Profile method

Pairwise Energy Method

A protein sequence is searched against a structural fold database to find the best-matching structural fold using energy-based criteria.