Answers for Work-alone Exercises
1. We discussed the fact that, surprisingly, the information content of informational molecules (DNA or RNA) is quite low, since I = log₂(p₁/p₀) = log₂ 4 = 2 bits/base.
A. Using the same relationship, determine the information content contained in the 20 amino acids that constitute proteins.
Answer: I = log₂(p₁/p₀) = log₂ 20 = 4.3 bits/amino acid
B. Compare the value determined in part A with the value of 6 bits/amino acid when codons are considered (I = log₂ 64), and resolve this difference by suggesting additional factors to be considered in computing the information content of codons. [Shouldn’t they be the same, regardless of the approach?]
Answer: Whether determined on the basis of codons (64) or amino acids (20), the value for I should be the same and, in fact, it is (ca. 4 bits/amino acid). The “crude” estimates computed as log₂(p₁/p₀) deviate significantly from the “true value” because, in both instances, p₀ does not reflect events of equal probability. In the case of codons, for example, only 61 code for amino acids, and there is redundancy associated with synonymous codons, which is further complicated by codon preference, etc. For proteins, the major problem with p₀ is that some amino acids (e.g., Gly and Ala) are used frequently, whereas others (e.g., Met and Trp) are not.
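One way to see how redundancy pulls both crude estimates toward roughly 4 bits is to weight each amino acid by its number of synonymous codons in the standard genetic code. The sketch below assumes that all 61 sense codons are used equally often, which ignores codon preference and real amino acid usage, so it illustrates the effect rather than reproducing the calculation behind the answer. The names CODONS_PER_AA and shannon_bits are illustrative.

```python
from math import log2

# Number of synonymous codons per amino acid in the standard genetic code.
CODONS_PER_AA = {
    "Leu": 6, "Ser": 6, "Arg": 6,
    "Ala": 4, "Gly": 4, "Pro": 4, "Thr": 4, "Val": 4,
    "Ile": 3,
    "Phe": 2, "Tyr": 2, "His": 2, "Gln": 2, "Asn": 2,
    "Lys": 2, "Asp": 2, "Glu": 2, "Cys": 2,
    "Met": 1, "Trp": 1,
}


def shannon_bits(probabilities):
    """Shannon information in bits: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


crude_amino_acids = log2(20)  # assumes 20 equally likely amino acids
crude_codons = log2(64)       # assumes 64 equally likely codons

# Assumption: each of the 61 sense codons is equally likely, so an amino
# acid's probability is proportional to its number of codons.
total_sense_codons = sum(CODONS_PER_AA.values())
per_amino_acid = shannon_bits(
    n / total_sense_codons for n in CODONS_PER_AA.values()
)

print(f"log2(20)            = {crude_amino_acids:.2f} bits/amino acid")
print(f"log2(64)            = {crude_codons:.2f} bits/codon")
print(f"degeneracy-weighted = {per_amino_acid:.2f} bits/amino acid")
```

Under this assumption the weighted value comes out near 4.14 bits/amino acid, below both log₂ 20 and log₂ 64.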
2. Align CAAGT (i, vertical sequence) with CCAAT (j, horizontal sequence) using the dynamic programming approach. Populate the identity matrix using a value of 1 for a match and 0 for a mismatch; no gap penalty. Determine the scoring matrix and alignments by traceback.
Answer:
Identity matrix:
  | C | C | A | A | T |
C | 1 | 1 | 0 | 0 | 0 |
A | 0 | 0 | 1 | 1 | 0 |
A | 0 | 0 | 1 | 1 | 0 |
G | 0 | 0 | 0 | 0 | 0 |
T | 0 | 0 | 0 | 0 | 1 |
Scoring matrix:
  | C | C | A | A | T |
C | 1 | 1 | 0 | 0 | 0 |
A | 0 | 1 | 2 | 2 | 1 |
A | 0 | 1 | 2 | 3 | 2 |
G | 0 | 1 | 1 | 2 | 3 |
T | 0 | 1 | 1 | 2 | 4 |
Alignment 1:
C C A A – T
  | | |   |
. C A A G T
Alignment 2:
C C A A T
|   |   |
C A A G T
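The matrices above can be reproduced with a short sketch. The recurrence used here (each cell adds its identity value to the highest score found anywhere in the sub-matrix above and to the left of it) is an assumption inferred from the numbers in the scoring matrix; the exercise itself does not state the rule. The function names are illustrative.

```python
def identity_matrix(seq_i, seq_j, match=1, mismatch=0):
    """Rows follow seq_i (vertical), columns follow seq_j (horizontal)."""
    return [[match if a == b else mismatch for b in seq_j] for a in seq_i]


def scoring_matrix(ident):
    """Cumulative scores with no gap penalty.

    Assumed rule: score[i][j] = ident[i][j] + the maximum score in any
    cell with a smaller row index AND a smaller column index (0 if none).
    """
    rows, cols = len(ident), len(ident[0])
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            best_previous = max(
                (score[k][l] for k in range(i) for l in range(j)),
                default=0,
            )
            score[i][j] = ident[i][j] + best_previous
    return score


ident = identity_matrix("CAAGT", "CCAAT")
score = scoring_matrix(ident)

print("  " + " ".join("CCAAT"))
for base, row in zip("CAAGT", score):
    print(base, " ".join(str(value) for value in row))
```

The largest value (4, in the bottom-right cell) is where a traceback for the best-scoring alignment would start.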
3. The table below illustrates actual permissible amino acid substitutions for positions 75 through 81 of the lambda (λ) repressor:
Position: | Wild-type Residue: | Permissible Substitutions: |
75 | Glu | Asp, Gln, Ser, Thr, Ala |
76 | Phe | (none) |
77 | Ser | Ala |
78 | Pro | (none) |
79 | Ser | Arg, Lys, Asp, Gln, Glu, His, Tyr, Thr, Cys, Gly, Ala |
80 | Ile | Lys, Cys, Met, Leu |
81 | Ala | Ser |
A. On the basis of the PAM250 matrix, how likely are these substitutions?
Answer:
Position: | Wild-type Residue: | Permissible Substitutions: |
75 | Glu | Asp (3), Gln (1), Ser (0), Thr (0), Ala (0) |
76 | Phe | (none) |
77 | Ser | Ala (1) |
78 | Pro | (none) |
79 | Ser | Arg (0), Lys (0), Asp (0), Gln (-1), Glu (0), His (-1), Tyr (-3), Thr (1), Cys (0), Gly (0), Ala (1) |
80 | Ile | Lys (-2), Cys (-2), Met (2), Leu (2) |
81 | Ala | Ser (1) |
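The table above provides PAM250 values (in parentheses) for each permissible substitution. Please note that these are log-odds values. PAM250 entries are conventionally scaled as 10 × log10 of the odds ratio, so a difference of 10 units corresponds to a 10-fold difference in likelihood, and “1” is only about 1.26 times more likely than “0.”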
Position 75: The most likely substitution, Glu → Asp, is plausible since both residues are acidic. Interestingly, Glu → Gln is also relatively frequent, but the amide form of Asp, i.e., Asn, is not.
Position 76: There are no substitution mutants for Phe. In the context of protein structure, therefore, Phe must occupy a position with high spatial constraints.
Position 77: Only one substitution is found, Ser → Ala, both having small side-chains. Ser, therefore, also appears to be spatially constrained.
Position 78: There are no permissible substitutions for Pro. In this case, the unique structure of this imino acid is likely to be very significant.
Position 79: In contrast to Ser (Position 77), there are many substitution mutants observed for this position. Predominant among these are Thr, Gly, and Ala, all with small side-chains. It is also plausible to assume that this position is more spatially accessible than position 77.
Position 80: Four substitutions are permitted, in which Ile → Met and Ile → Leu predominate, all relatively hydrophobic residues.
Position 81: Same as for position 77.
B. Using single-letter amino acid codes, define residue positions 75 through 81 as a Prosite pattern.
Alternative:
[EDQSTA]-F-[SA]-P-{FILMNPVW}-[IKCML]-[AS]
C.
The
Prosite pattern for the prenyl
binding site is designated as: C-{DENQ}-[LIVM]-x>
“Translate” this designation
into plain English.
Answer: The prenyl
binding site consists of the last four residues of the C-terminus, i.e., Cys,
followed by any residue except Asp, Glu, Asn or Gln, followed by Leu, Ile, Val
or Met, and ending with any residue.
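Both patterns in parts B and C can be checked mechanically by translating them into regular expressions. The sketch below handles only the pattern elements used in this exercise (single letters, [..] sets, {..} exclusions, x, and the < and > anchors); repeat counts such as x(2) are not handled. The function name prosite_to_regex and the test strings are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    pieces = []
    for element in pattern.strip("<>").split("-"):
        if element == "x":
            pieces.append(".")                       # any residue
        elif element.startswith("{"):
            pieces.append("[^" + element[1:-1] + "]")  # any residue except these
        else:
            pieces.append(element)                   # a letter or a [..] set
    regex = "".join(pieces)
    if anchored_start:
        regex = "^" + regex
    if anchored_end:
        regex = regex + "$"
    return regex


repressor = prosite_to_regex("[EDQSTA]-F-[SA]-P-{FILMNPVW}-[IKCML]-[AS]")
prenyl = prosite_to_regex("C-{DENQ}-[LIVM]-x>")

print(repressor)  # [EDQSTA]F[SA]P[^FILMNPVW][IKCML][AS]
print(prenyl)     # C[^DENQ][LIVM].$

# Wild-type residues 75-81 (Glu-Phe-Ser-Pro-Ser-Ile-Ala) should match.
print(bool(re.search(repressor, "EFSPSIA")))   # True
# Phe at position 79 is excluded by {FILMNPVW}.
print(bool(re.search(repressor, "EFSPFIA")))   # False
# Hypothetical sequences: the motif must sit at the C-terminus.
print(bool(re.search(prenyl, "AAACVLS")))      # True
print(bool(re.search(prenyl, "CVLSAAA")))      # False
```

The all-or-nothing result of each search is the “either/or” behaviour of text-matching.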
D. Why is the weight matrix approach (e.g., BLOCKS) more sensitive in detecting motifs, as compared to text-matching (e.g., PROSITE)?
Answer: Text-matching algorithms employ an “either/or” scoring principle. PROSITE, for example, searches the submitted sequence one position at a time for each pattern in its database. If any position fails to match exactly, the pattern is rejected and the next pattern is searched, etc.
The position-specific weight matrix approach, by contrast, gains its increased sensitivity by assigning numerical values to each position within a potential pattern, and uses the sum of these values to decide whether the pattern exists in the submitted sequence, i.e., whether the sum exceeds a cutoff value. This type of algorithm, therefore, applies “fuzzy logic.”
4. From your sequencing of genomic DNA libraries prepared from post-mortem brain tissue samples obtained from average and creative people, you believe you have identified a short promoter region responsible for creativity. Preliminary data indicate that this sequence is unique to creative individuals and occurs approximately 150 bases upstream of the start signal for a specific class of mRNA transcripts. You now want to use your aligned sequences (below) to develop a consensus sequence from the weight matrix for database searching.
1 2 3 4 5
C A A G T
C C A A T
G A A G T
A A A T T
G C T G A
G C A A T
C G A A T
G A T A T
C C A G G
G A A A T
A. From the aligned sequences, prepare a frequency matrix for each position, 1 through 5.
      | 1 | 2 | 3 | 4 | 5 | Total |
A     | 1 | 5 | 8 | 5 | 1 | 20 |
C     | 4 | 4 | 0 | 0 | 0 | 8 |
G     | 5 | 1 | 0 | 4 | 1 | 11 |
T     | 0 | 0 | 2 | 1 | 8 | 11 |
Total | 10 | 10 | 10 | 10 | 10 | 50 |
B. From the frequency matrix, derive a weight matrix for each position.
  | 1 | 2 | 3 | 4 | 5 |
A | -1.39 | 0.223 | 0.693 | 0.223 | -1.39 |
C | 0.916 | 0.916 | -1.16 | -1.16 | -1.16 |
G | 0.82 | -0.79 | -1.48 | 0.60 | -0.79 |
T | -1.48 | -1.48 | -0.10 | -0.79 | 1.29 |
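The weight values are consistent with a natural-log odds ratio: the frequency of a base at a position divided by the overall frequency of that base across the whole alignment, with a count of 0 replaced by 0.5 so that the logarithm is defined. That rule is inferred from the numbers in the two tables and is not stated in the exercise, so the sketch below should be read as one reconstruction. The constant ZERO_COUNT and the variable names are illustrative.

```python
from math import log

SEQUENCES = [
    "CAAGT", "CCAAT", "GAAGT", "AAATT", "GCTGA",
    "GCAAT", "CGAAT", "GATAT", "CCAGG", "GAAAT",
]
BASES = "ACGT"
ZERO_COUNT = 0.5  # assumed stand-in for a count of 0 (inferred from the table)

n_seqs = len(SEQUENCES)
width = len(SEQUENCES[0])

# Frequency matrix: counts[base][position]
counts = {
    base: [sum(seq[i] == base for seq in SEQUENCES) for i in range(width)]
    for base in BASES
}

# Overall frequency of each base across all positions (row total / 50)
background = {base: sum(counts[base]) / (n_seqs * width) for base in BASES}

# Weight = ln( (count / number of sequences) / overall base frequency )
weights = {
    base: [
        log((count if count else ZERO_COUNT) / n_seqs / background[base])
        for count in counts[base]
    ]
    for base in BASES
}

for base in BASES:
    print(base, counts[base], [round(w, 2) for w in weights[base]])
```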
C. From the weight matrix, provide an ambiguous consensus sequence, as well as a consensus using international codes for ambiguous base assignments.
1 | 2 | 3 | 4 | 5 |
G/C | A/C | A | A/G | T |
S | M | A | R | T |
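The consensus above can be reproduced from the weight matrix with a simple rule: at each position, keep every base whose weight is positive, then look up the IUPAC ambiguity code for that set. The positive-weight threshold is an assumption that happens to match the answer; the exercise does not specify the cutoff. The weights are copied from the table in part B.

```python
# Weight matrix values copied from part B (rows: base, columns: positions 1-5).
WEIGHTS = {
    "A": [-1.39, 0.223, 0.693, 0.223, -1.39],
    "C": [0.916, 0.916, -1.16, -1.16, -1.16],
    "G": [0.82, -0.79, -1.48, 0.60, -0.79],
    "T": [-1.48, -1.48, -0.10, -0.79, 1.29],
}

# IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def consensus(weights, threshold=0.0):
    """Return (list of favoured base sets, IUPAC consensus string)."""
    width = len(weights["A"])
    favoured_sets = []
    for i in range(width):
        # Assumed rule: a base is favoured when its weight exceeds the threshold.
        favoured = "".join(b for b in "ACGT" if weights[b][i] > threshold)
        favoured_sets.append(favoured)
    codes = "".join(IUPAC.get(s, "N") for s in favoured_sets)
    return favoured_sets, codes


favoured_sets, iupac_consensus = consensus(WEIGHTS)
print(" ".join("/".join(s) for s in favoured_sets))  # C/G A/C A A/G T
print(iupac_consensus)                               # SMART
```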
D. Use the weight matrix to determine the location of the 5-base promoter in the 20-base sequence below from a creative individual. Use a “window” of 5 bases and an increment of one. For i = 1 to 16, determine each score by summation (i, i + 1, i + 2, i + 3, i + 4) and divide the total by the window size (5 in this instance).
1   5   10 11  15  20
|   |    | |   |    |
GGTCGGAGCA GTTTAATGGT
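The sliding-window search described in part D can be sketched as follows, using the rounded weights from the table in part B. The function name scan is illustrative. With these values the highest mean score falls on the window that starts at position 8 (GCAGT, bases 8 through 12), which also fits the consensus from part C.

```python
# Weight matrix values copied from part B (rows: base, columns: positions 1-5).
WEIGHTS = {
    "A": [-1.39, 0.223, 0.693, 0.223, -1.39],
    "C": [0.916, 0.916, -1.16, -1.16, -1.16],
    "G": [0.82, -0.79, -1.48, 0.60, -0.79],
    "T": [-1.48, -1.48, -0.10, -0.79, 1.29],
}


def scan(sequence, weights, window=5):
    """Score every window; return (1-based start, segment, mean weight)."""
    results = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        total = sum(weights[base][k] for k, base in enumerate(segment))
        results.append((start + 1, segment, total / window))
    return results


target = "GGTCGGAGCA" + "GTTTAATGGT"  # the 20-base sequence from part D
scores = scan(target, WEIGHTS)

for start, segment, mean_score in scores:
    print(f"i = {start:2d}  {segment}  {mean_score:6.3f}")

best = max(scores, key=lambda result: result[2])
print("best window:", best[0], best[1], round(best[2], 3))
```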
5. In Michael Crichton’s fictional account, Jurassic Park, InGen’s chief scientist, Dr. Henry Wu, provides a segment of dinosaur DNA and a 100 base portion is reproduced below:
1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC CCCCCTGACG AGCATCACAA
Using this sequence as a database query, determine its true origin. [Recommendation: BLASTN search using www.ncbi.nlm.nih.gov/]
Answer: The “dinosaur” DNA is actually a vector sequence. Not very imaginative!!
6. The following problem is loosely derived from a real-life phylogenetic study of molluscs. The results that you derive, however, are not meant to imply the true nature of the evolutionary relationships among the three species described here!
A 300 bp region of the D-loop of the 28S ribosomal RNA gene was sequenced for 3 species of molluscs: the bivalve Mytilus edulis, the cephalopod Nautilus pompilius, and the gastropod Truncatella pulchella. The observed differences in sequences for the species are shown below:
Mytilus edulis vs. Nautilus pompilius: 23 bp out of 300
Nautilus pompilius vs. Truncatella pulchella: 45 bp out of 300
Truncatella pulchella vs. Mytilus edulis: 45 bp out of 300
Calculate D for each pair of species and apply the values to a distance matrix.
Calculate K from the Jukes-Cantor equation and draw a cladogram showing branch distances.
Answer:
Let M. edulis be species A, N. pompilius be species B, and T. pulchella be species C. Then:
D for A vs. B = 23/300 = 0.077
D for B vs. C = 45/300 = 0.150
D for A vs. C = 45/300 = 0.150
A distance matrix would look like this:
  | A | B | C |
A | * | 0.077 | 0.150 |
B | 0.077 | * | 0.150 |
C | 0.150 | 0.150 | * |
And the K values from the Jukes-Cantor equation would look like:
K (A vs. B) = -3/4 ln [1 - 4/3 (0.077)] = 0.0812
K (B vs. C) = -3/4 ln [1 - 4/3 (0.150)] = 0.167
K (A vs. C) = -3/4 ln [1 - 4/3 (0.150)] = 0.167
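The same arithmetic can be checked with a few lines of code. This sketch rounds D to three decimals before applying the Jukes-Cantor correction, as the worked answer does; using the unrounded 23/300 gives a slightly smaller K for A vs. B (about 0.0809). The function name jukes_cantor is illustrative.

```python
from math import log


def jukes_cantor(d):
    """Jukes-Cantor corrected distance: K = -3/4 * ln(1 - 4/3 * D)."""
    return -0.75 * log(1 - (4 / 3) * d)


SEQUENCE_LENGTH = 300
DIFFERENCES = {("A", "B"): 23, ("B", "C"): 45, ("A", "C"): 45}

k_values = {}
for pair, n_diff in DIFFERENCES.items():
    d = round(n_diff / SEQUENCE_LENGTH, 3)  # rounded as in the worked answer
    k_values[pair] = jukes_cantor(d)
    print(f"{pair[0]} vs. {pair[1]}: D = {d:.3f}, K = {k_values[pair]:.4f}")
```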
Finally, a reasonable cladogram might look like this:
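One common way to turn the three K values into branch distances is UPGMA-style averaging: join the closest pair at half their distance, then attach the remaining species at half its mean distance to that pair. The exercise does not say which tree-building method was intended, so treat the sketch below as an assumption; the function name upgma_three is hypothetical.

```python
def upgma_three(dist):
    """UPGMA for exactly three taxa.

    dist maps frozenset({x, y}) -> distance. Returns the closest pair,
    the remaining taxon, the height of the pair's node, and the root height.
    """
    pair = min(dist, key=dist.get)
    (outgroup,) = {taxon for key in dist for taxon in key} - pair
    pair_height = dist[pair] / 2
    mean_to_outgroup = sum(dist[frozenset({t, outgroup})] for t in pair) / len(pair)
    root_height = mean_to_outgroup / 2
    return sorted(pair), outgroup, pair_height, root_height


K = {
    frozenset("AB"): 0.0812,
    frozenset("BC"): 0.167,
    frozenset("AC"): 0.167,
}

closest, outgroup, pair_height, root_height = upgma_three(K)
print(f"{closest[0]} and {closest[1]} join at {pair_height:.4f}")
print(f"internal branch to root: {root_height - pair_height:.4f}")
print(f"{outgroup} branch to root: {root_height:.4f}")
```

Under this assumption A (M. edulis) and B (N. pompilius) group together, with C (T. pulchella) as the more distant branch.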
7. GENOMIC BIOLOGY
Go to NCBI and request GenBank accession AB035301;
you’ll see that this is the sequence for cadherin 7. Go to SMART (http://smart.embl-heidelberg.de)
and enter the accession number above and press the “Sequence Smart” button.
Wait to see the modular domain structure illustration. Go to the “Display all
proteins with similar domain composition” to see a listing of more cadherins with
links to their SWISSPROT entries. Go back to the modular illustration and click
on one of the CA domains. Then click on “Evolution” at the top of the page to
view information about the representation of CA domains in other species.
Finally, go back and click on the “Structure” link at the top of the page for
links to PDB 3-D structures in the Molecular Modeling Database (MMDB) at NCBI.
If you download their new viewer, you will be able to see the structure of
molecules that contain the CA domain. Pretty neat!
Answer: Provided as output for the final
Computer Laboratory Exercise.