Answers for Work-alone Exercises

 

 


1.     We discussed the fact that, surprisingly, the information content of informational molecules (DNA or RNA) is quite low, since I = log2 (p1/p0) = log2 4 = 2 bits/base.

 

A.   Using the same relationship, determine the information content contained in the 20 amino acids that constitute proteins.

 

Answer:  I = log2 (p1/p0) = log2 20 = 4.3 bits/amino acid

 

B.    Compare the value determined in part A with the value of 6 bits/amino acid when codons are considered (I = log2 64), and resolve this difference by suggesting additional factors to be considered in computing the information content of codons. [Shouldn’t they be the same, regardless of the approach?]

 

Answer: Whether determined on the basis of codons (64) or amino acids (20), the value for I should be the same and, in fact, it is (ca. 4 bits/amino acid). The “crude” estimates computed as log2 (p1/p0) deviate significantly from the “true value” because, in both instances, p0 does not reflect events of equal probability. In the case of codons, for example, only 61 code for amino acids, and there is redundancy associated with synonymous codons, which is further complicated by codon preference, etc. For proteins, the major problem with p0 is the fact that some amino acids (e.g., Gly and Ala) are used frequently, whereas others (e.g., Met and Trp) are not.
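As a rough illustration of why unequal probabilities lower the information content, the sketch below compares the crude log2 estimates with the Shannon entropy of a non-uniform distribution. The function names and the skewed usage frequencies are hypothetical, chosen only to show the direction of the effect; they are not measured amino acid frequencies.

```python
import math

def crude_information(n_states):
    """Crude estimate I = log2(n), assuming all n states are equally likely."""
    return math.log2(n_states)

def shannon_entropy(probabilities):
    """Average information in bits for an arbitrary probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Crude estimates used in the exercise
bits_per_base = crude_information(4)
bits_per_amino_acid = crude_information(20)
bits_per_codon = crude_information(64)

# Hypothetical skewed usage (an assumption for illustration only):
# 5 "common" residues at 10% each, 15 "rare" residues sharing the other 50%.
skewed = [0.10] * 5 + [0.5 / 15] * 15
skewed_bits = shannon_entropy(skewed)

print(f"base: {bits_per_base:.2f}  amino acid: {bits_per_amino_acid:.2f}  "
      f"codon: {bits_per_codon:.2f}")
print(f"skewed 20-letter alphabet: {skewed_bits:.2f} bits")
```

Any departure from equal usage gives a value below log2 20, which is the direction of the correction described in the answer.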

 

2.     Align CAAGT (i, vertical sequence) with CCAAT (j, horizontal sequence) using the dynamic programming approach. Populate the identity matrix using a value of 1 for a match and 0 for a mismatch; no gap penalty. Determine the scoring matrix and alignments by traceback.

 

Answer:

 

 

Identity matrix:

     C  C  A  A  T
C    1  1  0  0  0
A    0  0  1  1  0
A    0  0  1  1  0
G    0  0  0  0  0
T    0  0  0  0  1

 

 

 

Scoring matrix:

     C  C  A  A  T
C    1  1  0  0  0
A    0  1  2  2  1
A    0  1  2  3  2
G    0  1  1  2  3
T    0  1  1  2  4

 

 

 

Alignment 1 (from traceback):

C C A A – T
  | | |   |
. C A A G T

 

 

Alignment 2 (from traceback):

C C A A T
|   |   |
C A A G T
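The scoring matrix in this answer appears consistent with the original Needleman-Wunsch style of summation, in which each cell adds its own identity score to the largest score found in the preceding row (to its left) and the preceding column (above it), with no gap penalty. The sketch below is written under that assumption; the function names are hypothetical and traceback is not implemented.

```python
def identity_matrix(vertical, horizontal):
    """1 for a match, 0 for a mismatch; rows follow the vertical sequence."""
    return [[1 if a == b else 0 for b in horizontal] for a in vertical]

def score_matrix(ident):
    """Forward summation with no gap penalty.

    Each cell = identity score + best score among the cells of the previous
    row lying to the left and the cells of the previous column lying above.
    """
    n_rows, n_cols = len(ident), len(ident[0])
    score = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            best = 0
            if i > 0 and j > 0:
                prev_row = score[i - 1][:j]
                prev_col = [score[k][j - 1] for k in range(i)]
                best = max(prev_row + prev_col)
            score[i][j] = ident[i][j] + best
    return score

ident = identity_matrix("CAAGT", "CCAAT")
scores = score_matrix(ident)
for base, row in zip("CAAGT", scores):
    print(base, *row)
```

The largest value, in the lower right cell, is the number of matched positions in the best alignment and is where traceback would start.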

 

3.     The table below illustrates actual permissible amino acid substitutions for positions 75 through 81 of the lambda (λ) repressor:

 

Position   Wild-type Residue   Permissible Substitutions
75         Glu                 Asp, Gln, Ser, Thr, Ala
76         Phe                 (none)
77         Ser                 Ala
78         Pro                 (none)
79         Ser                 Arg, Lys, Asp, Gln, Glu, His, Tyr, Thr, Cys, Gly, Ala
80         Ile                 Lys, Cys, Met, Leu
81         Ala                 Ser

 

A.   On the basis of the PAM250 matrix, how likely are these substitutions?

 

Answer:

Position   Wild-type Residue   Permissible Substitutions
75         Glu                 Asp (3), Gln (1), Ser (0), Thr (0), Ala (0)
76         Phe                 (none)
77         Ser                 Ala (1)
78         Pro                 (none)
79         Ser                 Arg (0), Lys (0), Asp (0), Gln (-1), Glu (0), His (-1), Tyr (-3), Thr (1), Cys (0), Gly (0), Ala (1)
80         Ile                 Lys (-2), Cys (-2), Met (2), Leu (2)
81         Ala                 Ser (1)

 

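The table above provides PAM250 values (in parentheses) for each permissible substitution. Please note that these are log-odds values: in Dayhoff's scaling (10 × log10 of the odds ratio), a substitution scoring “1” is about 1.26 times more likely than one scoring “0,” and a difference of 10 corresponds to a tenfold change in likelihood.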

 

Position 75: The most likely substitution, Glu → Asp, is plausible since both residues are acidic. Interestingly, Glu → Gln is also relatively frequent, but the amide form of Asp, i.e., Asn, is not.

 

Position 76: There are no substitution mutants for Phe. In the context of protein structure, therefore, Phe must occupy a position with high spatial constraints.

 

Position 77: Only one substitution is found, Ser → Ala, both having small side-chains. Ser, therefore, also appears to be spatially constrained.

 

Position 78: There are no permissible substitutions for Pro. In this case, the unique structure of this imino acid is likely to be very significant.

 

Position 79: In contrast to Ser (Position 77), there are many substitution mutants observed for this position. Predominant among these are Thr, Gly and Ala, all with small side-chains. It is also plausible to assume that this position is more spatially accessible than position 77.

 

Position 80: Four substitutions are permitted, in which Ile → Met and Ile → Leu predominate, all relatively hydrophobic residues.

 

Position 81: Same as for position 77.

 

 

B.    Using single-letter amino acid codes, define residue positions 75 through 81 as a Prosite pattern.

 

Answer:  [EDQSTA]-F-[SA]-P-[SRKDQEHYTCGA]-[IKCML]-[AS]

 

Alternative: [EDQSTA]-F-[SA]-P-{FILMNPVW}-[IKCML]-[AS]

 

C.   The Prosite pattern for the prenyl binding site is designated as: C-{DENQ}-[LIVM]-x>

“Translate” this designation into plain English.

 

Answer: The prenyl binding site consists of the last four residues of the C-terminus, i.e., Cys, followed by any residue except Asp, Glu, Asn or Gln, followed by Leu, Ile, Val or Met, and ending with any residue.
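One way to check a reading of PROSITE syntax is to translate the pattern into a regular expression and try it on short test strings. The sketch below is a simplified, hypothetical converter (the name `prosite_to_regex` and the test peptides are not from PROSITE); it handles only the elements used in parts B and C plus simple repeat counts, and is not a complete implementation of the PROSITE pattern language.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Supported: single residues, [ABC] (any of), {ABC} (none of), x (any),
    repeat counts such as x(2) or x(2,4), and the < and > terminal anchors.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element                   # single letter or [ABC]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

repressor = prosite_to_regex("[EDQSTA]-F-[SA]-P-[SRKDQEHYTCGA]-[IKCML]-[AS]")
prenyl = prosite_to_regex("C-{DENQ}-[LIVM]-x>")

print(repressor, bool(re.fullmatch(repressor, "EFSPSIA")))  # wild-type 75-81
print(prenyl, bool(re.search(prenyl, "MKTACVLS")))          # made-up peptide
```

The wild-type residues 75 through 81 (EFSPSIA) satisfy the part B pattern, and the prenyl pattern only matches when the four residues sit at the very end of the sequence.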

 

D.   Why is the weight matrix approach (e.g., BLOCKS) more sensitive in detecting motifs, as compared to text-matching (e.g., PROSITE)?

 

Answer: Text-matching algorithms employ an “either/or” scoring principle. PROSITE, for example, searches the submitted sequence one position at a time for each pattern in its database. If any position fails to match exactly, the pattern is rejected and the next pattern is searched, etc.

     The position-specific weight matrix approach, by contrast, gains its increased sensitivity by assigning numerical values to each position within a potential pattern, and uses the sum of these values to decide whether the pattern exists in the submitted sequence, i.e., whether the sum exceeds a cutoff value. This type of algorithm, therefore, applies “fuzzy logic.”

 

4.     From your sequencing of genomic DNA libraries prepared from post-mortem brain tissue samples obtained from average and creative people, you believe you have identified a short promoter region responsible for creativity. Preliminary data indicate that this sequence is unique to creative individuals and occurs approximately 150 bases upstream of the start signal for a specific class of mRNA transcripts. You now want to use your aligned sequences (below) to develop a consensus sequence from the weight matrix for database searching.

1 2 3 4 5
C A A G T
C C A A T
G A A G T
A A A T T
G C T G A
G C A A T
C G A A T
G A T A T
C C A G G
G A A A T

 

A.   From the aligned sequences, prepare a frequency matrix for each position, 1 through 5.

 

 

        1    2    3    4    5   Total
A       1    5    8    5    1      20
C       4    4    0    0    0       8
G       5    1    0    4    1      11
T       0    0    2    1    8      11
Total  10   10   10   10   10      50

 

B.    From the frequency matrix, derive a weight matrix for each position.

 

 

       1        2        3        4        5
A   -1.39    0.223    0.693    0.223   -1.39
C    0.916   0.916   -1.16    -1.16    -1.16
G    0.82   -0.79    -1.48     0.60    -0.79
T   -1.48   -1.48    -0.10    -0.79     1.29
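The tabulated weights appear to be reproduced by taking the natural log of the observed frequency at a position divided by the overall frequency of that base in the alignment, with a count of zero replaced by 0.5. That formula is inferred from the numbers rather than stated in the exercise, so the sketch below should be read as an assumption; the function names are hypothetical.

```python
import math

SEQUENCES = ["CAAGT", "CCAAT", "GAAGT", "AAATT", "GCTGA",
             "GCAAT", "CGAAT", "GATAT", "CCAGG", "GAAAT"]
BASES = "ACGT"

def frequency_matrix(seqs):
    """Count each base at each aligned position."""
    length = len(seqs[0])
    return {b: [sum(s[i] == b for s in seqs) for i in range(length)]
            for b in BASES}

def weight_matrix(freq, n_seqs, zero_count=0.5):
    """ln(observed frequency / background frequency) for each base and position.

    Background = overall frequency of the base in the whole alignment.
    A count of 0 is replaced by zero_count (an assumed pseudocount).
    """
    total = n_seqs * len(freq["A"])
    weights = {}
    for b in BASES:
        background = sum(freq[b]) / total
        weights[b] = [math.log((max(c, zero_count) / n_seqs) / background)
                      for c in freq[b]]
    return weights

freq = frequency_matrix(SEQUENCES)
weights = weight_matrix(freq, len(SEQUENCES))
for b in BASES:
    print(b, freq[b], [round(w, 2) for w in weights[b]])
```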

 

C.   From the weight matrix, provide an ambiguous consensus sequence, as well as a consensus using international codes for ambiguous base assignments.

 

Position:             1     2     3     4     5
Ambiguous consensus:  G/C   A/C   A     A/G   T
International codes:  S     M     A     R     T
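The consensus above is what results from keeping, at each position, every base with a positive weight and then looking up the standard IUPAC ambiguity code for that set of bases. The sketch below assumes that selection rule (the exercise does not state one) and uses hypothetical function names; the weights are copied from the part B table.

```python
WEIGHTS = {
    "A": [-1.39, 0.223, 0.693, 0.223, -1.39],
    "C": [0.916, 0.916, -1.16, -1.16, -1.16],
    "G": [0.82, -0.79, -1.48, 0.60, -0.79],
    "T": [-1.48, -1.48, -0.10, -0.79, 1.29],
}

# Standard IUPAC nucleotide ambiguity codes, keyed by sorted base set
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def favoured_bases(weights, position):
    """Bases whose weight at this position is positive (assumed rule)."""
    return "".join(sorted(b for b in weights if weights[b][position] > 0))

def consensus(weights):
    """Return the per-position base sets and the IUPAC consensus string."""
    length = len(next(iter(weights.values())))
    groups = [favoured_bases(weights, i) for i in range(length)]
    return groups, "".join(IUPAC.get(g, "N") for g in groups)

groups, code = consensus(WEIGHTS)
print("/".join(groups[0]), "/".join(groups[1]), groups[2],
      "/".join(groups[3]), groups[4])
print(code)
```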

 

D.   Use the weight matrix to determine the location of the 5-base promoter in the 20-base sequence below from a creative individual. Use a “window” of 5 bases and an increment of one. For i = 1 to 16, determine each score by summation (i, i + 1, i + 2, i + 3, i + 4) and divide the total by the window size (5 in this instance).

1   5    10 11  15   20
|   |    |  |   |    |
GGTCGGAGCA  GTTTAATGGT
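The window scan described in part D can be sketched as follows, using the weights from the part B table. The function name and the tuple layout are hypothetical; positions are reported 1-based to match the ruler above.

```python
WEIGHTS = {
    "A": [-1.39, 0.223, 0.693, 0.223, -1.39],
    "C": [0.916, 0.916, -1.16, -1.16, -1.16],
    "G": [0.82, -0.79, -1.48, 0.60, -0.79],
    "T": [-1.48, -1.48, -0.10, -0.79, 1.29],
}

def window_scores(seq, weights, window=5):
    """Slide a window along seq in steps of one base.

    For each start position i, sum the weights of bases i .. i+window-1
    (each taken from its column of the matrix) and divide by the window size.
    Returns a list of (1-based start, window sequence, mean score).
    """
    results = []
    for start in range(len(seq) - window + 1):
        total = sum(weights[seq[start + k]][k] for k in range(window))
        results.append((start + 1, seq[start:start + window], total / window))
    return results

sequence = "GGTCGGAGCAGTTTAATGGT"
scores = window_scores(sequence, WEIGHTS)
for start, bases, score in scores:
    print(f"{start:2d}  {bases}  {score:6.3f}")

best = max(scores, key=lambda item: item[2])
print("best window starts at position", best[0], "->", best[1])
```

With these weights the highest-scoring window begins at position 8 (GCAGT, bases 8 through 12), which agrees with the SMART consensus from part C.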

 

 

5.     In Michael Crichton’s fictional account, Jurassic Park, InGen’s chief scientist, Dr. Henry Wu, provides a segment of dinosaur DNA and a 100 base portion is reproduced below:

 

 1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC CCCCCTGACG AGCATCACAA

51 AAATCGACGC GGTGGCGAAA CCCGACAGGA CTATAAAGAT ACCAGGCGTT

 

Using this sequence as a database query, determine its true origin. [Recommendation: BLASTN search using www.ncbi.nlm.nih.gov/]

 

Answer: The “dinosaur” DNA is actually a vector sequence. Not very imaginative!!

 

 

6.     The following problem is loosely derived from a real-life phylogenetic study of molluscs. The results that you derive, however, are not meant to imply the true nature of the evolutionary relationships among the three species described here!

 

A 300bp region of the D-loop of the 28S ribosomal RNA gene was sequenced for 3 species of molluscs: the bivalve Mytilus edulis, the cephalopod Nautilus pompilius, and the gastropod Truncatella pulchella. The observed differences in sequences for the species are shown below:

 

Mytilus edulis vs. Nautilus pompilius          23 bp out of 300
Nautilus pompilius vs. Truncatella pulchella   45 bp out of 300
Truncatella pulchella vs. Mytilus edulis       45 bp out of 300

 

Calculate D for each species and apply the values to a distance matrix.

 

Calculate K from the Jukes-Cantor equation and draw a cladogram showing branch distances.

 

Answer:

 

Let M. edulis be species A, N. pompilius species B, and T. pulchella species C. Then:

D (A vs. B) = 23/300 = 0.077
D (B vs. C) = 45/300 = 0.150
D (A vs. C) = 45/300 = 0.150

 

A distance matrix would look like this:

 

 

     A       B       C
A    *       0.077   0.150
B    0.077   *       0.150
C    0.150   0.150   *

 

 

And the K values from the Jukes-Cantor equation would look like:

 

K (A vs. B) = -3/4 ln [1 - 4/3 (0.077)] = 0.0812
K (B vs. C) = -3/4 ln [1 - 4/3 (0.150)] = 0.167
K (A vs. C) = -3/4 ln [1 - 4/3 (0.150)] = 0.167
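The same arithmetic can be checked with a few lines of code. This is a minimal sketch; the function name and the dictionary of pairwise differences are hypothetical, and D is rounded to three decimal places before applying the correction, as in the worked answer.

```python
import math

def jukes_cantor(d):
    """Jukes-Cantor distance K = -3/4 ln(1 - 4/3 D) for observed difference D."""
    if not 0 <= d < 0.75:
        raise ValueError("D must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - (4 / 3) * d)

SITES = 300
differences = {("A", "B"): 23, ("B", "C"): 45, ("A", "C"): 45}

k_values = {}
for pair, n_diff in differences.items():
    d = round(n_diff / SITES, 3)          # 0.077 or 0.150
    k_values[pair] = jukes_cantor(d)
    print(f"{pair[0]} vs. {pair[1]}: D = {d:.3f}, K = {k_values[pair]:.4f}")
```

K is always slightly larger than D because the correction allows for multiple substitutions at the same site.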

 

Finally, a reasonable cladogram might look like this:

 

 

7. GENOMIC BIOLOGY

Go to NCBI and request GenBank accession AB035301; you’ll see that this is the sequence for cadherin 7. Go to SMART (http://smart.embl-heidelberg.de), enter the accession number above, and press the “Sequence Smart” button. Wait for the modular domain structure illustration to appear. Follow “Display all proteins with similar domain composition” to see a listing of more cadherins with links to their SWISSPROT entries. Go back to the modular illustration and click on one of the CA domains. Then click on “Evolution” at the top of the page to view information about the representation of CA domains in other species. Finally, go back and click on the “Structure” link at the top of the page for links to PDB 3-D structures in the Molecular Modeling Database (MMDB) at NCBI. If you download their new viewer, you will be able to see the structure of molecules that contain the CA domain. Pretty neat!

 

Answer: Provided as output for the final Computer Laboratory Exercise.