RNA Structure and Prediction
Computational Molecular Biology (BIO502)
M. Nelson and S. Istrail
RNA folding
- RNA is transcribed (or synthesized) in cells as single strands of (ribose) nucleic acids. However, these sequences are not simply long strands of nucleotides. Rather, intra-strand base pairing will produce structures such as the one shown below.
In RNA, guanine and cytosine pair (GC) by forming three hydrogen bonds, and adenine and uracil pair (AU) by forming two hydrogen bonds; additionally, guanine and uracil can form a weaker "wobble" base pair (GU) with two hydrogen bonds.
The stability of a particular secondary structure is a function of several constraints:
- The number of GC versus AU and GU base pairs. (Higher energy bonds form more stable structures.)
- The number of base pairs in a stem region. (Longer stems result in more bonds.)
- The number of bases in a hairpin loop region. (Formation of loops with more than 10 or fewer than 5 bases requires more energy.)
- The number of unpaired bases, whether in interior loops or bulges. (Unpaired bases decrease the stability of the structure.)
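The first and last factors above can be tallied directly from a sequence and a list of paired positions. The following is a minimal sketch in Python; the function name `tally_pairs` and the 0-based `(i, j)` pair representation are assumptions made for illustration and do not come from any folding package.

```python
from collections import Counter

# Watson-Crick and wobble pairs, keyed by the unordered pair of bases.
PAIR_TYPES = {
    frozenset("GC"): "GC",
    frozenset("AU"): "AU",
    frozenset("GU"): "GU",
}


def tally_pairs(seq, pairs):
    """Count GC, AU and GU pairs and unpaired bases.

    seq   -- RNA sequence over A, C, G, U
    pairs -- iterable of 0-based (i, j) index tuples (hypothetical input format)
    """
    seq = seq.upper()
    counts = Counter()
    paired = set()
    for i, j in pairs:
        kind = PAIR_TYPES.get(frozenset((seq[i], seq[j])))
        if kind is None:
            raise ValueError(
                f"{seq[i]}-{seq[j]} at ({i}, {j}) is not a GC, AU or GU pair"
            )
        counts[kind] += 1
        paired.update((i, j))
    counts["unpaired"] = len(seq) - len(paired)
    return dict(counts)


# A three-pair stem closing a three-base hairpin loop.
print(tally_pairs("GGGAAAUCC", [(0, 8), (1, 7), (2, 6)]))
```

Such counts only summarize a structure; they are not a free energy, which depends on the empirical parameters discussed next.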
- To compute the minimum free energy of a sequence, empirical energy parameters are used. These parameters summarize the free energy change (positive or negative) associated with all possible pairing configurations, including base pair stacks and internal base pairs; internal, bulge, and hairpin loops; and various motifs which are known to occur with great frequency. Zuker has online tables of free energy and enthalpy values for various motifs.
- Four major classes of RNA exist, and can be found in most organisms:
- mRNA - messenger RNA, a sequence which codes for the formation of one or more proteins.
- tRNA - transfer RNA, small (~80 bases) sequences which bring amino acids to the ribosome, where they are used to translate mRNA into amino acid sequences.
- rRNA - ribosomal RNA, sequences which form ribosomes (along with ribosomal proteins). (You can read more about the first three classes by clicking here.)
- viral RNA (You can see some viral RNA structures here and here.)
- It is important to note that most RNA folding algorithms predict only secondary, rather than tertiary, structure. The three-dimensional shape of the molecule is important to molecular function, but is harder to predict. This is because tertiary structure is known from crystallography for only tRNA sequences (as illustrated at the top of this page). Secondary structure is usually considered a sufficient approximation until more is known about the tertiary structure of RNA.
Predicting RNA secondary structure
- Several representations of secondary structure have been utilized, each with different advantages. The planar graph representation shown above gives an intuition for the shape of an RNA sequence, but the same structure could also be represented in string notation. In string notation, balanced parentheses are used to indicate paired bases, and periods are used to indicate unpaired bases. The secondary structure in the above figure is given as ((((((((((((((....)))))))))))))) in string notation. For a discussion of the advantages of string notation, and examples of other representation schemes, see Hofacker et al. (1995) and Gruner et al. (1995).
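String notation is easy to convert back into a list of base pairs with a stack: each opening parenthesis is pushed, and each closing parenthesis pairs with the most recently pushed position. A minimal sketch, assuming 0-based positions; the function name `parse_dot_bracket` is chosen here for illustration.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of 0-based (i, j) pairs from a string-notation structure."""
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            # The closing parenthesis pairs with the nearest open one.
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..))"))  # [(0, 5), (1, 4)]
```

Because the parentheses must balance, this notation can only describe nested (non-crossing) base pairs.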
- The number of possible secondary structures (S) of n bases with k base pairs is given as
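Independently of any closed-form expression, these counts can be obtained by direct recursion: the last base is either unpaired, or it pairs with an earlier base and splits the sequence into an enclosed part and a part to the left. The sketch below assumes that a hairpin loop must contain at least `min_loop` unpaired bases (set to 1 here); that constraint and the function name `count_structures` are assumptions for illustration, and sequence composition is ignored (any two positions may pair).

```python
from functools import lru_cache

MIN_LOOP = 1  # assumed minimum number of unpaired bases in a hairpin loop


@lru_cache(maxsize=None)
def count_structures(n, k, min_loop=MIN_LOOP):
    """Number of secondary structures on n bases with exactly k base pairs."""
    if k < 0:
        return 0
    if n <= 0:
        return 1 if k == 0 else 0
    # Case 1: the last base is unpaired.
    total = count_structures(n - 1, k, min_loop)
    # Case 2: the last base (position n) pairs with position i,
    # enclosing n - i - 1 bases and leaving i - 1 bases to the left.
    if k >= 1:
        for i in range(1, n - min_loop):
            left, inside = i - 1, n - i - 1
            for a in range(k):
                total += (count_structures(left, a, min_loop)
                          * count_structures(inside, k - 1 - a, min_loop))
    return total


for k in range(3):
    print(k, count_structures(5, k))
```

For five bases this gives one structure with no pairs, six with one pair, and one with two pairs, which can be checked by hand.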
- A number of strategies for predicting secondary structure have been developed. Gruner et al. provide a taxonomy of folding algorithms, and references for each algorithm. Their table is summarized here (an asterisk marks an algorithm that can predict pseudo-knots).
- The Waterman algorithm
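The following is not Waterman's method. It is a minimal sketch of a simpler, related dynamic program, Nussinov-style base-pair maximization, which shows the general shape of such recursions: the best score for a subsequence is built from the best scores of shorter subsequences. It maximizes the number of GC, AU, and GU pairs rather than minimizing free energy. The minimum hairpin loop size of 3 and the name `fold_max_pairs` are assumptions for illustration.

```python
CAN_PAIR = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def fold_max_pairs(seq, min_loop=3):
    """Return (number of pairs, string notation) for one maximum-pair structure."""
    seq = seq.upper()
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i:j] (j exclusive).
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(min_loop + 2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score = best[i][j - 1]  # last base of the interval is unpaired
            # Or the last base pairs with k, enclosing at least min_loop bases.
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in CAN_PAIR:
                    score = max(score, best[i][k] + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n)]
    while stack:
        i, j = stack.pop()
        if j - i < min_loop + 2:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - 1 - min_loop):
            if ((seq[k], seq[j - 1]) in CAN_PAIR
                    and best[i][j] == best[i][k] + 1 + best[k + 1][j - 1]):
                structure[k], structure[j - 1] = "(", ")"
                stack.append((i, k))
                stack.append((k + 1, j - 1))
                break
    return best[0][n], "".join(structure)


print(fold_max_pairs("GGGAAAUCC"))  # (3, '(((...)))')
```

The table has O(n^2) entries and each entry examines O(n) split points, so the sketch runs in O(n^3) time. Energy-based algorithms replace the pair count with loop and stacking energies but keep the same recursive structure.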
- Now that we can find the minimum free energy structure of a sequence in computationally tractable time, we should ask "What does the optimum tell us?" That is, there may be more than one structure with the optimum free energy, or there may be many structures within 5% to 10% of the minimum free energy, and these may be topologically very different. A minimum energy folding algorithm will return only one secondary structure, though there are many candidates for the natural structure. To address this, some software packages (such as Zuker's mfold) will display a number of suboptimal folds. Inferring which structure is truly representative of the natural structure requires additional information. Phylogenetic information is often used to constrain the search by identifying highly conserved motifs. Some programs allow the user to specify constraints on the secondary structure, by specifying paired, single-stranded, or non-pairable regions, or by actively participating in the folding process.
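The point that an optimum need not be unique can be made concrete with a toy scoring model. The sketch below counts how many distinct structures achieve the maximum number of base pairs; it uses pair counts as a stand-in for free energy, so it illustrates the ambiguity rather than reproducing what mfold reports. The minimum hairpin loop size of 3 and the name `count_optimal` are assumptions for illustration.

```python
CAN_PAIR = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def count_optimal(seq, min_loop=3):
    """Return (maximum number of pairs, number of structures achieving it)."""
    seq = seq.upper()
    n = len(seq)
    # best[i][j]: maximum pairs within seq[i:j]; ways[i][j]: structures attaining it.
    best = [[0] * (n + 1) for _ in range(n + 1)]
    ways = [[1] * (n + 1) for _ in range(n + 1)]
    for length in range(min_loop + 2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Each structure falls in exactly one case: the last base is
            # unpaired, or it pairs with one specific position k.
            score, count = best[i][j - 1], ways[i][j - 1]
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in CAN_PAIR:
                    s = best[i][k] + 1 + best[k + 1][j - 1]
                    c = ways[i][k] * ways[k + 1][j - 1]
                    if s > score:
                        score, count = s, c
                    elif s == score:
                        count += c
            best[i][j], ways[i][j] = score, count
    return best[0][n], ways[0][n]


print(count_optimal("GGAAAC"))     # (1, 2): either G can pair with the C
print(count_optimal("GGGAAAUCC"))  # (3, 1): a single optimal stem
```

Even the six-base sequence has two equally good structures under this model, so a program that returns one answer is making an arbitrary choice between them.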
- Of course, existing folding algorithms rest on a number of limiting assumptions. They generally neglect the kinetics of folding during transcription, the role of chaperone proteins in folding, and the importance of modified bases (e.g. inosine or methylated bases), and they have difficulty predicting pseudo-knots. Some algorithms attempt to incorporate these considerations (e.g. see Abrahams et al. for predicting pseudo-knots). At best, RNA folding algorithms are first-order approximations used to infer the natural structure of a known sequence.
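A pseudo-knot arises when two base pairs cross: for pairs (i, j) and (k, l), the positions satisfy i < k < j < l. Recursions that split a sequence into independent nested intervals cannot produce such pairs, which is one reason pseudo-knots are hard to predict. A minimal sketch of the crossing test, with the function name `has_pseudoknot` chosen for illustration:

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalize each pair to (smaller, larger) and sort by the opening position.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


print(has_pseudoknot([(0, 5), (1, 4)]))  # False: the pairs are nested
print(has_pseudoknot([(0, 4), (2, 6)]))  # True: the pairs cross
```

The pairwise comparison is quadratic in the number of pairs, which is adequate for checking a single predicted structure.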
Related Sites
- RNA world at IMB Jena. This page contains links to databases and software, information about meetings, and a number of search utilities.
- A list of RNA related sites, compiled by Cambridge University Press.
- Image library of biological macromolecules, at IMB Jena has illustrations of molecular structure.
- Molecules R US, maintained by NIH, has a fancy interface to structural information in the Brookhaven Protein Database. You can use the interface to view molecules by a number of methods.
- Michael Zuker's RNA page
- M. Zuker's interactive mfold server will fold sequences online.
- The Vienna RNA ftp site is located here.
- A list of folding software links, for a variety of platforms.
- Abstracts at the Institute for Theoretical Chemistry, in Vienna
- Abstracts at the Santa Fe Institute
References
Abrahams, J.P., M. van den Berg, E. van Batenburg, and C. Pleij. 1990. Prediction of RNA secondary structure, including pseudo-knotting by computer simulation. Nucleic Acids Research 18:3035-3044.
Gesteland, R.F., and J.F. Atkins, eds. 1993. The RNA World. Cold Spring Harbor Laboratory Press. TOC can be found here.
Gruner, W., R. Giegerich, D. Strothmann, C. Reidys, J. Weber, I. Hofacker, P. Stadler, and P. Schuster. Analysis of RNA sequence structure maps by exhaustive enumeration. Santa Fe Institute Preprint 95-10-099. Click here for the abstract or here for a postscript version of the paper.
Jaeger, J.A., D.H. Turner and M. Zuker. 1990. Predicting optimal and suboptimal secondary structure for RNA, in "Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences", R.F. Doolittle ed., Methods in Enzymology 183, 281-306.
McCaskill, J.S. 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105-19.
Stolorz, P., M. Huynen, I. Hofacker, and P. Stadler. RNA folding on massively parallel computers. Santa Fe Institute preprint 95-10-089. Click here for the abstract or here for a postscript version of the paper.
Turner, D.H., N. Sugimoto, and S.M. Freier. 1988. RNA structure prediction. Ann. Rev. Biophys. and Biophys. Chem. 17: 167-192.
Waterman, M.S., and T.H. Byers. 1985. A dynamic programming algorithm to find all solutions in a neighborhood of the optimum. Mathematical Biosciences, 77, 179-188.
Williams, A.L., and I. Tinoco, Jr. 1986. A dynamic programming algorithm for finding alternate RNA secondary structures. Nucleic Acids Research 14:299-315.
Zuker, M. 1989. On finding all suboptimal foldings of an RNA molecule. Science 244:48-52.
Lecture notes compiled by P. Hraber, May 1996
Please send comments, additions, and corrections to him.