8.3 Nucleic Acid Chemistry

Double-Helical DNA and RNA Can Be Denatured

Solutions of carefully isolated, native DNA are highly viscous at pH 7.0 and room temperature $(25 ° C)$. When such a solution is subjected to extremes of pH or to temperatures above $80 ° C$, its viscosity decreases sharply, indicating that the DNA has undergone a physical change. Just as heat and extremes of pH denature globular proteins, they also cause denaturation, or melting, of double-helical DNA. Disruption of the hydrogen bonds between paired bases and of base-stacking interactions causes unwinding of the double helix to form two single strands, completely separate from each other along the entire length or part of the length (partial denaturation) of the molecule. No covalent bonds in the DNA are broken (Fig. 8-26).

A figure shows how double helical D N A can undergo denaturation to become partially or fully denatured and how separate strands can join and undergo annealing to form double helical D N A. — FIGURE 8-26 Reversible denaturation and annealing (renaturation) of DNA.

When the temperature or pH is returned to the range in which most organisms live, the unwound segments of the two strands spontaneously rewind, or anneal, to yield the intact duplex (Fig. 8-26). However, if the two strands are completely separated, renaturation occurs in two steps. In the first, relatively slow step, the two strands “find” each other by random collisions and form a short segment of complementary double helix. The second step is much faster: the remaining unpaired bases successively come into register as base pairs, and the two strands “zipper” themselves together to form the double helix.

The close interaction between stacked bases in a nucleic acid has the effect of decreasing its absorption of UV light relative to that of a solution with the same concentration of free nucleotides, and the absorption is decreased further when two complementary nucleic acid strands are paired. This is called the hypochromic effect. Denaturation of a double-stranded nucleic acid produces the opposite result: an increase in absorption called the hyperchromic effect. The transition from double-stranded DNA to the denatured, single-stranded form can thus be detected by monitoring UV absorption at 260 nm.
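The hyperchromic shift can be turned into a rough estimate of how much of a sample has melted. The sketch below assumes a simple two-state model in which absorbance at 260 nm rises linearly from a native baseline to a fully denatured plateau; the function name and the absorbance readings are hypothetical and chosen only for illustration.

```python
def fraction_denatured(a260, a260_native, a260_denatured):
    """Estimate the single-stranded fraction of a DNA sample from its A260.

    Assumes a two-state model in which absorbance rises linearly from the
    native (fully paired) baseline to the denatured (fully unpaired) plateau.
    """
    span = a260_denatured - a260_native
    if span <= 0:
        raise ValueError("denatured A260 must exceed native A260 (hyperchromic effect)")
    fraction = (a260 - a260_native) / span
    # Clamp small excursions caused by measurement noise outside the baselines.
    return min(1.0, max(0.0, fraction))


if __name__ == "__main__":
    # Hypothetical readings: native baseline 1.00, fully denatured plateau 1.40.
    for reading in (1.00, 1.10, 1.20, 1.40):
        print(reading, round(fraction_denatured(reading, 1.00, 1.40), 2))
```

Real melting curves often have sloping baselines, so a careful analysis would fit those baselines rather than treat them as constants.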

Viral or bacterial DNA molecules in solution denature when they are heated slowly (Fig. 8-27). Each species of DNA has a characteristic denaturation temperature, or melting point ($t_{m}$; formally, the temperature at which half the DNA is present as separated single strands): the higher its content of $G ≡ C$ base pairs, the higher the melting point of the DNA. This is primarily because, as we saw earlier, $G ≡ C$ base pairs make greater contributions to base stacking than do $A = T$ base pairs. Thus, the melting point of a DNA molecule, determined under fixed conditions of pH and ionic strength, can yield an estimate of its base composition. If denaturation conditions are carefully controlled, regions that are rich in $A = T$ base pairs will denature while most of the DNA remains double-stranded. Such denatured regions (called bubbles) can be visualized with electron microscopy (Fig. 8-28). In the strand separation of DNA that occurs in vivo during processes such as DNA replication and transcription, the site where strand separation is initiated is often rich in $A = T$ base pairs, as we shall see.
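As a minimal sketch of how a melting point and a base-composition estimate might be read from data, the code below interpolates the temperature at 50% denaturation and then inverts a linear relation between $t_{m}$ and G+C content. The data points are illustrative values, not measurements. The default intercept and slope are the commonly quoted coefficients of an empirical fit for one standard saline buffer; they are an assumption here rather than something stated in the text, and they apply only under fixed pH and ionic strength.

```python
def melting_temperature(temps_c, percent_denatured):
    """Linearly interpolate the temperature at which 50% of the DNA is denatured."""
    points = list(zip(temps_c, percent_denatured))
    for (t0, d0), (t1, d1) in zip(points, points[1:]):
        # Look for the first rising segment that brackets the 50% level.
        if d1 > d0 and d0 <= 50.0 <= d1:
            return t0 + (50.0 - d0) * (t1 - t0) / (d1 - d0)
    raise ValueError("melting curve never crosses 50% denaturation")


def gc_percent_from_tm(tm_c, intercept=69.3, slope=0.41):
    """Invert t_m = intercept + slope * (%G+C); coefficients are assumed defaults."""
    return (tm_c - intercept) / slope


if __name__ == "__main__":
    temps = [78.0, 81.0, 83.0, 84.0, 90.0]   # degrees Celsius (illustrative)
    denatured = [0.0, 5.0, 50.0, 90.0, 99.0]  # percent denatured (illustrative)
    tm = melting_temperature(temps, denatured)
    print("t_m:", tm)
    print("estimated %G+C:", round(gc_percent_from_tm(tm), 1))
```

Because $t_{m}$ also depends on pH, ionic strength, and DNA size, the estimate is meaningful only when samples are compared under identical conditions.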

A two-part figure, a and b, shows the melting curves of two D N A specimens and the relationship between $t_{m}$ and the G+C content of D N A.

FIGURE 8-27 Heat denaturation of DNA. (a) The denaturation, or melting, curves of two DNA specimens. The temperature at the midpoint of the transition ($t_{m}$) is the melting point; it depends on pH and ionic strength and on the size and base composition of the DNA. (b) Relationship between $t_{m}$ and the G+C content of a DNA. [(b) Data from J. Marmur and P. Doty, *J. Mol. Biol.* 5:109, 1962.]


Part a shows a graph with temperature in degrees Celsius on the horizontal axis, ranging from below 75 to above 85 and labeled in increments of 5, and with denaturation as a percentage on the vertical axis, ranging from 0 to 100 and labeled in increments of 50. Two curves are shown that both run horizontally along the horizontal axis at first. One strand begins to rise slowly above the horizontal axis at 78 degrees, then begins to increase very rapidly at (81, 5) before beginning to level off at (84, 90) and ending at (90, 99). A dotted vertical line extends up from the horizontal axis to meet the line at (83, 50), which is labeled lowercase t subscript m. The second curve begins to rise above the horizontal axis at 82.5 degrees, then begins to rise even more rapidly than the first curve starting at (84, 8). It begins to level off above the first curve at (86, 95) and ends at (90, 96). A vertical line extends up from the horizontal axis to meet the second curve at (85, 50) and is labeled lowercase t subscript m. Part b is a graph with lowercase t subscript m in degrees Celsius on the horizontal axis, ranging from 60 to 110 and labeled in increments of 10, and with G plus C (percentage of total nucleotides) on the vertical axis, ranging from 0 to 100 and labeled in increments of 20. There is a point at (65, 0). A diagonal line begins at the horizontal axis at (68, 0) and runs diagonally up to (108, 80). Many points are clustered on the line and just below the line between (80, 30) and (101, 70). All data are approximate.

A micrograph shows a long, windy strand with several places marked with arrows where the strand opens to form a circular structure before becoming a single line again.

FIGURE 8-28 Partially denatured DNA. This DNA was partially denatured, then fixed to prevent renaturation during sample preparation. Although the shadowing method used to visualize the DNA in this electron micrograph obliterates many details, single-stranded and double-stranded regions are readily distinguishable. The arrows point to some single-stranded bubbles where denaturation has occurred. The regions that denature are highly reproducible and are rich in $A = T$ base pairs.


Duplexes of two RNA strands or of one RNA strand and one DNA strand (RNA-DNA hybrids) can also be denatured. Notably, RNA duplexes are more stable to heat denaturation than DNA duplexes. At neutral pH, denaturation of a double-helical RNA often requires temperatures at least $20 ° C$ higher than those required for denaturation of a DNA molecule with a comparable sequence, assuming that the strands in each molecule are perfectly complementary. The stability of an RNA-DNA hybrid is generally intermediate between that of RNA and DNA duplexes. The physical basis for these differences in thermal stability is not known.

WORKED EXAMPLE 8-1 DNA Base Pairs and DNA Stability

In samples of DNA isolated from two unidentified species of bacteria, X and Y, adenine makes up 32% and 17%, respectively, of the total bases. What relative proportions of adenine, guanine, thymine, and cytosine would you expect to find in the two DNA samples? What assumptions have you made? One of these species was isolated from a hot spring $(64 ° C)$. Which species is most likely the thermophilic bacterium, and why?

SOLUTION:

For any double-helical DNA, A = T and G = C. The DNA from species X has 32% A and therefore must contain 32% T. This accounts for 64% of the bases and leaves 36% as $G ≡ C$ pairs: 18% G and 18% C. The sample from species Y, with 17% A, must contain 17% T, accounting for 34% of the bases. The remaining 66% of the bases are thus equally distributed as 33% G and 33% C. This calculation is based on the assumption that both DNA molecules are double-stranded.

The higher the G+C content of a DNA molecule, the higher the melting temperature. Species Y, having the DNA with the higher G+C content (66%), most likely is the thermophilic bacterium; its DNA has a higher melting temperature and thus is more stable at the temperature of the hot spring.
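The arithmetic in this solution can be sketched in a few lines. The function below is a hypothetical helper, not part of the text, and it carries the same assumption as the solution: the DNA is double-stranded, so A = T and G = C.

```python
def base_composition(percent_a):
    """Return the percentage of each base in double-stranded DNA, given %A.

    Assumes A = T and G = C, which holds only for double-helical DNA.
    """
    if not 0 <= percent_a <= 50:
        raise ValueError("%A must lie between 0 and 50 in double-stranded DNA")
    percent_g = (100.0 - 2 * percent_a) / 2  # remaining bases split evenly into G and C
    return {"A": percent_a, "T": percent_a, "G": percent_g, "C": percent_g}


if __name__ == "__main__":
    for species, percent_a in (("X", 32), ("Y", 17)):
        comp = base_composition(percent_a)
        print(species, comp, "G+C =", comp["G"] + comp["C"])
```

Run on the two samples, the helper reproduces the G+C contents of 36% for species X and 66% for species Y used in the comparison above.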

Nucleotides and Nucleic Acids Undergo Nonenzymatic Transformations

Purines and pyrimidines, along with the nucleotides of which they are a part, undergo spontaneous alterations in their covalent structure. The rate of these reactions is generally very slow, but they are physiologically significant because of the cell’s very low tolerance for alterations in its genetic information. Alterations in DNA structure that produce permanent changes in the genetic information encoded therein are called mutations. In higher organisms, much evidence suggests an intimate link between the accumulation of mutations in an individual and the processes of aging and carcinogenesis.

Several nucleotide bases undergo spontaneous loss of their exocyclic amino groups (deamination) (Fig. 8-29a). For example, under typical cellular conditions, deamination of cytosine (in DNA) to uracil occurs in about one of every $10^{7}$ cytidine residues in 24 hours. This rate of deamination corresponds to about 100 spontaneous events per day, on average, in a mammalian cell. Deamination of adenine and guanine occurs at about 1/100th this rate.

A two-part figure, a and b, shows deamination of cytosine, 5-methylcytosine, adenine, and guanine and the depurination of a guanosine residue. — FIGURE 8-29 Some well-characterized nonenzymatic reactions of nucleotides. (a) Deamination reactions. Only the base is shown. (b) Depurination, in which a purine is lost by hydrolysis of the N-β-glycosyl bond. Loss of pyrimidines through a similar reaction occurs, but much more slowly. The resulting lesion, in which the deoxyribose is present but the base is not, is called an abasic site or an AP site (apurinic site or, rarely, apyrimidinic site). The deoxyribose remaining after depurination is readily converted from the β-furanose to the aldehyde form (see Fig. 8-3), further destabilizing the DNA at this position. More nonenzymatic reactions are illustrated in Figures 8-30 and 8-31.

Part a shows deamination. Cytosine has a six-membered ring with N in position 1 at the bottom vertex and further bonded below; C 2 double bonded to O, highlighted N in position 3, highlighted C 4 bonded to highlighted N H 2 above, a highlighted double bond between position 3 and position 4, and a double bond between position 5 and position 6. An arrow pointing to the right shows uracil as a similar molecule except that the highlighted N at position 3 is now highlighted N H, highlighted C 4 is double bonded to highlighted O, and the highlighted bond between positions 3 and 4 is now a single bond. 5-methylcytosine has a similar structure to cytosine except that C 5 is bonded to C H 3. An arrow points to the right to show that it can become thymine, which has a similar structure except that the highlighted N at position 3 is now highlighted N H, highlighted C 4 is double bonded to highlighted O, and the highlighted double bond between positions 3 and 4 is now a highlighted single bond. Adenine is a six-membered ring joined with a five-membered ring by its right side. The six-membered ring has highlighted N at position 1 connected by a highlighted double bond to highlighted C 6 that is bonded to highlighted N H 2. The molecule also has N at position 3, a double bond between positions 2 and 3, and a double bond between C 4 and C 5 that is shared with the five-membered ring. The five-membered ring has N in position 7, C H in position 8, N in position 9 that is further bonded below, and a double bond between positions 7 and 8. An arrow points to the right to show that it can become hypoxanthine, which has a similar structure except that highlighted N at position 1 is bonded to highlighted H, there is a single highlighted bond between N 1 and C 6, and highlighted C 6 is connected by a highlighted double bond to highlighted O.
Guanine has a similar structure to adenine except that N at position 1 is bonded to H and not highlighted, C at position 6 is double bonded to O and not highlighted, N at position 3 is highlighted, the double bond between position 2 and position 3 is highlighted, and C 2 is highlighted and bonded to highlighted N H 2. An arrow points to the right to show that it can become xanthine, which has a similar structure except that highlighted C 2 is connected by a highlighted double bond to highlighted O, the bond between positions 2 and 3 is highlighted, and highlighted N in position 3 is bonded to highlighted H. Part b shows depurination. A guanosine residue in D N A is shown. It has a five-membered sugar ring with O at the top vertex; C 1 prime bonded to N at position 9 of guanine above the ring and H below the ring; C 2 prime bonded to H above and below the ring; C 3 prime bonded to H above the ring and O H below the ring; C 4 prime bonded to C 5 prime above the ring and H below the ring, and C 5 prime bonded to 2 H and to an O outside of the shaded box that is further bonded to P that is bonded to O minus above, double bonded to O below, and bonded to O minus to the left. Guanine has a six-membered ring joined by its right side to a five-membered ring. The six-membered ring has N H at position 1, N at position 3, C at position 2 bonded to N H 2, C at position 6 double bonded to O, a double bond between positions 2 and 3, and a double bond between positions 4 and 5 that is shared with the five-membered ring. The five-membered ring has N at position 7, N at position 9 that is bonded to C 1 prime of the sugar below, and a double bond between position 7 and position 8. A reaction occurs in which H 2 O is added to produce guanine and an apurinic residue. Guanine has a similar structure as before except that N 9 is bonded to H instead of to C 1 prime of the sugar. The apurinic residue contains the sugar and phosphate with C 1 prime now bonded to O H above instead of to N 9 of guanine.

The slow cytosine deamination reaction seems innocuous enough, but it is almost certainly the reason why DNA contains thymine rather than uracil. The product of cytosine deamination (uracil) is readily recognized as foreign in DNA and is removed by a repair system (Chapter 25). If DNA normally contained uracil, recognition of uracils resulting from cytosine deamination would be more difficult, and unrepaired uracils would lead to permanent sequence changes as they were paired with adenines during replication. Cytosine deamination would gradually lead to a decrease in $G ≡ C$ base pairs and an increase in $A = U$ base pairs in the DNA of all cells. Over the millennia, cytosine deamination could eliminate $G ≡ C$ base pairs and the genetic code that depends on them. Establishing thymine as one of the four bases in DNA may well have been one of the crucial turning points in evolution, making the long-term storage of genetic information possible.
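The consequence of an unrepaired uracil can be traced with a toy model. The sketch below is illustrative only: the sequence and helper names are hypothetical, replication is reduced to copying a complementary strand, and repair is assumed to be absent. It shows how a single deaminated cytosine becomes a permanent C-to-T change after two rounds of copying.

```python
# Pairing rules used when copying a template; uracil pairs like thymine.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def replicate(template):
    """Return the new strand (written 5'->3') copied from a 5'->3' template."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def deaminate(strand, position):
    """Convert the cytosine at the given index to uracil."""
    if strand[position] != "C":
        raise ValueError("only cytosine can be deaminated to uracil")
    return strand[:position] + "U" + strand[position + 1:]


if __name__ == "__main__":
    original = "GACGT"                   # hypothetical 5'->3' strand
    damaged = deaminate(original, 2)     # the C at index 2 becomes U
    daughter = replicate(damaged)        # A is inserted opposite U
    granddaughter = replicate(daughter)  # T now sits where C used to be
    print(original, damaged, daughter, granddaughter)
```

In the final strand the original cytosine position holds thymine, so the former $G ≡ C$ pair has been replaced by an A-T pair.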

Another important reaction in deoxyribonucleotides is the hydrolysis of the N-β-glycosyl bond between the base and the pentose. The base is lost, creating a DNA lesion called an AP (apurinic, apyrimidinic) site or abasic site (Fig. 8-29b). Purines are lost at a higher rate than pyrimidines. As many as one in $10^{5}$ purines (10,000 per mammalian cell) are lost from DNA every 24 hours under typical cellular conditions. Depurination of ribonucleotides and RNA is much slower and less physiologically significant. In the test tube, loss of purines can be accelerated by dilute acid. Incubation of DNA at pH 3 causes selective removal of the purine bases, resulting in a derivative called apurinic acid.
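The per-residue rates and per-cell counts quoted for deamination and depurination can be checked against each other with simple arithmetic. The sketch below uses hypothetical helper names and only the figures given in the text; the pool sizes it prints are order-of-magnitude values implied by those figures, not measured genome compositions.

```python
def residues_implied(events_per_cell_per_day, fraction_altered_per_day):
    """Number of susceptible residues implied by an event count and a per-residue rate."""
    return events_per_cell_per_day / fraction_altered_per_day


def expected_events(residues_per_cell, fraction_altered_per_day):
    """Expected number of lesions per cell per day for a pool of residues."""
    return residues_per_cell * fraction_altered_per_day


if __name__ == "__main__":
    # Cytosine deamination: about 1 in 10^7 residues, about 100 events per cell per day.
    cytosines = residues_implied(100, 1e-7)
    # Depurination: up to 1 in 10^5 purines, about 10,000 events per cell per day.
    purines = residues_implied(10_000, 1e-5)
    print(f"implied cytosines per cell: {cytosines:.1e}")
    print(f"implied purines per cell:   {purines:.1e}")
    # Adenine and guanine deaminate at about 1/100th the cytosine rate.
    print(f"purine deaminations per day: {expected_events(purines, 1e-7 / 100):.1f}")
```

Both quoted pairs of numbers imply on the order of $10^{9}$ susceptible residues per cell, so the two estimates are mutually consistent.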

Other reactions are promoted by radiation. UV light induces the condensation of two ethylene groups to form a cyclobutane ring. In the cell, the same reaction between adjacent pyrimidine bases in nucleic acids forms cyclobutane pyrimidine dimers. This happens most frequently between adjacent thymidine residues on the same DNA strand (Fig. 8-30). A second type of pyrimidine dimer, called a 6-4 photoproduct, is also formed during UV irradiation. Ionizing radiation (x-rays and gamma rays) can cause ring opening and fragmentation of bases as well as breaks in the covalent backbone of nucleic acids.
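Because these photoproducts form between neighboring pyrimidines on the same strand, candidate sites can be listed by scanning a sequence for adjacent pyrimidines. The sketch below is a hypothetical helper with a made-up example sequence; it flags positions only and says nothing about how likely a dimer is to form at each one.

```python
PYRIMIDINES = {"C", "T"}


def dipyrimidine_sites(sequence):
    """Return (index, dinucleotide) for every pair of adjacent pyrimidines."""
    seq = sequence.upper()
    return [
        (i, seq[i:i + 2])
        for i in range(len(seq) - 1)
        if seq[i] in PYRIMIDINES and seq[i + 1] in PYRIMIDINES
    ]


if __name__ == "__main__":
    # Hypothetical single strand, written 5'->3'; indices are 0-based.
    for index, pair in dipyrimidine_sites("AGTTCGCTA"):
        print(index, pair)
```

Adjacent thymines (TT) are the sites where, according to the text, cyclobutane dimers form most frequently.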

A two-part figure, a and b, shows two ways in which adjacent thymine residues can form a dimer when exposed to U V light and how the presence of thymine dimers affects the three-dimensional structure of the double helix. — FIGURE 8-30 Formation of pyrimidine dimers induced by UV light. (a) One type of reaction (on the left) results in the formation of a cyclobutyl ring involving C-5 and C-6 of adjacent pyrimidine residues. An alternative reaction (on the right) results in a 6-4 photoproduct, with a linkage between C-6 of one pyrimidine and C-4 of its neighbor. (b) Formation of a cyclobutane pyrimidine dimer introduces a bend or kink into the DNA. [(b) Data from PDB ID 1TTD, K. McAteer et al., *J. Mol. Biol.* 282:1013, 1998.]

Part a shows two adjacent thymines on a vertical D N A strand. The strand has a bond extending down to a five-carbon ring that has a bond to thymine to the right and a two-member chain down to P, from which a bond extends down to another five-carbon ring with a bond to thymine to the right and another two-member chain extending down. Each thymine is a six-membered ring with N 1 on the left bonded to the five-carbon ring of the D N A backbone. Clockwise from N 1 on the left, each ring has C 2 double bonded to O, N in position 3 bonded to H, C 4 double bonded to O, a solid wedge bond to C 5, C 5 bonded to C H 3 and double bonded to C 6, and C 6 bonded to H and solid wedge bonded to N 1. Two arrows labeled U V light point down, one to the left and one to the right. The left-hand product is a cyclobutane thymine dimer. The vertical strand of D N A is shown with the two thymines to the right. Each thymine now has a single bond between C 5 and C 6. C 5 of the top thymine has a highlighted bond to C 5 of the bottom thymine, and C 6 of the top thymine has a highlighted bond to C 6 of the bottom thymine. The right-hand product is labeled 6-4 photoproduct. The vertical strand of D N A is shown with the two thymines to the right. The top thymine now has a single bond between C 5 and C 6. A highlighted bond extends diagonally down from C 6 of the top thymine to C 4 of the bottom thymine on the right, which is now bonded to O H instead of double bonded to O. Part b shows a double helix of D N A. Near the upper middle, two thymines are shown that are connected by two bonds and pulled away from the two complementary adenines in the opposite strand. This produces a kink in the double helix rather than a smooth, even turn. The kink is labeled.

Virtually all forms of life are exposed to energy-rich radiation capable of causing chemical changes in DNA. Near-UV radiation (with wavelengths of 200 to 400 nm), which makes up a significant portion of the solar spectrum, is known to cause pyrimidine dimer formation and other chemical changes in the DNA of bacteria and of human skin cells. We are subjected to a constant field of ionizing radiation in the form of cosmic rays, which can penetrate deep into the earth, as well as radiation emitted from radioactive elements, such as radium, plutonium, uranium, radon, $^{14}C$, and $^{3}H$. X-rays used in medical and dental examinations and in radiation therapy of cancer and other diseases are another form of ionizing radiation. It is estimated that UV and ionizing radiations are responsible for about 10% of all DNA damage caused by environmental agents.

DNA also may be damaged by reactive chemicals introduced into the environment as products of industrial activity. Such products may not be injurious per se but may be metabolized by cells into forms that are. There are two prominent classes of such agents (Fig. 8-31): (1) deaminating agents, particularly nitrous acid ($HNO_{2}$) or compounds that can be metabolized to nitrous acid or nitrites, and (2) alkylating agents.

A two-part figure, a and b, shows three nitrous acid precursors and 4 alkylating agents that cause D N A damage. — FIGURE 8-31 Chemical agents that cause DNA damage. (a) Precursors of nitrous acid, which promotes deamination reactions. (b) Alkylating agents. Most generate modified nucleotides nonenzymatically.

Part a shows three nitrous acid precursors. Sodium nitrite is N a N O 2. Sodium nitrate is N a N O 3. Nitrosamine is N bonded to R superscript 1 on the upper left, R superscript 2 on the lower left, and to N on the right that is further double bonded to O. Part b shows four alkylating agents. S-adenosylmethionine has a five-membered sugar ring with O at the top vertex between C 4 prime and C 1 prime. C 1 prime is bonded to N 9 of adenine above and H below, C 2 and C 3 are bonded to H above and O H below, C 4 prime is bonded to C 5 above and H below, and C 5 is bonded to 2 H and to S plus of methionine. Adenine has a six-membered ring adjacent to a five-membered ring. The six-membered ring has N at position 1, N at position 3, C 6 bonded to N H 2, double bonds between positions 2 and 3 and positions 1 and 6, and a double bond between positions 4 and 5 that is shared with the five-membered ring. The five-membered ring has N at position 7, N at position 9 that is bonded to C 1 prime of the sugar below, and a double bond between positions 7 and 8. The sugar and adenine are labeled as adenosine. Methionine has vertical six-membered chain with C O O minus at the top, C 2 bonded to N H 3 plus on the left and H on the right, C 3 and C 4 bonded to 2 H, S plus substituted for C in position 5 and bonded to C 5 prime of the ring, and C 6 bonded to 3 H. Dimethylnitrosamine has N bonded to C H 3 above on the left, C H 3 below on the left, and N on the right that is further double bonded to O. Dimethylsulfate has S that is double bonded to O at the upper right, double bonded to O at the lower right, bonded to O at the lower left that is further bonded to C H 3, and bonded to O at the upper left that is further bonded to C H 3. Nitrogen mustard has N that is bonded to C H 3 on the left, bonded to C H 2 on the upper right that is further bonded to C H 2 that is bonded to C l, and bonded to C H 2 on the lower right that is further bonded to C H 2 that is bonded to C l.

Nitrous acid, formed from organic precursors such as nitrosamines and from nitrite and nitrate salts, is a potent accelerator of the deamination of bases. Bisulfite has similar effects. Both agents are used as preservatives in processed foods to prevent the growth of toxic bacteria. They do not seem to increase cancer risks significantly when used in this way, perhaps because they are used in only small amounts and make only a minor contribution to the overall levels of DNA damage. (The potential health risk from food spoilage if these preservatives were not used is much greater.)

Alkylating agents can alter certain bases of DNA. For example, the highly reactive chemical dimethylsulfate (Fig. 8-31b) can methylate a guanine to yield $O^{6}$-methylguanine, which cannot base-pair with cytosine.

A figure shows the interconversion of two guanine tautomers and the formation of $O^{6}$-methylguanine.

The top guanine tautomer has a six-membered ring on the left and a five-membered ring on the right. The six-membered ring has N H in position 1, C 2 bonded to N H 2, N in position 3, C double bonded to O in position 6, a double bond between position 2 and position 3, and a double bond between position 4 and position 5 that is shared with the five-membered ring. C 6 is labeled with the number 6. The five-membered ring has N at position 7, N H at position 9, and a double bond between position 7 and position 8. This interconverts with another molecule, but the reverse reaction is represented by a longer arrow than the forward reaction. The second guanine tautomer is similar to the first except that N at position 1 is no longer bonded to H, C 6 is bonded to O H, and there is a double bond between position 1 and position 6. C 6 is labeled with the number 6. The second tautomer undergoes a reaction involving $(CH_{3})_{2}SO_{4}$ to produce $O^{6}$-methylguanine. This molecule has a similar structure to the reactant except that C 6, labeled with the number 6, is bonded to O C H 3.

Some alkylation of bases is a normal part of the regulation of gene expression. The enzymatic methylation of certain bases using S-adenosylmethionine is one example discussed below.

The most important source of mutagenic alterations in DNA is oxidative damage. Reactive oxygen species such as hydrogen peroxide, hydroxyl radicals, and superoxide radicals arise during irradiation or (more commonly) as a byproduct of aerobic metabolism. These species damage DNA through any of a large, complex group of reactions, ranging from oxidation of deoxyribose and base moieties to strand breaks. Of these species, the hydroxyl radicals are responsible for most oxidative DNA damage. Cells have an elaborate defense system to destroy reactive oxygen species, including enzymes such as catalase and superoxide dismutase that convert reactive oxygen species to harmless products. A fraction of these oxidants inevitably escape cellular defenses, however, and are able to damage DNA. Accurate estimates for the extent of this damage are not yet available, but every day the DNA of each human cell is subjected to thousands of damaging oxidative reactions.

This is merely a sampling of the best-understood reactions that damage DNA. Many carcinogenic compounds in food, water, and air exert their cancer-causing effects by modifying bases in DNA. Nevertheless, the integrity of DNA as a polymer is better maintained than that of either RNA or protein, because DNA is the only macromolecule that has the benefit of extensive biochemical repair systems. These repair processes (described in Chapter 25) greatly lessen the impact of damage to DNA.

Some Bases of DNA Are Methylated

Certain nucleotide bases in DNA molecules are enzymatically methylated. Adenine and cytosine are methylated more often than guanine and thymine. Methylation is generally confined to certain sequences or regions of a DNA molecule. In some cases, the function of methylation is well understood; in others, the function remains unclear. All known DNA methylases use S-adenosylmethionine as a methyl group donor (Fig. 8-31b). E. coli has two prominent DNA methylation systems. One serves in a defense role, allowing the cell to distinguish its DNA from foreign DNA by marking its own DNA with methyl groups. The cell can then identify as foreign and destroy DNA without the methyl groups (this is known as a restriction-modification system; see p. 303). The other enzyme system methylates adenosine residues within the sequence ($5'$)GATC($3'$) to $N^{6}$-methyladenosine (Fig. 8-5a). Methyl groups are added by the Dam (DNA adenine methylation) methylase shortly after DNA replication, allowing the cell to distinguish newly replicated DNA from older cellular DNA (see Fig. 25-20).

In eukaryotic cells, about 5% of cytidine residues in DNA are methylated to 5-methylcytidine (Fig. 8-5a). Methylation is most common at CpG sequences, producing methyl-CpG symmetrically on both strands of the DNA. The extent of methylation of CpG sequences varies by region in large eukaryotic DNA molecules, affecting DNA metabolism and gene expression.
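Both methylation targets described here, GATC for the Dam methylase and CpG in eukaryotes, are short sequence motifs, so potential sites can be located with a simple scan. The sketch below is a hypothetical helper with a made-up sequence; it finds where a motif occurs and cannot say whether a given site is actually methylated.

```python
def find_motif(sequence, motif):
    """Return the 0-based start index of every occurrence of motif in sequence."""
    seq, pattern = sequence.upper(), motif.upper()
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one base later so that overlapping occurrences are also found.
        start = seq.find(pattern, start + 1)
    return positions


if __name__ == "__main__":
    strand = "TTGATCCGACGGATCG"  # hypothetical strand, written 5'->3'
    print("GATC sites:", find_motif(strand, "GATC"))
    print("CpG sites: ", find_motif(strand, "CG"))
```

Both motifs read the same on the complementary strand in the 5'→3' direction, which is why a site found on one strand is also present on the other and CpG methylation can be symmetric.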

The Chemical Synthesis of DNA Has Been Automated

An important practical advance in nucleic acid chemistry was the rapid and accurate synthesis of short oligonucleotides of known sequence. The methods were pioneered by H. Gobind Khorana and his colleagues in the 1970s. Refinements by Robert Letsinger and Marvin Caruthers led to the chemistry now in widest use, called the phosphoramidite method (Fig. 8-32). The synthesis is carried out with the growing strand attached to a solid support, using principles similar to those used by Merrifield for peptide synthesis (see Fig. 3-30), and is readily automated. The efficiency of each addition step is very high, allowing the routine synthesis of polymers containing 70 or 80 nucleotides and, in some laboratories, much longer strands. The availability of relatively inexpensive DNA polymers with predesigned sequences revolutionized all areas of biochemistry.
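The reason per-step efficiency matters so much is that losses compound with every addition. The sketch below estimates the fraction of chains that reach full length, assuming every coupling succeeds independently with the same efficiency; the 99% and 98% values are illustrative assumptions, not figures from the text.

```python
def full_length_yield(n_nucleotides, coupling_efficiency):
    """Fraction of chains that reach full length after stepwise synthesis.

    The first nucleoside is already attached to the support, so an
    n-nucleotide product requires n - 1 coupling steps.
    """
    if n_nucleotides < 1:
        raise ValueError("an oligonucleotide needs at least one nucleotide")
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling efficiency must be between 0 and 1")
    return coupling_efficiency ** (n_nucleotides - 1)


if __name__ == "__main__":
    for efficiency in (0.99, 0.98):  # assumed per-step efficiencies
        for length in (20, 80):
            y = full_length_yield(length, efficiency)
            print(f"{length}-mer at {efficiency:.0%} per step: {y:.1%} full length")
```

Under these assumptions an 80-nucleotide product is obtained at roughly 45% of the theoretical maximum with 99% coupling and roughly 20% with 98% coupling, which illustrates why small gains in step efficiency extend the practical length limit.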

A figure shows the chemical synthesis of D N A by the phosphoramidite method.

FIGURE 8-32 Chemical synthesis of DNA by the phosphoramidite method. Automated DNA synthesis is conceptually similar to the synthesis of polypeptides on a solid support. The oligonucleotide is built up on the solid support (silica), one nucleotide at a time, in a repeated series of chemical reactions with suitably protected nucleotide precursors. The first nucleoside (which will be the $3'$ end) is attached to the silica support at the $3'$ hydroxyl (through a linking group, R) and is protected at the $5'$ hydroxyl with an acid-labile dimethoxytrityl group (DMT). The reactive groups on all bases are also chemically protected. The protecting DMT group is removed by washing the column with acid (the DMT group is colored, so this reaction can be followed spectrophotometrically). The next nucleotide has a reactive phosphoramidite at its $3'$ position: a trivalent phosphite (as opposed to the more oxidized pentavalent phosphate normally present in nucleic acids) with one linked oxygen replaced by an amino group or a substituted amine. In the common variant shown, one of the phosphoramidite oxygens is bonded to the deoxyribose, the other is protected by a cyanoethyl group, and the third position is occupied by a readily displaced diisopropylamino group. Reaction with the immobilized nucleotide forms a $5'$,$3'$ linkage, and the diisopropylamino group is eliminated. The phosphite linkage is then oxidized with iodine to produce a phosphotriester linkage. Reactions 2 through 4 are repeated until all nucleotides are added. At each step, excess nucleotide is removed before addition of the next nucleotide. In the final steps, the remaining protecting groups on the bases and the phosphates are removed, and the oligonucleotide is separated from the solid support and purified. The chemical synthesis of RNA is somewhat more complicated because of the need to protect the $2'$ hydroxyl of ribose without adversely affecting the reactivity of the $3'$ hydroxyl.

The starting molecule has a five-membered sugar ring with O at the top vertex between C 4 prime on the left and C 1 prime on the right. C 1 prime is bonded to a highlighted box labeled base subscript 1 end subscript above the ring and to H below the ring, C 2 prime is bonded to H above and below the ring, C 3 prime is bonded to H above and O H below the ring, C 4 prime is bonded to C 5 prime above the ring and H below the ring, and C 5 prime is bonded to 2 H and further bonded to O that is further bonded to a blue highlighted box labeled D M T. Text next to the box labeled base subscript 1 end subscript reads, nucleoside protected at 5 prime hydroxyl. Step 1: Attach nucleoside to silica support. The product is a similar molecule except that C 3 prime is bonded to O below that is further bonded to R that is further bonded to a sphere labeled S i. Step 2: Remove protecting group. D M T leaves, producing a similar molecule in which C 5 prime is bonded to O H. Step 3: Add next nucleotide. During this step, a molecule is added that has a five-membered sugar ring with O at the top vertex between C 4 prime on the left and C 1 prime on the right. C 1 prime is bonded to a highlighted box labeled base subscript 2 end subscript above the ring and to H below the ring, C 2 prime is bonded to H above and below the ring, C 3 prime is bonded to H above and to O below the ring that is further bonded to P, C 4 prime is bonded to C 5 prime above the ring and H below the ring, and C 5 prime is bonded to 2 H and further bonded to O that is further bonded to a blue highlighted box labeled D M T. The O beneath C 3 prime is enclosed in a purple box and is bonded to P below that is bonded to O on the left, all in the same box, and to N plus in another box. A line points from the word phosphoramidite to the purple box. The O at the left of the purple box is further bonded to a chain in a blue box labeled cyanoethyl protecting group. The chain has 2 C H 2 further bonded to C N.
The P in the purple box is further bonded to N plus in another blue box. N plus is bonded to H below and to C H bonded to 2 C H 3 on each side. This blue box is labeled diisopropylamino activating group. During step 3, a diisopropylamine byproduct leaves. This has N bonded to H below and on each side to C H further bonded to 2 C H 3. The product has the second nucleoside on top, with base subscript 2, with a bond from C 3 prime to O that is further bonded to P that is further bonded to O on the left that is bonded to 2 C H 2 and further bonded to C N and that is further bonded to O below that is bonded to C 5 prime of the first nucleoside that has base subscript 1 bonded to C 1 prime. Step 4: Oxidize to form triester. This produces a similar molecule in which C 3 prime of the top sugar is bonded to O below that is further bonded to P that is double bonded to O on the right, bonded to O on the left that is further bonded to 2 C H 2 that is further bonded to C N, and that is bonded to O below that is further bonded to C 5 prime of the sugar below. A dotted arrow points back to the original product of step 1, a nucleoside with D M T bonded to O bonded to C 5 prime, C 1 prime bonded to base subscript 1, and C 3 prime bonded to O that is further bonded to R that is bonded to S i. Text accompanying the dotted arrow reads, repeat steps 2 to 4 until all residues are added. An arrow points down accompanied by three steps. Step 5: Remove protecting groups from bases. Step 6: Remove cyanoethyl groups from phosphates. Step 7: Cleave chain from silica support. This produces an oligonucleotide chain shown as a 5 prime to 3 prime strand with bases extending upward.

Gene Sequences Can Be Amplified with the Polymerase Chain Reaction

Genome projects, as described in Chapter 9, have given rise to online databases containing the complete genome sequences of thousands of organisms. This archive of sequence information allows researchers to greatly amplify any DNA segment they might be interested in with the polymerase chain reaction (PCR), a process conceived by Kary Mullis in 1983. Even DNA segments with unknown sequences can be amplified if the sequences flanking them are known. The amplified DNA can then be used for a multitude of purposes, as we shall see.

The PCR procedure, shown in Figure 8-33, relies on DNA polymerases, enzymes that synthesize DNA strands from deoxynucleoside triphosphates (dNTPs), using a DNA template. DNA polymerases do not synthesize DNA de novo, but instead must add nucleotides to the $3'$ ends of preexisting strands, referred to as primers (see Chapter 25). In PCR, two synthetic oligonucleotides are prepared for use as replication primers that can be extended by a DNA polymerase. These oligonucleotide primers are complementary to sequences on opposite strands of the target DNA, positioned so that their $5'$ ends define the ends of the segment to be amplified, and they become part of the amplified sequence. The $3'$ ends of the annealed primers are oriented toward each other and positioned to prime DNA synthesis across the targeted DNA segment.

FIGURE 8-33 Amplification of a DNA segment by the polymerase chain reaction (PCR). The PCR procedure has three steps: DNA strands are separated by heating, then annealed to an excess of short synthetic DNA primers (orange) that flank the region to be amplified (dark blue); new DNA is synthesized by polymerization catalyzed by DNA polymerase. The thermostable *Taq* DNA polymerase is not denatured by the heating steps. The three steps are repeated for 25 or 30 cycles in an automated process carried out in a small benchtop instrument called a thermocycler.

A double stranded piece of D N A is shown as two horizontal bars. The top strand runs 3 prime to 5 prime and the bottom strand runs 5 prime to 3 prime. The middle third of each bar is highlighted, and a bracket indicates that this is the region of target D N A to be amplified. An arrow points downward accompanied by two steps. Step 1: Heat to separate strands. Step 2: Add synthetic D N A oligonucleotide primers; cool. The product has the same two horizontal strands of D N A, but they are farther apart. Beneath the left side of the highlighted central region of the top strand, a small yellow box with its 5 prime end on the left has an arrow pointing to the right. Above the right side of the highlighted central region of the bottom strand, a similar yellow box has its 5 prime end on the right and an arrow pointing to the left. Step 3: Add thermostable italicized uppercase T lowercase a lowercase q end italics D N A polymerase to catalyze 5 prime to 3 prime D N A synthesis. An arrow points down to show the products. The top strand still has the small box beneath the left side of its highlighted region but now has a complementary strand that runs all the way to the right end of the top strand. This new strand has a highlighted region directly beneath the highlighted region in the parent strand. The original bottom strand still has a small box above the right side of its highlighted region. A new complementary strand runs all the way to the left of the original bottom strand and has a highlighted region directly above the highlighted region of the original bottom strand. An arrow points down accompanied by text that reads, Repeat steps 1 and 2. In the product, the original and new strands have moved farther apart as the original strand did after steps 1 and 2. The original top strand again has a small box beneath the left side of its highlighted region with a small arrow pointing to the right. 
The bottom strand has a small box above the right side of its highlighted region with a small arrow pointed toward the left, but does not have a far left side beyond the small box from which it originally started. The original bottom strand and its new complementary strand look the same as the original top strand and its new complementary strand except that they are inverted. An arrow points down accompanied by text that reads, D N A synthesis (step 3) is catalyzed by the thermostable D N A polymerase (still present). The product has four sets of complementary strands. The original top strand has a new strand beneath exactly like it did after step 3. Beneath, its complementary daughter strand now has a complementary piece exactly above its highlighted region only with a small box on each side like the initial box with an arrow that started each strand. The original bottom strand has a new strand above exactly like it did after step 3. Above, its complementary daughter strand resembles the one for the top strand but inverted. An arrow points down accompanied by text reading, Repeat steps 1 through 3. The original top strand is shown with a new complementary strand beneath exactly like it had after steps 1 and 2. The first complementary strand that is produced has a strand above it just like it did in the previous step, with a complementary piece above its highlighted region but nothing above its right side. Beneath this, there are two highlighted regions that are paired with a small yellow box on the right side of the top strand and on the left side of the bottom strand. Beneath this, there is another piece with paired highlighted regions and then a bottom strand that extends farther to the right with no complementary piece. There is a small yellow box at the right of the top highlighted region and at the left of the bottom highlighted region for each of this top set of four pairs of strands. The bottom four sets of strands look the same but inverted. 
An arrow points down to text that reads, After 20 cycles, the target sequence has been amplified about 10 superscript 6 end superscript fold.

The PCR procedure has an elegant simplicity. Basic PCR requires four components: a DNA sample containing the segment to be amplified, the pair of synthetic oligonucleotide primers, a pool of deoxynucleoside triphosphates, and a DNA polymerase. There are three steps (Fig. 8-33). In step 1, the reaction mixture is heated briefly to denature the DNA, separating the two strands. In step 2, the mixture is cooled so that the primers can anneal to the DNA. The high concentration of primers increases the likelihood that they will anneal to each strand of the denatured DNA before the two DNA strands (present at a much lower concentration) can reanneal to each other. Then, in step 3, the primed segment is replicated selectively by the DNA polymerase, using the pool of dNTPs. The cycle of heating, cooling, and replication is repeated 25 to 30 times over a few hours in an automated process, amplifying the DNA segment between the primers until the sample is large enough to be readily analyzed or cloned (described in Chapter 9).

Each replication cycle doubles the number of target DNA segment copies, so the number of copies grows exponentially. The flanking DNA sequences increase in number only linearly, so their contribution is quickly rendered insignificant. After 20 cycles, the targeted DNA segment has been amplified more than a millionfold ($2^{20}$); after 30 cycles, more than a billionfold. Step 3 of PCR uses a heat-stable DNA polymerase such as the *Taq* polymerase, isolated from a thermophilic bacterium (*Thermus aquaticus*) that thrives in hot springs where temperatures approach the boiling point of water. The *Taq* polymerase remains active after every heating step (step 1) and does not have to be replenished.
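The contrast between exponential growth of the target segment and linear growth of products that extend into flanking DNA can be checked with a small bookkeeping model. The sketch below assumes an idealized reaction in which every strand is copied once per cycle, starting from a single duplex by default. The function name and strand categories are illustrative choices, not part of any standard tool.

```python
def pcr_strand_counts(cycles: int, start_duplexes: int = 1) -> dict:
    """Count single strands by type after a number of ideal PCR cycles.

    original: template strands present at the start
    long:     copied from an original strand; one end is set by a primer,
              the other runs on into flanking DNA
    target:   both ends set by primers (the segment being amplified)
    """
    original = 2 * start_duplexes
    long_products = 0
    target = 0
    for _ in range(cycles):
        # Each original strand templates one new long product per cycle.
        new_long = original
        # Long and target strands both template target-length copies,
        # because synthesis stops at the primer-defined 5' end of the template.
        new_target = long_products + target
        long_products += new_long
        target += new_target
    return {"original": original, "long": long_products, "target": target}


for n in (3, 20, 30):
    counts = pcr_strand_counts(n)
    print(n, counts)
```

In this model the long products increase by two per cycle, whereas the target-length strands approach $2^{n+1}$ after $n$ cycles, which is why the flanking sequences soon become a negligible fraction of the product.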

This technology is highly sensitive: PCR can detect and amplify just one DNA molecule in almost any type of sample — including some ancient ones. The double-helical structure of DNA is highly stable, but as we have seen, DNA does degrade slowly over time through various nonenzymatic reactions. PCR has allowed the successful cloning of rare, undegraded DNA segments isolated from samples more than 40,000 years old. Investigators have used the technique to clone DNA fragments from the mummified remains of humans and extinct animals, such as the woolly mammoth, creating the research fields of molecular archaeology and molecular paleontology. DNA from burial sites has been amplified by PCR and used to trace ancient human migrations (see Fig. 9-31). Epidemiologists use PCR-enhanced DNA samples from human remains to trace the evolution of human pathogenic viruses. Due to its capacity to amplify just a few strands of DNA that might be present in a sample, PCR is a potent tool in forensic medicine (Box 8-1). It is also being used to detect viral infections and certain types of cancers before they cause symptoms, as well as in the prenatal diagnosis of genetic diseases.

BOX 8-1

A Potent Weapon in Forensic Medicine

One of the most accurate methods for placing an individual at the scene of a crime is a fingerprint. But with the advent of recombinant DNA technology (see Chapter 9), a much more powerful tool became available: DNA genotyping (also called DNA fingerprinting or DNA profiling). As first described by English geneticist Alec Jeffreys in 1985, the method is based on sequence polymorphisms, slight sequence differences among individuals — 1 in every 1,000 bp, on average. Each difference from the prototype human genome sequence (the first human genome that was sequenced) occurs in some fraction of the human population; every person has some differences from this prototype.

Forensic work focuses on differences in the lengths of short tandem repeat (STR) sequences. An STR locus is a specific location on a chromosome where a short DNA sequence (usually 4 bp long) is repeated many times in tandem. The loci most often used in STR genotyping are short — 4 to 50 repeats (16 to 200 bp for tetranucleotide repeats) — and have multiple length variants in the human population. More than 20,000 tetranucleotide STR loci have been characterized in the human genome. And more than a million STRs of all types may be present in the human genome, accounting for about 3% of all human DNA.

The length of a particular STR in a given individual can be determined with the aid of the polymerase chain reaction (see Fig. 8-33). The use of PCR also makes the procedure sensitive enough to be applied to the very small samples often collected at crime scenes. The DNA sequences flanking STRs are unique to each STR locus and are identical (except for very rare mutations) in all humans. PCR primers are targeted to this flanking DNA and are designed to amplify the DNA across the STR (Fig. 1a). The length of the PCR product then reflects the length of the STR in that sample. Because each human inherits one chromosome of each chromosome pair from each parent, the STR lengths on the two chromosomes are often different, generating two different STR lengths from one individual.
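The relationship between PCR product length and STR repeat number is simple arithmetic, sketched below. The 100 bp value for the combined primer and flanking sequence is a hypothetical number chosen for illustration; the function name is likewise an assumption. Alleles with a partial repeat are labeled in the usual forensic style, with the number of extra base pairs after a decimal point (as in 12.2).

```python
def allele_label(product_bp: int, flank_bp: int, repeat_unit_bp: int = 4) -> str:
    """Convert a PCR product length into an STR allele label.

    product_bp:     measured length of the PCR fragment
    flank_bp:       total length contributed by primers and flanking DNA
                    (hypothetical value supplied by the caller)
    repeat_unit_bp: length of one repeat (4 for tetranucleotide STRs)
    """
    str_bp = product_bp - flank_bp
    if str_bp < 0:
        raise ValueError("product is shorter than the flanking sequence")
    full_repeats, extra_bp = divmod(str_bp, repeat_unit_bp)
    # A partial repeat is written after a decimal point, e.g. 12 repeats + 2 bp -> "12.2".
    return f"{full_repeats}.{extra_bp}" if extra_bp else str(full_repeats)


FLANK_BP = 100  # assumed combined length of primers plus flanking sequence
for fragment in (132, 124, 150):
    print(fragment, "bp ->", allele_label(fragment, FLANK_BP), "repeats")
```

With these assumed numbers, fragments of 132 bp and 124 bp from one individual would correspond to alleles with eight and six repeats, a heterozygous genotype at that locus.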

FIGURE 1 (a) STR loci can be analyzed by PCR. Suitable PCR primers (with an attached dye to aid in subsequent detection) are targeted to sequences on each side of the STR, and the region between them is amplified. If the STR sequences have different lengths on the two chromosomes of an individual’s chromosome pair, two PCR products of different lengths result. (b) The PCR products from amplification of up to 16 STR loci can be run on a single capillary acrylamide gel (a “16-plex” analysis). Determination of which locus corresponds to which signal depends on the color of the fluorescent dye attached to the primers used in the process and on the size range in which the signal appears (the size range can be controlled by which sequences — those closer to or more distant from the STR — are targeted by the designed PCR primers). Fluorescence is given in relative fluorescence units (RFU), as measured against a standard supplied with the kit. [(b) Information from Carol Bingham, Promega Corporation.]

Part a shows two horizontal pieces of double stranded D N A, each with the 3 prime end on the left of the top strand and the 5 prime end on the left of the bottom strand. Both pieces have a highlighted region near the center. The highlighted region is labeled Locus D 7 S 820. The highlighted region in the top piece has eight small segments inside and is labeled allele 1. The bottom piece has a shorter highlighted region with six small segments inside labeled allele 2. The segments are labeled S T R sequences. Small yellow boxes are shown above each piece of D N A just to the left of the highlighted region. Each has a small green sphere on the left side and an arrow pointing to the right. These are labeled dye-labeled P C R primers. Small yellow boxes with arrows pointing toward the left are shown beneath each piece of D N A and to the right of the highlighted regions. An arrow pointing downward is labeled P C R amplification. Four double-stranded pieces of D N A are shown. Each has a green circle at the far right, then a yellow box, then the start of the top main strand. Each has a yellow box at the left side of the bottom strand. Two of the molecules have eight segments in the highlighted region, and two have six segments in the highlighted region. An arrow points down to a small horizontal bar and is labeled run P C R fragments on a capillary gel. The small cylinder has two thin green lines encircling it. Beneath these lines, there is a small horizontal line with two sharp green peaks above it. The left-hand peak is slightly taller than the right-hand peak. Text reads, scan of P C R product bands derived from locus D 7 S 820. Part b shows a much longer cylinder with many bands on it in three different colors. This is labeled P C R fragments run on a capillary gel. It is positioned above a graph that has peaks that match the bands encircling the cylinder, with each peak beneath a particular band and in the same color. 
The graph has a bottom horizontal axis labeled 16-plex. The top horizontal axis runs from below 105 b p on the left to above 385 b p on the right, labeled in increments of 70. The vertical axis is labeled R F U and ranges from 0 to over 1800, labeled in increments of 600. The peaks are as follows with color and coordinate: black (104, 1250), black (110, 1100), blue (120, 1000), green (125, 900), blue (130, 1250), green (135, 1000), black (140, 1300), black (145, 1400), blue (15, 1300), blue (175, 1300), green (177, 1250), green (180, 1150), blue (115, 1300), blue (120, 1250), black (124, 1150), black (127, 1150), green (129, 1270), green (235, 1300), black (260, 1700), black (285, 1800), green (290, 1300), blue (305, 1190), blue (320, 1200), black (330, 1150) green (332, 1270), black (351, 1700), blue (385, 1700), blue (295, 1600), green (400, 1650), green (420, 1700). All data are approximate.

The PCR products are subjected to electrophoresis on a very thin polyacrylamide gel in a capillary tube. The resulting bands are converted into a set of peaks that accurately reveal the size of each PCR fragment and thus the length of the STR in the corresponding allele. Analysis of multiple STR loci can yield a profile that is unique to an individual (Fig. 1b). This is typically done with a commercially available kit that includes PCR primers unique to each locus, linked to colored dyes to help distinguish the different PCR products. PCR amplification enables investigators to obtain STR genotypes from less than 1 ng of partially degraded DNA, an amount that can be obtained from a single hair follicle, a small fraction of a drop of blood, a small semen sample, or samples that might be months or even many years old. When good STR genotypes are obtained, the chance of misidentification is less than 1 in $10^{18}$ (a quintillion).

The successful forensic use of STR analysis required standardization, first attempted in the United Kingdom in 1995. The U.S. standard, called the Combined DNA Index System (CODIS), established in 1998, was originally based on 13 well-studied STR loci. These continue to be required in any DNA-typing experiment carried out in the United States (Table 1) and are also used internationally. The amelogenin gene is also used as a marker in the analyses. Present on the human sex chromosomes, this gene has a slightly different length on the X and Y chromosomes. PCR amplification across this gene thus generates different-sized products that can reveal the sex of the DNA donor. By mid-2019, the CODIS database contained more than 18 million STR genotypes and had assisted in nearly 500,000 forensic investigations. As the CODIS database has expanded, the chance for adventitious matches has increased. The CODIS standard was expanded in 2017 to include 20 core loci. The new loci were incorporated with international agreement to ensure compatibility. Loci utilized in commercial kits have since expanded from 16 to 24.

TABLE 1 Properties of the Loci Used for the CODIS Database
Locus	Chromosome	Repeat motif	Repeat length (range)^a	Number of alleles seen^b
CSF1PO	5	TAGA	5–16	20
FGA	4	CTTT	12.2–51.2	80
TH01	11	TCAT	3–14	20
TPOX	2	GAAT	4–16	15
VWA	12	[TCTG][TCTA]	10–25	28
D3S1358	3	[TCTG][TCTA]	8–21	24
D5S818	5	AGAT	7–18	15
D7S820	7	GATA	5–16	30
D8S1179	8	[TCTA][TCTG]	7–20	17
D13S317	13	TATC	5–16	17
D16S539	16	GATA	5–16	19
D18S51	18	AGAA	7–39.2	51
D21S11	21	[TCTA][TCTG]	12–41.2	82
Amelogenin^c	X, Y	Not applicable
Data from J. M. Butler, Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers, 2nd edn, Elsevier, 2005, p. 96.
^a Repeat lengths observed in the human population. Partial or imperfect repeats can be included in some alleles.
^b Number of different alleles observed as of 2005 in the human population. Careful analysis of a locus in many individuals is a prerequisite to its use in forensic DNA typing.
^c Amelogenin is a gene, of slightly different size on the X and Y chromosomes, that is used to establish the sex of the DNA donor.

DNA genotyping has been used to both convict and acquit suspects, and to establish paternity with an extraordinary degree of certainty. In the United States, there have been many hundreds of postconviction exonerations based on DNA evidence. The impact of these procedures on court cases will continue to grow as standards are refined and as international STR genotyping databases grow. Even very old mysteries can be solved. In 1996, STR genotyping helped confirm identification of the bones of the last Russian czar and his family, who were assassinated in 1918.

WORKED EXAMPLE 8-2 Designing Primers for the Polymerase Chain Reaction

You set out to amplify the chromosomal sequence between the bases underlined below, using PCR. Only one strand is shown, but keep in mind that it is paired to a complementary strand.

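TGGTAGGCCGAT – – – [1,000 bp] – – – TAGCTAAGAATCTTTCTCAGAA

(Underlined bases: the G at position 3 of the left-hand segment, which begins GTAGGC, and the T at position 14 of the right-hand segment, which ends AATCTT.)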

Design single-stranded oligonucleotide primers to amplify only those sequences between (and including) the underlined bases. The optimal length for PCR primers is usually 18 to 22 nucleotides. For this example, simply write the first 6 nucleotides of each primer.

SOLUTION:

Left primer GTAGGC

Right primer AAGATT

Remember, (1) the two strands of DNA are antiparallel, (2) DNA synthesis proceeds uniquely in the $5' \to 3'$ direction, (3) DNA synthesis must be directed across the region to be amplified, and (4) DNA sequences are always written in the $5' \to 3'$ direction. The sequence given is thus oriented $5' \to 3'$, left to right, even though no orientation guides are provided. The left primer must be complementary to the strand not shown, which is in the opposite orientation. Thus, the left primer begins at the G and is identical to the sequence shown. The right primer must direct DNA synthesis right to left, synthesizing a strand complementary to the strand provided. It will begin with an A complementary to the T, and then continue with additional nucleotides complementary to the strand shown. Although that sequence is written right to left, it is in the $3' \to 5'$ direction and must be flipped to be in the conventional orientation, written $5' \to 3'$, left to right. Companies that provide PCR primers expect orders to be written in the conventional $5' \to 3'$ direction. Doing otherwise is a common (and expensive) mistake.
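The reasoning in this worked example can be expressed as a short routine: the left primer copies the start of the strand shown, and the right primer is the reverse complement of its end. The sketch below is a minimal illustration. The function names are assumptions, and the unknown 1,000 bp interior is filled with the placeholder letter N, which the complement table leaves unchanged.

```python
# Translation table for Watson-Crick complements; other letters (such as N) pass through.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_primers(target: str, length: int) -> tuple:
    """Return (left, right) primers, both written 5'->3'.

    target: one strand, written 5'->3', spanning exactly the region to amplify
    length: number of nucleotides in each primer
    """
    target = target.upper()
    if not 0 < length <= len(target):
        raise ValueError("primer length must be between 1 and the target length")
    left = target[:length]                        # identical to the strand shown
    right = reverse_complement(target[-length:])  # complementary to the strand shown
    return left, right


# Region from the first marked base (G) through the last marked base (T).
region = "GTAGGCCGAT" + "N" * 1000 + "TAGCTAAGAATCTT"
left, right = design_primers(region, 6)
print("Left primer ", left)
print("Right primer", right)
```

Writing both primers $5' \to 3'$ from the outset, as this routine does, avoids the ordering mistake described above.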

The Sequences of Long DNA Strands Can Be Determined

In its capacity as a repository of information, a DNA molecule’s most important property is its nucleotide sequence. Until the late 1970s, determining the sequence of a nucleic acid containing as few as 5 or 10 nucleotides was very laborious. The development of two techniques in 1977 (one by Allan Maxam and Walter Gilbert, the other by Frederick Sanger) made possible the sequencing of larger DNA molecules. Although the two methods are similar in strategy, Sanger sequencing, also known as dideoxy chain-termination sequencing, is both technically easier and more accurate (Fig. 8-34).

FIGURE 8-34 DNA sequencing by the Sanger method. This method makes use of the mechanism of DNA synthesis by DNA polymerases (Chapter 25). (a) DNA polymerases require both a primer (a short oligonucleotide strand), to which nucleotides are added, and a template strand to guide the selection of each new nucleotide. In cells, the $3'$-hydroxyl group of the primer reacts with an incoming deoxynucleoside triphosphate (dGTP in this example) to form a new phosphodiester bond. The Sanger sequencing procedure uses dideoxynucleoside triphosphate (ddNTP) analogs to interrupt DNA synthesis. When a ddNTP (ddATP in this example) is inserted in place of a dNTP, strand elongation is halted after the analog is added, because the analog lacks the $3'$-hydroxyl group needed for the next step. (b) Dideoxynucleoside triphosphate analogs have –H (red) rather than –OH at the $3'$ position of the ribose ring. (c) The DNA to be sequenced is used as the template strand, and a short primer, radioactively (in the example here) or fluorescently labeled, is annealed to it. When a small amount of a single ddNTP (ddCTP, for example) is added to an otherwise normal reaction containing all four dNTPs, some of the growing strands are terminated at each position where a C would normally be added. The result is a solution containing a mixture of labeled fragments of particular length, each ending with a C residue. The different-sized fragments, separated by electrophoresis, reveal the location of C residues. This procedure is repeated separately for each of the four ddNTPs, and the sequence can be read directly from an autoradiogram of the gel. Because shorter DNA fragments migrate faster, the fragments near the bottom of the gel represent the nucleotide positions closest to the primer (the $5'$ end), and the sequence is read (in the $5' \to 3'$ direction) from bottom to top. Note that the sequence obtained is that of the strand *complementary* to the strand being analyzed.

Part a shows a horizontal piece of double-stranded D N A. The top strand is labeled primer strand and begins with its 5 prime end on the left. The bottom strand is labeled template strand and begins with its 3 prime end on the left. Sets of three horizontal lines represent hydrogen bonds between bases, and there are three such bonds between each G C pair and two between each A T pair. The top strand has alternating circles representing phosphates and five-membered rings with one- or two-ringed bases extending downward. The bottom strand is similar except that the bases extend up from the backbone. From left to right, the bases on top are G, A, T, T, and C with G in the process of being added. From left to right, the bottom strand has C, T, A, A, G, C, T, C, G, A, C, and T. The G in the process of being added has formed three hydrogen bonds with C on the bottom strand and has a five-membered ring above with O H bonded to its upper right vertex and C H 2 bonded to its upper right vertex and further bonded to a chain of three P. The sugar to the left of this phosphate chain has O H bonded to its upper right vertex, and there is an arrow from a pair of highlighted electrons on the O to the first P of the three-phosphate chain. A second arrow points from the bond between this P and the second P in the chain to the second P. The structure with G, a sugar, and three P is labeled lowercase d uppercase G uppercase T uppercase P. To the right, a similar structure is shown that has A instead of G and a red highlighted H instead of O H bonded to the upper right vertex of the sugar ring. This structure is labeled lowercase d lowercase d uppercase A uppercase T uppercase P. Text reads, lowercase d lowercase d uppercase A uppercase T uppercase P terminates synthesis because 3 prime H cannot act as nucleophile in phosphodiester bond formation. Part b shows a molecule with a base, a five-membered sugar ring, and a chain of three phosphates.
The five-membered ring has O at the top vertex between C 4 prime and C 1 prime, C 1 prime bonded to a base above the ring, C 2 prime bonded to H below the ring, C 3 prime bonded to red highlighted H below the ring, C 4 prime bonded to C 5 prime above the ring that is further bonded to 2 H and a chain of three phosphates. Part c shows a rectangle labeled autoradiograph of electrophoresis gel. On the left, it is numbered from 12 to 1 from top to bottom. On the right, it is labeled from top to bottom as 3 prime A G T C G A G C T T A G 5 prime. Text below the right side reads, sequence of complementary strand. Across the top of the gel from left to right, the lanes are labeled A, C, G, T. There are bands in the lanes as follows: A 12, 7, 2; C 9, 5; G 11, 8, 6, 1; T 10, 4, 3. A box above the gel shows two bars. The top bar is labeled on the left as primer 5 prime star and has O H extending from the right side. The bottom bar is labeled on the left as template 3 prime and has a strand extending from the right side that reads, C T A A G C T C G A C T. This is added to lowercase d uppercase A uppercase T uppercase P, lowercase d uppercase C uppercase T uppercase P, lowercase d uppercase G uppercase T uppercase P, and lowercase d uppercase T uppercase T uppercase P. Four arrows point down from this box. The first arrow points down to plus lowercase d lowercase d uppercase A uppercase T uppercase P above three bars with a star on the left and a chain on the right. From top to bottom, the chains are G A T T C G A G C T G lowercase d lowercase d uppercase A; G A T T C G lowercase d lowercase d uppercase A; and G lowercase d lowercase d uppercase A. An arrow points down to the A lane of the gel. The second arrow points down to plus lowercase d lowercase d uppercase C uppercase T uppercase P above two bars with a star on the left and a chain on the right.
From top to bottom, the chains are G A T T C G A G lowercase d lowercase d uppercase C and G A T T lowercase d lowercase d uppercase C. An arrow points down to the C lane of the gel. The third arrow points down to plus lowercase d lowercase d uppercase G uppercase T uppercase P above four bars with a star on the left and a chain on the right. From top to bottom, the chains are G A T T C G A G C T lowercase d lowercase d uppercase G; G A T T C G A lowercase d lowercase d uppercase G; G A T T C lowercase d lowercase d uppercase G; and lowercase d lowercase d uppercase G. An arrow points down to the G lane of the gel. The fourth arrow points down to plus lowercase d lowercase d uppercase T uppercase T uppercase P above three bars with a star on the left and a chain on the right. From top to bottom, the chains are G A T T C G A G C lowercase d lowercase d T; G A T lowercase d lowercase d uppercase T; G A lowercase d lowercase d uppercase T. An arrow points down to the T lane of the gel.

Any protocol for DNA sequencing has two parts. One must first chemically distinguish between G, C, T, and A residues. A second strategy is then needed to determine where the four residues appear in the overall sequence.

Sanger sequencing exploited new (at the time) information about the mechanism of DNA synthesis by DNA polymerase to distinguish between the four nucleotides, making use of nucleotide analogs called dideoxynucleoside triphosphates (ddNTPs) to interrupt synthesis specifically at one or another type of nucleotide. Like PCR, Sanger’s method makes use of DNA polymerases and a primer to synthesize a DNA strand complementary to the strand under analysis. Each added deoxynucleotide is complementary, through base pairing, to a base in the template strand. In the reaction catalyzed by DNA polymerase, the $3^{'}$-hydroxyl group of the primer reacts with an incoming dNTP to form a new phosphodiester bond (Fig. 8-34a). The ddNTPs interrupt DNA synthesis because they bind to the template strand but lack the $3^{'}$-hydroxyl group needed to add the next nucleotide (Fig. 8-34b). Once a ddNTP is added to a growing strand, that strand cannot be extended further.

For instance, to identify C residues, a small amount of ddCTP is added to a reaction system containing a much larger amount of dCTP (along with the other three dNTPs). A competition then occurs every time the DNA polymerase encounters a G in the template strand. Usually, dC is added, and synthesis of the strand continues. Sometimes, ddC will be added instead, and the strand will be terminated at that position. Thus, a small fraction of the synthesized strands are prematurely terminated at every position where dC would normally be added, opposite each template dG. Given the excess of dCTP over ddCTP, the chance that the analog will be incorporated instead of dC is small. But enough ddCTP is present to ensure that some of the strands will be terminated at each G residue in the template.
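The competition between dCTP and ddCTP can be put in rough numerical terms. The Python sketch below assumes, purely for illustration, that ddC is incorporated with the same small probability `p_dd` at every template G and that each site is independent of the others. The function name and the value 0.01 are hypothetical and are not taken from any real protocol. Under these assumptions, the fraction of strands ending at successive G residues falls off geometrically.

```python
def termination_fractions(p_dd, n_sites):
    """Fraction of strands terminated at each successive template G.

    Assumes (for illustration only) that ddC is incorporated with the same
    probability p_dd at every G, independently of earlier sites.
    Returns the per-site fractions and the fraction never terminated.
    """
    fractions = []
    still_growing = 1.0  # fraction of strands not yet terminated by ddC
    for _ in range(n_sites):
        fractions.append(still_growing * p_dd)
        still_growing *= 1.0 - p_dd
    return fractions, still_growing


# Hypothetical example: 1% chance of ddC at each of 5 template G residues.
fractions, full_length = termination_fractions(0.01, 5)
for k, f in enumerate(fractions, start=1):
    print(f"terminated at G number {k}: {f:.4%}")
print(f"never terminated by ddC: {full_length:.4%}")
```

In this model a small `p_dd` leaves nearly equal fractions at every G, whereas a large `p_dd` would terminate most strands at the first few G residues and leave little material for the longer fragments.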

The result is a solution containing a mixture of fragments, each ending with a ddC residue. The fragments differ in length, and locating the C residues then relies on precise electrophoretic methods that allow separation of DNA strands differing in size by only one nucleotide residue. (See Fig. 3-18 for a description of gel electrophoresis.) Note that in most sequencing protocols, the sequence obtained is that of the newly synthesized strand complementary to the template strand being analyzed.

When this procedure was first developed, the process was repeated separately for each of the four ddNTPs. Radioactively labeled primers allowed researchers to detect the DNA fragments generated during the DNA synthesis reactions. The sequence of the synthesized DNA strand was read directly from an autoradiogram of the resulting gel (Fig. 8-34c). Because shorter DNA fragments migrate faster, the fragments near the bottom of the gel represented the nucleotide positions closest to the primer (the $5^{'}$ end), and the sequence was read (in the $5^{'} \to 3^{'}$ direction) from bottom to top.
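The logic of reading such a gel can be sketched in Python. This is an idealized model, not a simulation of a real experiment: it assumes that every possible termination product is present and resolved as a single band. It uses the template shown in Fig. 8-34c, written in the 3'→5' direction; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template_3to5):
    """New strand (5'->3') built opposite a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def run_gel(template_3to5):
    """Map each ddNTP lane to the fragment lengths it would contain.

    Lengths count nucleotides added beyond the primer. A fragment of
    length n appears in lane X when position n of the new strand is X.
    """
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized_strand(template_3to5), start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read bands from the bottom (shortest) to the top (longest)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "CTAAGCTCGACT"  # template of Fig. 8-34c, written 3'->5'
lanes = run_gel(template)
print(lanes)            # {'A': [2, 7, 12], 'C': [5, 9], 'G': [1, 6, 8, 11], 'T': [3, 4, 10]}
print(read_gel(lanes))  # GATTCGAGCTGA
```

The band positions reproduce those drawn in the figure, and the sequence read from bottom to top is that of the newly synthesized strand, not of the template.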

DNA sequencing was first automated by a variation of the Sanger method, in which each of the four ddNTPs used for a reaction was labeled with a different-colored fluorescent tag (Fig. 8-35). With this technology, all four fluorescent ddNTPs could be introduced into a single reaction. The terminated fragments, each of a different size, could be separated by electrophoresis in a single gel lane. The identity of the residue that terminated each fragment was made evident by its fluorescent color. Researchers could sequence DNA molecules containing thousands of nucleotides in a few hours, and the entire genomes of hundreds of organisms were sequenced in this way. For example, in the Human Genome Project, researchers sequenced all $3.2 \times 10^{9}$ bp of the DNA in a human cell (see Chapter 9) in an effort that spanned nearly a decade and included contributions from dozens of laboratories worldwide. This form of Sanger sequencing is still used for routine analysis of short segments of DNA.

A figure shows the automation of D N A sequencing reactions. — FIGURE 8-35 Automation of DNA sequencing reactions. In the Sanger method, each ddNTP can be linked to a fluorescent (dye) molecule that gives the same color to all the fragments terminating in that nucleotide, with a different color for each nucleotide. All four labeled ddNTPs are added to the reaction mix together. The resulting colored DNA fragments are separated by size in an electrophoretic gel in a capillary tube (a refinement of gel electrophoresis that allows faster separations). All fragments of a given length migrate through the capillary gel together in a single band, and the color associated with each band is detected with a laser beam. The DNA sequence is read by identifying the color sequences in the bands as they pass the detector and feeding this information directly to a computer. The amount of fluorescence in each band is represented as a peak in the computer output. [Data from Dr. Lloyd Smith, University of Wisconsin–Madison, Department of Chemistry.]

At the top of the figure, six irregularly arranged blue lines are shown labeled template of unknown sequence. Above the left side of each blue line is a small yellow bar labeled primer with its 3 prime end on the right. An arrow points down from these bars, showing the addition of highlighted D N A polymerase, four lowercase d uppercase N uppercase T uppercase Ps, and four fluorescently labeled lowercase d lowercase d uppercase N uppercase T uppercase Ps. This produces six strands that each have one of the original blue lines beneath a new strand that is dark blue with a yellow piece from the primer on the left and a colored circle on the right containing a letter. The circles are labeled green A, yellow G, orange T, blue C, orange T, yellow G. An arrow labeled denature points downward. This produces six new strands, now stacked vertically, all varying in length with the yellow primer on the left, a blue segment, and a circle on the right labeled with a letter. Text reads, dye-labeled segments of D N A, copied from template with unknown sequence. An arrow points down to a long vertical cylinder that contains horizontal bands of different colors. From top to bottom, the bands are yellow, yellow, blue, red, green, green, green, yellow, blue, blue, orange, orange, yellow, yellow, orange, yellow, yellow, orange, green, yellow, orange, orange, orange, yellow with a laser beam running through it, orange, blue, blue. An arrow pointing from the top of the cylinder toward the bottom is labeled D N A migration. Near the bottom of the column on the left is a rectangular box above a smaller rectangular box, labeled detector. To the right of the cylinder, another rectangular box is labeled laser. A laser beam travels horizontally from an opening in the laser, across a yellow band in the cylinder, and into a circular opening in the detector. An arrow points down to a rectangular band labeled computer-generated result after bands migrate past detector.
The bottom horizontal axis of the band is labeled with the letters C, C, T, G, T, T, T, G, A, T, G, G, T, G, G, T, T, C, C, G, A, A, A, T, C, G, G. There is a colored peak above each band matching the color of the circles used previously. For example, the letter C is beneath a blue peak. All of the peaks are similar, but not identical, in height and extend to near the top of the horizontal axis. The top horizontal axis is labeled at 20 near the left, 30 near the middle, and 40 toward the right side.

DNA Sequencing Technologies Are Advancing Rapidly

The billions of base pairs in a complete human genome can now be sequenced in a day or two, the millions in a bacterial genome in a few hours. With modest expense, a personal genomic sequence can be routinely included in each individual’s medical record. These advances have been made possible by methods sometimes referred to as next-generation, or “next-gen,” sequencing. The sequencing strategies have some similarities to the Sanger method. Innovations have allowed a miniaturization of the procedure, a massive increase in scale, and a corresponding decrease in cost. Two widely used approaches, reversible terminator sequencing and single-molecule real-time (SMRT) sequencing, both developed commercially, are described here.

In both approaches, large genomes are sequenced by first collecting the DNA from many cells of the organism or individual. The DNA is sheared at random locations to generate fragments of a particular average size. The individual fragments — many with overlapping sequences — are immobilized on a solid support, and each is sequenced in place. Fluorescent dyes linked to the nucleotides and powerful optical systems that can detect incorporation of each new nucleotide by DNA polymerase allow the sequencing process to be monitored directly. The fluorescent dyes and the design of the methods solve both problems in DNA sequencing at once, identifying each nucleotide and fixing its location in the large sequence. Individual regions of a genome may be sequenced hundreds, even thousands, of times. The entire genomic sequence is reconstructed by computer programs that align the sequences of overlapping fragments.

In the reversible terminator sequencing method developed by Illumina, the genomic DNA to be sequenced is sheared so as to generate fragments a few hundred base pairs long. Synthetic oligonucleotides of known sequence are ligated to each end of each fragment, providing a point of reference on every DNA molecule. The individual fragments are then immobilized on a solid surface, and each is amplified in place by PCR to form a tight cluster of identical fragments. The solid surface is part of a channel on a flow cell that allows liquid solutions to stream over the samples. The result is a solid surface just a few centimeters wide with millions of attached DNA clusters, each cluster containing multiple copies of a single DNA sequence derived from a random genomic DNA fragment. To provide a starting point for DNA polymerase, an oligonucleotide primer is then added that is complementary to the oligonucleotides of known sequence ligated to the various fragment ends. All of these millions of clusters are sequenced at the same time, with the data from each cluster captured and stored by a computer.

The actual sequencing of each cluster employs reversible terminator sequencing (Fig. 8-36). Four different modified deoxynucleotides (A, T, G, and C), each with a particular fluorescent label that identifies the nucleotide by color, are added to the sequencing reaction, along with the DNA polymerase. The labeled nucleotides also incorporate blocking groups attached to their $3^{'}$ ends that permit only one nucleotide to be added to each strand. The polymerase adds the appropriate nucleotide to the strands in each cluster, giving each cluster a color that corresponds to the added nucleotide. Next, lasers excite all the fluorescent labels, and an image of the entire surface reveals the color (and thus the identity) of the base added to each cluster. The fluorescent label and the blocking groups are then chemically or photolytically removed. The surface goes dark until the solution with labeled nucleotides and DNA polymerase is again introduced to the surface, allowing the next nucleotide to be added to each cluster. The sequencing proceeds stepwise. Read lengths obtained with this technology (that is, the length of individual DNA sequences that can be accurately determined) are typically 100 to 300 nucleotides. Read length is limited by constraints on cluster density on the flow cell surface and by small inefficiencies in the PCR reactions needed to amplify each cluster accurately. Accuracy is high, with error rates as low as 0.1%.
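The stepwise cycle of addition, imaging, and deblocking can be sketched in Python. This is a simplified model that assumes error-free incorporation and perfectly synchronized strands within each cluster. The function name is illustrative, and the two example templates are hypothetical, chosen so that the resulting reads match the two sequences highlighted in Fig. 8-36b.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sequence_clusters(templates_3to5, n_cycles):
    """Return one read per cluster after n_cycles of single-base addition.

    Each template is the region just past the primer, written 3'->5'
    (the order in which the polymerase encounters it).
    """
    reads = ["" for _ in templates_3to5]
    for cycle in range(n_cycles):
        # Add blocked, labeled nucleotides: exactly one base per cluster.
        image = []
        for template in templates_3to5:
            if cycle < len(template):
                image.append(COMPLEMENT[template[cycle]])
            else:
                image.append(None)  # template exhausted: cluster stays dark
        # Image the surface: the color of each cluster identifies its base.
        for i, base in enumerate(image):
            if base is not None:
                reads[i] += base
        # Removal of labels and blocking groups is implicit here; the next
        # loop iteration represents the next round of nucleotide addition.
    return reads


clusters = ["ATGCCAGAG", "GGGGGGTCA"]  # hypothetical cluster templates
print(sequence_clusters(clusters, 9))  # ['TACGGTCTC', 'CCCCCCAGT']
```

Because every cluster advances by exactly one base per cycle, the cycle number fixes the position of each base and the color fixes its identity, which is how the method addresses both sequencing problems at once.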

A five-part figure, a, b, c, d, and e, shows next-generation reversible terminator sequencing. — FIGURE 8-36 Next-generation reversible terminator sequencing. (a) Blocking groups on each fluorescently labeled nucleotide prevent multiple nucleotides from being added in a single cycle. (b) Artist’s rendition of nine successive cycles from one very small part of an Illumina sequencing run. Each colored spot represents the location of a cluster of immobilized identical oligonucleotides affixed to the surface of the flow cell. The white-circled spots represent the same two clusters on the surface over successive cycles, with the sequences indicated. Data are recorded and analyzed digitally. (c) Typical flow cell used for a next-generation sequencer. Millions of DNA fragments can be sequenced simultaneously in each of the four channels. (d) A dCTP molecule modified with a fluorescent dye and a $3^{'}$ blocking group for use in reversible terminator sequencing. Both the dye and the $3^{'}$-end-blocking group can be removed, either chemically or photolytically, leaving a free $3^{'}$-OH group for addition of the next nucleotide. The modified nucleotides currently used in reversible terminator sequencing are proprietary. (e) Part of the surface of one channel during a sequencing reaction.

Part a shows a series of steps. Text at the top of the first step reads, add blocked, fluorescently labeled nucleotides. Brightly colored C, A, T, and G bases are shown, with an arrow pointing to a vertical D N A strand with a longer right strand and the bases shown in faded color. The left strand has its 5 prime end at the top and its 3 prime end at the bottom. The bases are A, C, G, and highlighted T. Text below reads, thymine nucleotide added; fluorescent color observed and recorded. From top to bottom, the right strand has T, G, C, A, T, G, C, A, T, G, C, A and ends at the 5 prime end. Step 1: Remove labels and blocking groups; wash; add blocked, labeled nucleotides. Brightly colored C, A, T, and G are added. The vertical piece of D N A looks the same except that the brightly colored T has been replaced by a lighter T, like the other bases in the molecule, and a brightly colored A has been added beneath it. Text reads, adenine nucleotide added; fluorescent color observed and recorded. In the next step, four new brightly colored nucleotides are added. Text reads, remove labels and blocking groups; wash; add blocked, labeled nucleotides. The A now looks like the other nucleotides and a brightly colored C has been added below it. Text reads, cytosine nucleotide added; fluorescent color observed and recorded. In the next step, four new brightly colored nucleotides are added. Text reads, remove labels and blocking groups; wash; add blocked, labeled nucleotides. The C now looks like the other nucleotides and a brightly colored G has been added below it. Text reads, guanine nucleotide added; fluorescent color observed and recorded. Part b shows a series of 9 squares. Each square contains many bright dots arranged randomly, of which one is circled at the upper left and another is circled at the lower right. On the left, text reads, lowercase d uppercase N uppercase T uppercase P incorporated. 
Beneath this, text reads T A C G G T C T C: These letters match the top colored dots in the squares, so the left-hand square has a yellow dot matching T, the second square has a blue dot matching A, and so on. Beneath this list of bases, a second list of bases reads, C C C C C C A G T: The colors of these bases match the colors of the bottom circled dots in each square. Part c shows a rectangular device labeled illumina registered trademark symbol l on the top left and a box on the top right labeled S 4. The device contains an open region in the center with four vertical rods lined up in parallel, each pointed at the top and bottom. The open region has two vertical rectangular openings at the top and two at the bottom. Part d shows a molecule labeled 3 prime O allyl d C T P allyl Bodipyl F L 510. The molecule has a five-membered sugar ring with O at the top vertex between C 4 prime and C 1 prime. C 1 prime is bonded to N above at position 1 of cytosine. C 3 prime is bonded to O below the ring that is further bonded to C H 2 that is bonded to C H that is double bonded to another C. C 4 prime is bonded to O that begins a chain of three phosphates. Each phosphate consists of P bonded to O minus above, double bonded to O below, and bonded to the left to O that is bonded to P of the next phosphate. The final P is bonded to the left to O minus. Cytosine has N at position 1, C double bonded to O at position 2, N at position 3, C at position 4 bonded to N H 2, double bonds at positions 3 to 4 and 5 to 6, and C 5 bonded to C that is triple bonded to C that is bonded to C H 2 that is further bonded to N H that is further bonded to C that is double bonded to O above and further bonded to O.
The O is bonded to C H 2 that is bonded to C that is double bonded to C below and further bonded to C H 2 on the right that is bonded to C H 2 that is bonded to N H that is further bonded to C that is double bonded to O and further bonded to C H 2 that is bonded to C H 2 that is further bonded to a blue highlighted three-ring structure. The blue-highlighted structure has a five-membered ring joined to a six-membered ring that is joined to a five-membered ring. The left-hand five-membered ring is bonded to the nonhighlighted chain from its bottom left vertex, has double bonds at its top left and bottom sides, shares its right side with the six-membered ring, and has N substituted for C at the bottom right vertex shared with the six-membered ring. The six-membered ring has a double bond at its top left side, N substituted for C at the bottom left and bottom right vertices, a right side shared with the right-hand five-membered ring, and B at the bottom vertex that is further bonded to 2 F. The right-hand five-membered ring has double bonds at its top side and lower right side, C H 3 bonded to its top right vertex, and C H 3 bonded to its bottom vertex. Part e shows many densely packed bright dots against a black background.

The relatively short read lengths produced by the Illumina technology are problematic in some situations, such as the sequencing of long stretches of DNA where short sequences are repeated over and over. Pacific Biosciences has pioneered the single-molecule real-time (SMRT) sequencing method, which allows average read lengths of up to 30,000 to 40,000 bp (Fig. 8-37). The SMRT technology has a lower throughput, a higher cost, and a higher error rate than the Illumina approach. However, the very long read lengths are essential in some applications, particularly to reconstruct the complete genomes of higher organisms that may contain extensive regions of repeated DNA sequences. They also facilitate the detection of genomic alterations (deletions, duplications, or rearrangements of genomic segments) that arise in some cells, such as those in cancerous tumors (see Box 24-2).

A three-part figure, a, b, and c, shows S M R T sequencing. — FIGURE 8-37 SMRT sequencing. (a) One pore in a SMRT cell. The pore is smaller in diameter than the wavelength of visible light, so that light projected at the bottom penetrates only a short distance into the pore. (b) DNA fragments sequenced by SMRT technology. Hairpin oligonucleotides are ligated to both ends. A primer for DNA synthesis is annealed to one and sometimes both single-stranded regions in the hairpin ends. (c) DNA synthesis by a DNA polymerase immobilized within a SMRT pore. A fluorescent dye is attached to the triphosphate of the nucleoside triphosphates used. The dye is released from each nucleotide incorporated into the growing DNA strand, and its wavelength is recorded.

Part a shows a cylinder in cutaway beneath four circles representing the tops of similar cylinders. The cylinder contains a half oval cone that is rounded at the top and wide at the bottom. The interior of the oval is blue. There is a dotted line around the circumference of the bottom of the cylinder with a small layer of blue beyond that has a layer of yellow beneath and a red surface at the bottom. To the right, a scale is shown that shows colors ranging from red at the top to dark blue at the bottom. Text above the scale reads, 1/100 n m. The scale is labeled from 0.01 near the bottom to 0.06 near the top at increments of 0.01. Part b shows a vertical double helix with an uncoiled loop at the top and a loop at the bottom. A molecule labeled Greek letter phi 29 D N A polymerase is shown on the right of the loop with two oval halves and a narrower center through which the strands pass. The bottom half of the loop, extending from the D N A polymerase, has a new strand coiled with it to form a double helix. At the left side of the loop, the new strand separates and extends up alongside the double stranded molecule. Part c shows a cylinder similar to that in part a with four circles above it. Greek letter phi 29 D N A polymerase is visible at the bottom with a long vertical double helix extending vertically upward from it on the right and a single new strand labeled product extending up vertically from it on the left. The blue cone is smaller, and the bottom of the cylinder is entirely white.

SMRT sequencing utilizes SMRT cells with 150,000 pores, each pore smaller in diameter (~70 nm) than the wavelength of visible light. Attenuated light from an excitation beam penetrates only the lower 20 to 30 nm in each pore, producing a light volume small enough to accommodate just one DNA polymerase molecule (Fig. 8-37a). A single DNA polymerase is immobilized at the bottom of each pore. Genomic DNA is sheared to generate random fragments that are tens of thousands of base pairs in length. Hairpin oligonucleotides are ligated to both ends of each fragment, so that the two DNA strands are joined in one continuous circle (Fig. 8-37b). A primer is annealed to the open single-stranded DNA at one end of the fragment, and this is captured by a DNA polymerase in one of the pores to initiate DNA synthesis.

Fluorescent nucleotides are introduced, with A, T, G, and C each having a specific colored dye attached to the triphosphate. When a nucleotide binds to the DNA polymerase in the pore, it is immobilized long enough to produce a brief fluorescent light pulse that can be read by a detector (Fig. 8-37c). The fluorescent dye is released with the pyrophosphate as each new phosphodiester bond is formed. The error rate is high, about 10% to 15%. However, as the DNA template is a continuous circle, one DNA polymerase can replicate the same fragment over and over. The light pulses continue without interruption as new nucleotides are added in real time, and each pore thus generates a movie of light pulses (sometimes several hours long) that corresponds to the repeated sequencing of the DNA fragment bound in that pore. A computer program deletes the known sequences of the hairpin ends. Error is reduced by compiling a consensus sequence of the fragment by automated alignment of the many repeated sequencing passes and acceptance of the most common nucleotide signal detected at each position.
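The consensus step can be illustrated with a minimal Python sketch. It assumes, as a simplification, that the repeated passes have already been aligned and trimmed to equal length, so that each position can be settled by a simple majority vote. Real reads contain insertions and deletions that require a true alignment first. The function name and the example passes are invented for illustration.

```python
from collections import Counter


def consensus(passes):
    """Majority-vote consensus over equal-length, pre-aligned passes.

    Ties are broken in favor of the base seen first at that position,
    so an odd number of passes gives the cleanest result.
    """
    if not passes:
        return ""
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be pre-aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five hypothetical passes over the same fragment, each with at most one error.
passes = [
    "GATTCGAG",
    "GATTCGTG",  # error at position 7
    "GACTCGAG",  # error at position 3
    "GATTCGAG",
    "GATACGAG",  # error at position 4
]
print(consensus(passes))  # GATTCGAG
```

Because errors in individual passes fall at different positions, the most common signal at each position recovers the underlying sequence even though three of the five passes contain an error.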

Translating the sequences of millions of short DNA fragments into a complete and contiguous genomic sequence requires the computerized alignment of overlapping fragments (Fig. 8-38). The number of times that a particular nucleotide in a genome is sequenced, on average, is referred to as the sequencing depth or sequencing coverage. In many cases, a sufficiently large number of random fragments are sequenced so that each nucleotide in the genome is sequenced an average of hundreds to thousands of times (100× to 1,000× coverage). Although the coverage of particular nucleotides may vary, a high level of coverage ensures that most sequencing errors will be detected and eliminated. The overlaps allow the computer to trace the sequence through a chromosome, from one overlapping fragment to another, permitting the assembly of long, contiguous sequences called contigs. In a successful genomic sequencing exercise, many contigs can extend over millions of base pairs.
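Both ideas, average coverage and assembly by overlap, can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: reads are error-free, all come from the same strand, and a simple greedy merge of the best-overlapping pair stands in for the far more elaborate algorithms used by real assemblers. All names and example reads are hypothetical.

```python
def mean_coverage(n_reads, read_length, genome_length):
    """Average number of times each nucleotide is sequenced."""
    return n_reads * read_length / genome_length


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["GATTCGAG", "CGAGCTGA", "TTCGAGCT"]  # hypothetical overlapping reads
print(mean_coverage(len(reads), 8, 12))  # 2.0
print(greedy_assemble(reads))            # ['GATTCGAGCTGA']
```

Greedy merging can be misled by repeated sequences, since a repeat creates overlaps between reads that are not actually adjacent in the genome. This is one reason longer reads are valuable for assembling repeat-rich genomes.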

A figure shows sequence assembly. — FIGURE 8-38 Sequence assembly. In a genomic sequence, each base pair of the genome is usually represented in multiple sequenced fragments, referred to as reads. This schematic shows how the overlaps between reads are used to assemble a contiguous segment of the genomic sequence, or contig. The numbers at the top represent base-pair positions in the genome, relative to an arbitrarily defined reference point. All sequence fragments come from a particular long contig. The reads are represented by horizontal colored bars. DNA strand segments are sequenced at random, with sequences obtained from one strand ($5^{'}$ to $3^{'}$, left to right) or the other strand ($5^{'}$ to $3^{'}$, right to left) represented by blue lines or red lines, respectively. The coverage bar at the top indicates how many times the sequence at a particular position has appeared in a sequenced read, with higher numbers corresponding to increased quality of the output sequence data.

A strip across the top of the figure has a line labeled at 6000, 6400, 6800, and 7200. At the top left, text reads, selection: 7947 arrow pointing right to 8059 equals 112. Beneath the axis with the numerical labels, text reads coverage threshold above an open box on the left with the word conflicts to its right. To the right, there is a green line that is thin, then thicker from about 4900 to 6050, then thinner, then thicker from 6300 to the right end of the line. At the bottom of this strip, text reads depth of coverage and a box to the right has 10 marked halfway up its vertical axis. The strip has green on its bottom half that varies in height from the left to the right. Beneath this strip, there are many horizontal red and blue boxes of varying lengths. They begin at the upper left with one red box above a shorter red box above a very short blue box, above a longer blue box, above a slightly shorter red box, above a longer red box, above a short red box. Next, a blue box begins under the right end of the bottom red box. A new blue box begins under its right side, and then an orange box begins slightly to the right of the beginning of the blue box. This pattern continues with each box beginning to the right of the beginning point of the previous box until the boxes end at the bottom right of the graph, forming a diagonal pattern from the upper left to the lower right.

The rapid evolution of DNA sequencing technologies shows no signs of slowing. As costs plummet and sensitivity increases, new applications come online to enhance medicine, forensics, archaeology, and many other fields. We present some of those applications in Chapter 9.

SUMMARY 8.3 Nucleic Acid Chemistry

Native DNA undergoes reversible unwinding and separation of strands (melting) upon heating or at extremes of pH. DNAs rich in $G \equiv C$ pairs have higher melting points than DNAs rich in $A = T$ pairs.
DNA is a relatively stable polymer. Spontaneous reactions such as deamination of certain bases, hydrolysis of base-sugar N-glycosyl bonds, radiation-induced formation of pyrimidine dimers, and oxidative damage occur at very low rates, yet are important because of a cell’s very low tolerance for changes in its genetic material.
DNA is subject to enzymatic modification of nucleotide bases at particular locations. Methylated bases are common.
Oligonucleotides of known sequence can be synthesized rapidly and accurately.
The polymerase chain reaction (PCR) provides a convenient and rapid method for amplifying segments of DNA if the sequences of the ends of the targeted DNA segment are known.
Routine DNA sequencing of genes or short DNA segments is carried out using an automated variation of Sanger dideoxy sequencing.
DNA sequences, including entire genomes, can be efficiently determined in hours or days using commercial next-gen sequencing technologies.