4.5 Determination of Protein and Biomolecular Structures in Chapter 4 The Three-Dimensional Structure of Proteins

In this chapter we have presented many types of protein structures. How were these structures determined? Structural biology is the study of the three-dimensional structures of biomolecules, including proteins, nucleic acids, lipid membranes, and oligosaccharides. Structural biologists combine biochemical approaches with physical tools and computational methods to obtain these structures. Structural biology is extraordinarily powerful for elucidating the relationships between the structure and function of proteins, the molecular basis for enzymatic catalysis and ligand binding, and evolutionary relationships between proteins. Here we focus primarily on three commonly used methods in structural biology: x-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). Which method a structural biologist uses depends on the system being studied and what information is to be learned. Often structural biologists combine multiple methods to provide a more complete view of function.

Increasingly, computational methods such as molecular dynamics simulations and in silico protein folding are proving to be essential for structural biologists, and these are discussed in Box 4-5.

Box 4-5

Video Games and Designer Proteins

Computational tools are now indispensable to biochemistry. While advances in computers and software have revolutionized how protein structures can be solved by x-ray crystallography, NMR, and cryo-EM, advances in computational chemistry have now allowed a number of studies of protein structure to be carried out entirely in silico using powerful molecular modeling and dynamics software. Key developments in this field were made by Martin Karplus, Michael Levitt, and Arieh Warshel, who received the Nobel Prize in Chemistry in 2013 for their work on the theoretical and computational tools needed to carry out computer simulations of molecules as large as proteins.

Protein folding is a problem particularly well suited to computational biochemistry. Protein-folding landscapes can be complex, and a nearly infinite number of peptide conformations are theoretically available for the protein to sample as it folds. Understanding how folding proceeds efficiently, sampling only a small portion of all possible conformations, requires computers that can quickly test tens of thousands of possible folding trajectories. Sometimes these calculations are carried out on powerful supercomputers, but in other instances research groups have crowdsourced these problems to thousands of citizen scientists.
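To give a feel for why exhaustive sampling is out of reach, the following minimal sketch counts conformations under a deliberately crude assumption that is not taken from this text: each residue independently adopts one of three backbone states, and a hypothetical search tests 10^13 conformations per second. Both numbers, and the function names, are illustrative only.

```python
def conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Number of chain conformations if every residue picks a state independently.

    The default of three states per residue is an illustrative assumption.
    """
    return states_per_residue ** n_residues


def exhaustive_search_years(n_conformations: int, samples_per_second: float = 1e13) -> float:
    """Years needed to visit every conformation once at an assumed sampling rate."""
    seconds = n_conformations / samples_per_second
    return seconds / (3600 * 24 * 365.25)


if __name__ == "__main__":
    total = conformations(100)  # a small, 100-residue protein
    print(f"{total:.2e} possible conformations")
    print(f"{exhaustive_search_years(total):.2e} years to sample them all")
```

Under these assumptions the search time for a 100-residue chain is on the order of 10^27 years, which is why folding simulations must explore trajectories selectively rather than enumerate conformations.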

David Baker has pioneered crowdsourcing protein-folding problems to engage millions of computer users directly in biochemistry. The Rosetta@home project distributes complex protein-folding problems to citizen scientists who run the software in the background on their home computers. Each home computer is able to complete a portion of a protein-folding experiment, and these results can be combined with those from other users to predict protein structures.

Foldit is a video game in which users compete to solve protein-folding puzzles (Fig. 1). Users are rewarded for making stabilizing contacts in the structure, such as hydrogen bonds or favorable van der Waals interactions. Virtual protein designs (sometimes called theozymes) from the video game can then be tested in the real world. Genes encoding the virtual proteins can be synthesized in the laboratory and used to produce the proteins recombinantly in bacteria, using techniques described in Chapter 9. The purified proteins can then be structurally characterized by x-ray crystallography or NMR to see how closely the real-world structures and computationally designed structures match.

FIGURE 1 Foldit uses a video game interface to crowdsource protein-folding problems. Proteins designed in the video game can be recreated in the laboratory and studied using biochemical and structural methods.

One exciting application of computational biochemistry is the creation of “designer proteins.” These are proteins with completely novel folds or functions not yet identified in nature. Designer proteins have potential applications in bioengineering, medicine, materials science, and chemistry. In one recent example, the Baker Lab designed a series of Foldit puzzles (such as “Cover the Ligand,” in which users had to redesign a ligand-binding site by rebuilding the protein backbone and changing amino acid side chains) to improve the catalytic efficiency of a previously engineered Diels-Alderase (Fig. 2). The Diels-Alder reaction is a common method for organic chemists to create carbon–carbon bonds; however, it is almost never used by naturally occurring enzymes. Diels-Alderase enzymes would potentially be useful for carrying out environmentally friendly carbon–carbon bond-forming reactions in water instead of toxic organic solvents. Foldit competitors used the video game to design thousands of virtual enzymes. The players were successful: once their top virtual enzyme was produced and tested in the real world, it was found to be more than 18-fold more active than the original Diels-Alderase. Remarkably, the protein structure created by the Foldit game players was also confirmed by x-ray crystallography. This is definitely a case where we can encourage biochemistry students to spend more time playing video games!

FIGURE 2 The Diels-Alder reaction catalyzed by an enzyme designed by Foldit players. Diels-Alder reactions are commonly used by organic chemists to form C–C bonds; however, there are very few examples of enzymes catalyzing this type of chemistry in nature. [Information from C. B. Eiben et al., Nature Biotechnol. 30:190, 2012.]

The bond-line structure of a molecule has a benzene ring at the bottom with C O O minus bonded to its bottom vertex and a chain extending up from its top vertex. The chain is a skeletal formula with C H 2 bonded to O on the right that is further bonded to C that is double bonded to O on the right and further bonded to N bonded to H on the left. N is further bonded to C that is double bonded to C on the left that is bonded to C above that is double bonded to C on the right. This is added to a molecule with N bonded to 2 C H 2 below and bonded to C above that is double bonded to C on the right and bonded to C on the left that is further double bonded to C above. In a reversible reaction, this produces the following product. A benzene ring has C O O minus bonded to its bottom vertex and a chain extending up from its top vertex. The chain is a skeletal formula with C H 2 that is further bonded to O on the left that is further bonded to C that is double bonded to O on the left and further bonded to N H on the right that is further bonded to the bottom vertex of a six-membered ring that has a double bond on the left side and a chain bonded to its lower right vertex with C double bonded to O on the right and further bonded to N below that is bonded to 2 C H 3 below.

X-ray Diffraction Produces Electron Density Maps from Protein Crystals

The spacing of atoms in a crystal lattice can be determined by measuring the locations and intensities of spots produced on a detector by a beam of x-rays of given wavelength, after the beam has been diffracted by the electrons of the atoms. For example, x-ray analysis of sodium chloride crystals shows that $\mathrm{Na^{+}}$ and $\mathrm{Cl^{-}}$ ions are arranged in a simple cubic lattice. The spacing of the different kinds of atoms in complex organic molecules, even very large ones such as proteins, can also be analyzed by x-ray diffraction methods. However, the technique for analyzing crystals of complex molecules is far more laborious than the technique for analyzing simple salt crystals. When the repeating pattern of the crystal is a molecule as large as, say, a protein, the numerous atoms in the molecule yield thousands of diffraction spots that must be analyzed by computer.

Consider how images are generated in a light microscope. Light from a point source is focused on an object. The object scatters the light waves, and these scattered waves are recombined by a series of lenses to generate an enlarged image of the object. The smallest object whose structure can be determined by such a system — that is, the resolving power of the microscope — is determined by the wavelength of the light, in this case visible light, with wavelengths in the range of 400 to 700 nm. Objects smaller than half the wavelength of the incident light cannot be resolved. To resolve objects as small as proteins we must use x-rays, with wavelengths in the range of 0.7 to 1.5 Å (0.07 to 0.15 nm). However, there are no lenses that can recombine x-rays to form an image; instead, the pattern of diffracted x-rays is collected directly and an image is reconstructed by mathematical techniques.
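The half-wavelength rule stated above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical, and the C–C single-bond length of about 0.154 nm used for comparison is a standard textbook value rather than a number from this section.

```python
def resolution_limit_nm(wavelength_nm: float) -> float:
    """Smallest resolvable feature, taken as half the incident wavelength."""
    return wavelength_nm / 2.0


def can_resolve(feature_nm: float, wavelength_nm: float) -> bool:
    """True if a feature of the given size is at or above the resolution limit."""
    return feature_nm >= resolution_limit_nm(wavelength_nm)


if __name__ == "__main__":
    c_c_bond_nm = 0.154  # approximate C-C single bond length (assumed reference value)
    for label, wavelength in [("violet light", 400.0), ("red light", 700.0),
                              ("x-ray, 1.5 angstrom", 0.15), ("x-ray, 0.7 angstrom", 0.07)]:
        limit = resolution_limit_nm(wavelength)
        verdict = "yes" if can_resolve(c_c_bond_nm, wavelength) else "no"
        print(f"{label}: limit {limit} nm, resolves a C-C bond: {verdict}")
```

Visible light cannot resolve anything much smaller than about 200 nm, whereas x-rays in the stated range have limits below the length of a single covalent bond.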

The amount of information obtained from x-ray crystallography depends on the degree of structural order in the sample. Some important structural parameters were obtained from early studies of the diffraction patterns of the fibrous proteins arranged in regular arrays in hair and wool. However, the orderly bundles formed by fibrous proteins are not crystals — the molecules are aligned side by side, but not all are oriented in the same direction. More-detailed three-dimensional structural information about proteins requires a highly ordered protein crystal. The structures of many proteins are not yet known, simply because they have proved difficult to crystallize. Practitioners have compared making protein crystals to holding together a stack of bowling balls with cellophane tape.

Operationally, there are several steps in x-ray structural analysis (Fig. 4-30). A crystal is placed in an x-ray beam between the x-ray source and a detector, and a regular array of spots, called reflections, is generated. The spots are created by the diffracted x-ray beam, and each atom in a molecule makes a contribution to each spot. An electron-density map of the protein is reconstructed from the overall diffraction pattern of spots by a mathematical technique called a Fourier transform. In effect, the computer acts as a “computational lens.” A model for the structure is then built that is consistent with the electron-density map.
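The role of the Fourier transform as a "computational lens" can be illustrated in one dimension. The sketch below is a toy, not a crystallographic program: it builds a made-up electron density with two Gaussian "atoms" in a unit cell, computes the complex structure factors, and then sums them back into a density map. All names and numbers are illustrative assumptions. Note one simplification: the sketch keeps the phase of every structure factor, whereas a detector records only spot intensities, so in a real experiment the phases must be estimated by other means.

```python
import cmath
import math


def toy_density(n=32, atoms=((8, 6.0), (20, 8.0)), width=1.5):
    """One-dimensional 'unit cell' with Gaussian peaks at assumed atom positions."""
    return [sum(z * math.exp(-((x - pos) ** 2) / (2 * width ** 2)) for pos, z in atoms)
            for x in range(n)]


def structure_factors(density):
    """Forward transform: one complex structure factor F(h) per reflection index h."""
    n = len(density)
    return [sum(density[x] * cmath.exp(2j * math.pi * h * x / n) for x in range(n))
            for h in range(n)]


def intensities(factors):
    """What a detector would record: |F(h)|^2, with the phase information lost."""
    return [abs(f) ** 2 for f in factors]


def fourier_synthesis(factors):
    """Inverse transform: sum the structure factors back into a density map."""
    n = len(factors)
    return [(sum(factors[h] * cmath.exp(-2j * math.pi * h * x / n) for h in range(n)) / n).real
            for x in range(n)]


if __name__ == "__main__":
    rho = toy_density()
    f = structure_factors(rho)
    rebuilt = fourier_synthesis(f)
    worst = max(abs(a - b) for a, b in zip(rho, rebuilt))
    print(f"largest difference between original and rebuilt map: {worst:.2e}")
```

With amplitudes and phases both available, the rebuilt map matches the starting density to within rounding error; model building then proceeds from such a map.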

A four-part figure, a, b, c, and d, shows the steps in determining the structure of sperm whale myoglobin by x-ray crystallography. Part a shows an x-ray diffraction pattern, part b shows a three-dimensional electron-density map of heme created using data from the diffraction pattern, part c shows how regions of greatest electron density can be used to work out the structure, and part d shows the completed structure. — FIGURE 4-30 Steps in determining the structure of sperm whale myoglobin by x-ray crystallography. (a) X-ray diffraction patterns are generated from a crystal of the protein. (b) Data extracted from the diffraction patterns are used to calculate a three-dimensional electron-density map. The electron density of only part of the structure, the heme, is shown here. (c) Regions of greatest electron density reveal the location of atomic nuclei, and this information is used to piece together the final structure. Here, the heme structure is modeled into its electron-density map. (d) The completed structure of sperm whale myoglobin, including the heme. [(a, b, c) Photo and data from George N. Phillips, Jr., University of Wisconsin–Madison, Department of Biochemistry. (d) Data from PDB ID 2MBW, E. A. Brucker et al., *J. Biol. Chem.* 271:25,419, 1996.]

Part a shows a bright circle against a dark background. There is a dark rod extending up from its center out of the frame. There are many small bright dots. A lighter circular ring is visible within the overall bright circle. An arrow beneath a picture of a computer screen points to part b, which shows an irregular shape made of many lines. There is a circle in the center within an X-shaped opening in a larger roughly circular structure with protrusions on all sides and openings in the center of the top, left, and right portions. An arrow beneath a picture of a computer screen points to part c, which shows the same structure with a chemical structure overlain on it. At the top, left, right, and bottom, there are hexagonal rings. These surround the openings on the top, left, and right portions. Each ring has its vertex pointing into the center and is connected to the next ring by two segments that form a V with its apex outward. Each ring has two vertices joined with the rings on its sides and a short segment extending from one vertex. The left-hand ring has a V-shaped segment extending from the bottom left vertex, the top ring has a similar structure extending from the top left vertex, the right-hand ring has a longer segment that extends right and then down extending from its lower right vertex, and the bottom ring has a longer segment similar to that of the right ring extending from its right bottom vertex. An arrow beneath a picture of a computer screen points to part d, where helices are shown with a ball-and-stick model in the upper center right that contains four large rings in the center and four smaller rings toward the outside. The ball-and-stick model resembles the structure from part c.

John Kendrew found that the x-ray diffraction pattern of crystalline myoglobin (isolated from muscles of the sperm whale) is highly complex, with nearly 25,000 reflections. Computer analysis of these reflections took place in stages. The resolution improved at each stage until, in 1959, the positions of virtually all the nonhydrogen atoms in the protein had been determined. The amino acid sequence of the protein, obtained by chemical analysis, was consistent with the molecular model. Over 100,000 protein structures, many of them much more complex than myoglobin, have since been determined to a similar level of resolution by x-ray crystallography.

The physical environment in a crystal, of course, is not identical to that in solution or in a living cell. A crystal imposes a space and time average on the structure deduced from its analysis, and x-ray diffraction studies provide little information about molecular motion within the protein. The conformation of proteins in a crystal can also be affected by nonphysiological factors such as incidental protein-protein contacts within the crystal. However, when structures derived from the analysis of crystals are compared with structural information obtained by other means (such as NMR, as described below), the crystal-derived structure almost always represents a functional conformation of the protein.

Distances between Protein Atoms Can Be Measured by Nuclear Magnetic Resonance

An advantage of nuclear magnetic resonance (NMR) studies is that they are carried out on macromolecules in solution, whereas x-ray crystallography is limited to molecules that can be crystallized. NMR can also illuminate the dynamic side of protein structure, including conformational changes, protein folding, and interactions with other molecules.

NMR is a manifestation of nuclear spin angular momentum, a quantum mechanical property of atomic nuclei. Only certain atoms, including $^{1}\mathrm{H}$, $^{13}\mathrm{C}$, $^{15}\mathrm{N}$, $^{19}\mathrm{F}$, and $^{31}\mathrm{P}$, have the kind of nuclear spin that gives rise to an NMR signal. Nuclear spin generates a magnetic dipole. When a strong, static magnetic field is applied to a solution containing a single type of macromolecule, the magnetic dipoles are aligned in the field in one of two orientations: parallel (low energy) or antiparallel (high energy). A short (∼10 μs) pulse of electromagnetic energy of suitable frequency (the resonant frequency, which is in the radio frequency range) is applied at right angles to the nuclei aligned in the magnetic field. Some energy is absorbed as nuclei switch to the high-energy state, and the absorption spectrum that results contains information about the identity of the nuclei and their immediate chemical environment. The data from many such experiments on a sample are averaged, increasing the signal-to-noise ratio, and an NMR spectrum such as that in Figure 4-31 is generated.
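The resonant frequency mentioned above scales linearly with the applied field, and the proportionality constant differs for each nucleus. The sketch below applies that relation, frequency = (γ/2π) × B, using rounded literature values of γ/2π that are not given in this text and should be read as approximate. The table and function names are hypothetical, and the 14.1 T field is simply an example.

```python
# Approximate magnitudes of gamma / (2 * pi) in MHz per tesla (rounded literature
# values, supplied here as assumptions). The value for 15N is negative in sign;
# only its magnitude matters for the resonant frequency.
GAMMA_OVER_2PI_MHZ_PER_T = {
    "1H": 42.58,
    "13C": 10.71,
    "15N": 4.32,
    "19F": 40.1,
    "31P": 17.24,
}


def resonant_frequency_mhz(nucleus: str, field_tesla: float) -> float:
    """Resonant (Larmor) frequency in MHz for a nucleus in a static field."""
    return GAMMA_OVER_2PI_MHZ_PER_T[nucleus] * field_tesla


if __name__ == "__main__":
    field = 14.1  # tesla; an example magnet strength
    for nucleus in GAMMA_OVER_2PI_MHZ_PER_T:
        print(f"{nucleus}: {resonant_frequency_mhz(nucleus, field):.1f} MHz")
```

All of these frequencies fall in the radio frequency range, and the well-separated values explain why each type of nucleus can be observed selectively.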

A four-part figure, a, b, c, and d, shows how NMR spectra are used to determine the structure of a globin. Part a shows a one-dimensional $^{1}\mathrm{H}$ NMR spectrum, part b shows two-dimensional NMR data, part c shows how this information is used to work out part of the three-dimensional structure of the molecule, and part d shows the complete three-dimensional structure with multiple lines to represent the family of consistent structures. — FIGURE 4-31 NMR spectra of a globin from a marine blood worm. (a) One-dimensional $^{1}\mathrm{H}$ NMR spectrum. (b) Two-dimensional NMR data used to generate a three-dimensional structure of globin. The diagonal in a two-dimensional NMR spectrum is equivalent to a one-dimensional spectrum. The off-diagonal peaks are NOE signals generated by close-range interactions of $^{1}\mathrm{H}$ atoms that may generate signals quite distant in the one-dimensional spectrum. Two such interactions are identified in (b), and their identities are shown with blue lines in (c). Three lines are drawn for interaction 2 between a methyl group in the protein and a hydrogen on the heme. The methyl group rotates rapidly such that each of its three hydrogens contributes equally to the interaction and the NMR signal. Such information is used to determine the complete three-dimensional structure, as in (d). The multiple lines shown for the protein backbone in (d) represent the family of structures consistent with the distance constraints in the NMR data. [Data from (a, b) B. F. Volkman, National Magnetic Resonance Facility at Madison; (c) PDB ID 1VRF; (d) PDB ID 1VRE, B. F. Volkman et al., *Biochemistry* 37:10,906, 1998.]

Part a shows a graph with $^{1}\mathrm{H}$ chemical shift on the horizontal axis ranging from 10 to beyond minus 2, labeled in increments of 2. The line begins with small peaks, then has two slightly larger peaks at 8 and just past 8, then rises to two peaks at 7.4 and 6.3, then decreases with two small peaks, then has a peak at about one-fourth the height of the vertical axis at 4.9, then the entire line forms a curve with many small peaks up to a height of just under half the height of the vertical axis at 4.0, then it decreases to a height of about one-eighth the height of the vertical axis at 3.2, has a sharp peak at just under half the height of the vertical axis at 3.0, then increases to a sharp peak of one-half the height of the vertical axis at 1.8, then has multiple peaks up to seven-eighths of the height of the vertical axis at 1.5, drops to about half the height of the vertical axis at 1.4, then to about one-third the height of the vertical axis before a sharp peak at almost the top of the vertical axis at 0.8, then a peak at just over half the height of the vertical axis at 0.5, then a decrease with smaller peaks to just above the horizontal axis at 0.0 with a few smaller peaks until minus 1 and then a flat horizontal line until two small peaks at minus 2.2. Part b is a graph that plots $^{1}\mathrm{H}$ chemical shift in ppm on the horizontal axis from 10.0 to beyond minus 2, labeled in increments of 2, against $^{1}\mathrm{H}$ chemical shift in ppm on the right vertical axis with the same scale. A diagonal line extends from just past (10.0, 10.0) to (minus 1.0, minus 1.0). Small dots are visible through the graph, especially in a region between (5.0, 5.0) and (1.0, 5.0) at the bottom and (5.0, 1.0), (1.0, 1.0) at the top. A dotted vertical line extends from (7.0, 7.0) on the diagonal line to a square labeled 1 at (7.0, 4.0). A dotted horizontal line extends from the diagonal line at (4, 4) to the same square. A dotted vertical line extends from (9.5, 9.5) on the diagonal line to square 2 at (9.5, minus 1). A dotted horizontal line extends from (minus 1.0, minus 1.0) to the same point. All data are approximate. An arrow under a computer points down to part c, which shows a ball-and-stick model of multiple rings in a plane next to another ball-and-stick model with several spheres connected by lines, with helices in the background. An arrow under a computer points to part d, which shows a structure outlined by many lines forming roughly four pieces at the upper left, upper right, lower left, and lower right and with a highlighted multiring structure resembling the ring structure from part c at the upper left.

$^{1}\mathrm{H}$ is particularly important in NMR experiments because of its high sensitivity and natural abundance. For macromolecules, $^{1}\mathrm{H}$ NMR spectra can become quite complicated. Even a small protein has hundreds of $^{1}\mathrm{H}$ atoms, typically resulting in a one-dimensional NMR spectrum too complex for analysis. Structural analysis of proteins became possible with the advent of two-dimensional NMR techniques (Fig. 4-31b, c, d). These methods allow measurement of distance-dependent coupling of nuclear spins in nearby atoms through space (the nuclear Overhauser effect, or NOE, measured in a method dubbed NOESY) or the coupling of nuclear spins in atoms connected by covalent bonds (total correlation spectroscopy, or TOCSY).

Translating a two-dimensional NMR spectrum into a complete three-dimensional structure can be a laborious process. The NOE signals provide some information about the distances between individual atoms, but for these distance constraints to be useful, the atoms giving rise to each signal must be identified. Complementary TOCSY experiments can help identify which NOE signals reflect atoms that are linked by covalent bonds. Certain patterns of NOE signals have been associated with secondary structures such as α helices. Genetic engineering (Chapter 9) can be used to prepare proteins that contain the rare isotopes $^{13}\mathrm{C}$ or $^{15}\mathrm{N}$. The new NMR signals produced by these atoms, and the coupling with $^{1}\mathrm{H}$ signals resulting from these substitutions, help in the assignment of individual $^{1}\mathrm{H}$ NOE signals. The process is also aided by a knowledge of the amino acid sequence of the polypeptide.
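Once an NOE signal has been assigned to a pair of atoms, its intensity can be turned into an approximate distance. The sketch below assumes the commonly used simplification that NOE intensity falls off as the inverse sixth power of the interproton distance, and it calibrates against a reference pair of known separation. The function name and the example numbers are hypothetical; real analyses usually convert intensities into loose upper bounds rather than exact distances.

```python
def noe_distance(intensity: float, ref_intensity: float, ref_distance: float) -> float:
    """Estimate an interproton distance from an NOE intensity.

    Assumes intensity is proportional to r**-6 (isolated spin pair), so that
    r = r_ref * (I_ref / I) ** (1/6). Distances are in the units of ref_distance.
    """
    if intensity <= 0 or ref_intensity <= 0:
        raise ValueError("intensities must be positive")
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


if __name__ == "__main__":
    # Hypothetical calibration: a reference pair 2.5 angstroms apart gives intensity 1.0.
    for observed in (1.0, 0.25, 1.0 / 64.0):
        r = noe_distance(observed, ref_intensity=1.0, ref_distance=2.5)
        print(f"relative intensity {observed:.4f} -> about {r:.2f} angstroms")
```

Because of the steep sixth-power dependence, a 64-fold drop in intensity corresponds to only a doubling of distance, which is why NOE signals report only on atoms that are close together in space.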

To generate a three-dimensional structure, researchers feed the distance constraints into a computer along with known geometric constraints such as chirality, van der Waals radii, and bond lengths and angles. The computer generates a family of closely related structures that represent the range of conformations consistent with the NOE distance constraints (Fig. 4-31d). The uncertainty in structures generated by NMR is in part a reflection of the molecular vibrations (known as breathing) within a protein structure in solution, discussed in more detail in Chapter 5. Normal experimental uncertainty can also play a role.
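The idea of a family of structures consistent with the constraints can be sketched as a simple filter. The code below is not a structure-calculation program: it only tests candidate models against upper-bound distance restraints and keeps those with no violations. Atom names, coordinates, and bounds are invented for illustration, and real software also enforces the geometric constraints listed above.

```python
import math


def violations(coords, restraints, tolerance=0.0):
    """Return the restraints a model breaks.

    coords: dict mapping atom name -> (x, y, z) in angstroms.
    restraints: iterable of (atom_a, atom_b, upper_bound) tuples.
    """
    broken = []
    for atom_a, atom_b, upper_bound in restraints:
        distance = math.dist(coords[atom_a], coords[atom_b])
        if distance > upper_bound + tolerance:
            broken.append((atom_a, atom_b, distance))
    return broken


def consistent_family(models, restraints, tolerance=0.0):
    """Keep only the candidate models that satisfy every restraint."""
    return [m for m in models if not violations(m, restraints, tolerance)]


if __name__ == "__main__":
    # Hypothetical NOE-derived upper bounds between three protons.
    restraints = [("HA", "HB", 3.5), ("HA", "HC", 5.0)]
    models = [
        {"HA": (0.0, 0.0, 0.0), "HB": (3.0, 0.0, 0.0), "HC": (0.0, 4.0, 0.0)},
        {"HA": (0.0, 0.0, 0.0), "HB": (3.2, 0.5, 0.0), "HC": (1.0, 4.2, 0.0)},
        {"HA": (0.0, 0.0, 0.0), "HB": (6.0, 0.0, 0.0), "HC": (0.0, 4.0, 0.0)},
    ]
    family = consistent_family(models, restraints)
    print(f"{len(family)} of {len(models)} candidate models satisfy all restraints")
```

Several distinct models can pass the same set of restraints, which is the origin of the bundle of closely related structures reported for an NMR structure.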

Thousands of Individual Molecules Are Used to Determine Structures by Cryo-Electron Microscopy

Our understanding of highly complex processes such as gene expression, mitochondrial respiration, or viral infection is aided immensely by knowing the detailed molecular structures of the proteins that participate in these processes. However, it is often difficult to determine the molecular structure of large, dynamic, macromolecular complexes that contain dozens of individual protein subunits. Moreover, integral membrane proteins often resist crystallization once they are removed from their lipid environment, making their structures difficult to solve by x-ray diffraction, and many are too large for NMR. In principle, discrete objects in the diameter range 100 to 300 Å can be visualized by electron microscopy (EM). In practice, the high intensity of the EM beam often damages the specimen before a high-resolution image can be obtained. In cryo-electron microscopy (cryo-EM), a sample containing many individual copies of the structure of interest is quick-frozen in vitreous (or noncrystalline) ice and kept frozen while being observed in two dimensions with the electron microscope, greatly reducing damage to the specimen by the electron beam.

Particles such as purified, multisubunit enzymes, arranged randomly on the microscope grid, are visualized with the cryo-electron microscope. When cryo-EM is combined with powerful algorithms for transforming the two-dimensional images of tens of thousands of individual, randomly oriented complexes into a three-dimensional composite, it is sometimes possible to determine molecular structures at a level comparable to that obtained by x-ray crystallography (Fig. 4-32). In favorable cases, the repetitive aspects (choice of objects to be included in the analysis, imaging of each object individually, and calculations to produce a three-dimensional structure from the huge number of two-dimensional images) can be automated. The EMDataResource (www.emdataresource.org) is a unified resource for accessing structure maps deposited into data banks and assigned Electron Microscopy Data Bank (EMDB) accession codes.
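One reason so many particle images are needed is that each individual image is extremely noisy, and averaging many of them recovers the underlying signal. The sketch below shows only that averaging step, on a made-up one-dimensional "image", under the strong simplifying assumption that all copies are already aligned in the same orientation. Determining orientations and building the three-dimensional map are not attempted here. All names and noise levels are hypothetical.

```python
import math
import random


def noisy_copy(image, noise_sd, rng):
    """Simulate one low-dose exposure: the true image plus Gaussian noise."""
    return [pixel + rng.gauss(0.0, noise_sd) for pixel in image]


def average_images(images):
    """Pixel-by-pixel mean of a stack of aligned images."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def rms_error(image, reference):
    """Root-mean-square difference between an image and the noise-free reference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference))


if __name__ == "__main__":
    rng = random.Random(0)                          # fixed seed for repeatability
    particle = [0.0] * 20 + [1.0] * 24 + [0.0] * 20  # toy one-dimensional projection
    for n_images in (1, 100, 10000):
        stack = [noisy_copy(particle, 2.0, rng) for _ in range(n_images)]
        error = rms_error(average_images(stack), particle)
        print(f"{n_images:>5} images averaged: rms error {error:.3f}")
```

For independent noise, the error of the average shrinks roughly as the square root of the number of images, so tens of thousands of particles are needed to push the noise well below the signal.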

A two-part figure, a and b, shows the structure of GroEL. Part a shows cryo-EM images of many particles and part b shows the reconstructed structure derived from analysis of the cryo-EM images. — FIGURE 4-32 Structure of the chaperone protein GroEL as determined by single-particle cryo-EM. (a) Cryo-EM images of many individual GroEL particles. (b) Side and top views of the three-dimensional structure derived from analysis of the EM images. [(b) Data from PDB ID 3E76, P. D. Kaiser et al., *Acta Crystallogr*. 65:967, 2009.]

Many novel structures have now been obtained by cryo-EM without models based on prior x-ray or NMR structures. Since cryo-EM relies on imaging of single molecules of a complex, this technique can also be used to computationally sort the imaged particles and simultaneously determine structures of multiple conformational states. Cryo-EM has now been used to solve the structures of some of the most dynamic and largest molecular complexes in the cell, such as the human telomerase enzyme (Fig. 4-33). Telomerase is an essential enzyme for maintaining chromosome integrity in humans (see Chapter 26) and is the target of significant medical research due to its roles in aging and cancer. Cryo-EM was critical for the laboratories of Eva Nogales and Kathleen Collins in determining the architecture of telomerase due to the heterogeneity of the complex and because only minute quantities could be purified from human cells — far too little for crystallization, but enough to observe single molecules by cryo-EM.
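The computational sorting of particles into conformational states can be illustrated with a toy two-group clustering. In this sketch each particle image is reduced to a single invented number (for example, an apparent distance between two domains), and a simple two-means procedure separates the particles into two groups. This is a stand-in chosen for brevity, not the classification method used in cryo-EM software, and the feature values are made up.

```python
def two_means(values, iterations=20):
    """Split scalar measurements into two groups by iterating group means.

    Returns (labels, centers), where labels[i] is 0 or 1 for values[i].
    """
    low, high = min(values), max(values)  # start the two centers at the extremes
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_low or not group_high:
            break  # everything fell into one group; nothing left to separate
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
    return labels, (low, high)


if __name__ == "__main__":
    # Hypothetical per-particle feature, in angstroms, from two conformational states.
    domain_distance = [41.0, 39.5, 40.2, 55.1, 54.3, 40.8, 56.0, 55.5]
    labels, centers = two_means(domain_distance)
    print("state assigned to each particle:", labels)
    print("mean feature value of each state:", centers)
```

After sorting, the images in each group would be processed separately, giving one three-dimensional structure per conformational state.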

A figure shows the cryo-EM structure of human telomerase as ribbon structures surrounded by color-coded outlines. — FIGURE 4-33 Cryo-EM structure of human telomerase. The structures of the RNA (green) and protein (ribbon representations) components of human telomerase are shown embedded in the calculated 10.2 Å EM density map. [Data from EMDB ID EMD-7521, T. Nguyen et al., *Nature* 557:190, 2018.]

SUMMARY 4.5 Determination of Protein and Biomolecular Structures

In x-ray crystallography, protein molecules are crystallized in well-ordered orientations that diffract x-rays. The patterns and intensities of the diffracted x-rays depend on the structure of the protein and its crystalline properties. Mathematical methods can then reconstruct the protein structure that produces a particular diffraction pattern.
NMR is often carried out on molecules in solution and yields information about atomic nuclei and their chemical environment. Protein structures can be computed from NMR data using hundreds of distance and geometric constraints obtained from multi-dimensional NMR experiments.
Biomolecules are frozen in vitreous ice for imaging by cryo-EM. The individual molecules are then identified and computationally sorted. The sorted two-dimensional images are then combined using computers to produce a three-dimensional structure.