Homology modeling created a full-atom model of human blood complement Factor D based on serine protease structures deposited in the Brookhaven Protein Databank (PDB) (Bernstein et al, 1977). A manual model built using techniques based on a presentation by Greer (1988) gave a preliminary solution of the crystal structure via molecular replacement. The crystallographic refinement of Factor D to 2.0 A based on MIR data is described by Narayana et al (1994). Model phases were used only to locate heavy atom sites. This refined crystal structure is of high quality; the coordinates are thus assumed to be correct and provide the basis for the comparisons that follow. Factor D a single polypeptide chain of 228 residues which crystallizes in the triclinic space group P1 with two independent molecules per asymmetric unit. The final coordinates of the two independent Factor D monomers are referred to as ``FDA'' and ``FDB'', for the A and B chains.
An automated model was created by the Protein Design module of Quanta (Polygen Molecular Simulations, Inc., 200 Fifth Avenue, Waltham, MA 02254) for comparison. The initial homology models are referred to as ``FDM'' for the manually (M) constructed and ``FDQ'' for the Quanta (Q) generated coordinates. Two copies (A,B) of these models were placed in the unit cell and subjected to simulated annealing (SA) refinement (Brunger et al, 1987) with X-PLOR (X), followed by individual atomic temperature factor refinement against the 2.0 A data set of 23249 reflections. This produced the models FDMAX, FDMBX, FDQAX, and FDQBX for the two independent monomers refined from each starting homology model. The original molecular replacement model was subjected to SA refinement as new native data sets became available. The monomers in this model, subjected to now five iterations (I), are referred to as FDIAX and FDIBX.
The modeling and refinement procedures that created the six model coordinate sets above are fully documented in the accompanying paper (Carson et al, 1994). Each partially refined model set (FDMX, FDQX, FDIX) has roughly the same deviation between sets as deviation from the crystal structure. The average root-mean-square (rms) deviations of these sets from the crystal structure is 1.2 A for mainchain and 2.8 A for sidechain atoms. However, these deviations are far from uniformly distributed. About 3/4 of the mainchain and 1/2 of the sidechains are considered to be correct within experimental error. About 1/7 of the mainchain has deviations greater than 1.0 A (see Tables 5 and 6 of the accompanying paper). Most of these gross errors are localized in key substrate binding loops.
There are some significant conformational differences between the independent monomers in the crystal structure. About half of the gross errors in the models occur in these regions. We adopted the criterion that a model agrees if the rms deviation limit is within twice the error suggested by the Luzzati plots (see accompanying paper). We take the portions where the superimposed FDA and FDB models agree within this limit to represent a restricted set of data with highest confidence. We also use the entire crystal structure for error analysis. Analysis of the crystal structure with the methodology developed here reveals that the FDB monomer fits the experimental data better, in particular the residue range from 41 to 48. Comparisons with only the FDB subunit provide another test set of data.
A variety of criteria of structural correctness suggested by other authors were applied to the models above. The large number of models allows for some statistical analysis.
The standard crystallographic R-factor (Sum(Fobs - Fcalc)/Sum(Fobs)) is a widely used indication of structure quality. Luzzati (1952) proposed a method for estimating coordinate errors by analysis of plots of R-factor as a function of resolution, assuming all errors are explained by the error in coordinate positions. Both methods give a single overall value for the structure.
It has been noted that R-factors alone are a rather poor indicator of structure quality (Branden and Jones, 1990). It is generally conceded that Luzzati plots underestimate the coordinate error; alternative methods have been proposed (Read, 1986).
The real-space fit residual (Jones et al, 1991) measures agreement between an electron density map and the atomic model. For each residue, all or a subset of the atoms are converted into a ``map'' by treating each atom as a gaussian density function with a given temperature factor. The model map is compared to the observed map over the gridpoints around the atoms. This R-value ranges from 0.0 (good) to 1.0 and is generally about 0.25 for correct structures, but may exceed 0.50 for flexible side chains extending into the solvent. Incorrectly folded structures are identified by runs of residues with high R-factors.
The real-space fit is usually evaluated with maps having Fourier coefficients of 2Fobs-Fcalc and phases calculated from the final model. This evaluation criterion is referred to as rsr. This R-factor may also be calculated from OMITMAPS (Bhat and Cohen, 1984) in an attempt to reduce the model bias. This evaluation criterion is referred to as orsr. These rsr methods where calculated with the program O (Jones et al, 1991).
A complementary approach is to calculate the R-factor with a sliding ``window'' of omitted residues (Rao, 1991). Each five successive residues are omitted from the model for structure factor calculations. If a given region has significant errors, the R-factor will improve with the window omitted.
An X-PLOR script was written to execute this procedure. An overall FFT scale factor and R-factor are determined for use in all subsequent calculations. The R-factor is then calculated for each omitted window of residues. The difference from the overall R-factor is taken and normalized by the mass of the atoms in the window. This index is then negated, so that poor regions have high values. This evaluation criterion is referred to as rwin.
The windowing method can also be applied to a test set of reflections, which have been omitted from the SA refinement using the free R value protocol recently advocated by Brunger (1992b). This evaluation criterion is referred to as rfree.
Experience has shown that correctly positioned model coordinates will not shift significantly, even upon heating to 4000 K with the X-PLOR SA protocol. This suggests that coordinate shifts during a refinement cycle may be an appropriate error function. An X-PLOR script is used to compute rms differences in atomic coordinates (A) between two models on a per residue basis, taking into account the symmetry of Asp, Glu, Tyr, and Phe residues. This is a useful tool for describing the differences between two structures. This evaluation criterion is referred to as rms.
It is general knowledge that higher atomic temperature factors (B-factors) correlate with exposed surface area, and that high B-factors may indicate error, as well as their intended purpose to model thermal disorder. An X-PLOR script is used to compute rms values of the B-factors (A**2) on a per residue basis. This evaluation criterion is referred to as bf. Another script determines the accessible surface area (A**2) (Lee and Richards, 1971). This evaluation criterion is referred to as surf.
A model can be evaluated by examining how well it conforms to expected torsion angle values. The freely rotatable dihedral angles of the protein backbone (phi/psi) which define the secondary structure are known to have favored values. This is explained on energetic grounds from the study of model peptides (Ramachandran et al, 1963), and leads to the popular representation commonly known as the Ramachandran plot.
The freely rotatable dihedral angles of protein sidechains (chis) are also expected to have favored values on energetic grounds (e.g., staggered versus eclipsed conformations). An investigation of highly refined structures (with no dihedral constraints applied) from the PDB has established a ``rotamer library'' of sidechain conformations in proteins (Ponder and Richards, 1987).
We used the rotamer library to select the ``best'' structures from the PDB for a subsequent analysis. A structure was selected if 80% of the side chains with dihedrals matched one of the preferred rotamers. Of the slightly more than 1000 independent protein chains in the PDB at the time, 382 were selected. Using this set of protein structures, phi/psi and chi1/chi2 values were mapped to 36 by 36 element (10 degree interval) grids creating distributions for each amino acid type. These grids give empirical Ramachandran-like plots, presented in Figure~1.
These distributions are used to assign a single value for mainchain or sidechain dihedrals. The grids are converted to standard deviation units. The value for an observed dihedral is determined from bi-linear interpolation of the grid. To plot these single dihedral values, we adopt the following convention: if the grid value is 0.0 (no observations), it is set to -1.0; if the grid value is greater than 1.0, it is clamped to 1.0. Finally, the dihedral value is set to 1.0 minus the grid value. This dihedral value then has a range from 0.0 (very good) to 2.0. This gives a mock potential energy function with steep walls for the disallowed regions. For non-glycine mainchain dihedrals, lookup is based on the combined table. For glycine, the best value of the combined table and the glycine table is chosen. For sidechain dihedrals, there is a table for each amino acid. This evaluation criterion is referred to as dihe.
Jones and Thirup (1986) have shown that proteins may be built from fragments of unrelated proteins, based on initial C-alpha postions. A preliminary presentation of the program O (Jones et al, 1989) outlined several methods for locating potential errors in a model. These ideas were implemented into Atom, our local version of FRODO (Jones, 1978). Each run of five consecutive C-alpha are fit against a library of 62 highly refined protein structures. The top ten matches, based on rms deviations of C-alpha positions, are saved. The rms of the ten deviations is computed. A low value is generally obtained for common secondary structure types. A high value indicates an unusual conformation or possible error in the structure. In a similar fashion, the position of the carbonyl oxygen at the middle of the five residue run is compared with the corresponding oxygen in the ten best fits. A deviation of over 2.0 A suggests that this peptide plane is incorrectly modeled and the carbonyl should be flipped. The evaluation criterion based on the structural database is referred to as db.
In a classic study (Novotny et al, 1984), proteins were deliberately misfolded and energy minimized. One could not distinguish the correct structure from the incorrect structure from the potential energy functions alone. Thus the commonly used potential energy functions would not be a sufficient test of correctness.
Eisenberg and co-workers (Bowie et al, 1991; Luthy et al, 1992) recently published a method to identify misfolded structures based on ``3D-profiles.'' These are computed from observed statistics involving secondary structure and solvent-accessible surface of the side chains, obtained from analysis of structures in the PDB. A value is computed for each residue, then a smoothing window of 21 residues applied. This score averages about 0.4 for correctly folded structures, but will fall below zero in incorrect structures. This evaluation criterion is referred to as 3D.
In PROLSQ (Hendrickson, 1980) and X-PLOR, the minimization function seeks to balance the contributions from the structure factors and the ideal geometry (PROLSQ) or potential energy (X-PLOR). The average deviation in bonds or angles for the entire structure is often given as an indicator of quality. Brunger has proposed that the geometric strain energy in a residue correlates with the error in the structure. This evaluation criterion is referred to as geom.
We take the refined crystal structure to be correct. Any difference between a model structure and the crystal coordinates is regarded as error. These errors are tabulated on a per residue basis, using all atoms, mainchain only, or sidechain only. The residues are partitioned into two sets: the entire molecule and a restricted set of highest confidence (see Models). Two measures of the error function are explored, denoted as rms and del-dihe.
The rms difference in coordinates between the crystal structure and a given model is determined as previously described. The difference between two structures may also be expressed in terms of dihedral angles. The differences in phi/psi or chi1/chi2 pairs are computed as euclidian distances in radians. For example, the mainchain del-dihe is: sqrt( Delta-phi**2 + Delta-psi**2 ).
The relationships between the error functions and the error evaluation criteria are monitored with the standard correlation coefficient. For the data sets a and b: correl(a,b) = <ab-<a><b>>/ sqrt(<a**2-<a>**2><b**2-<b**2>).
A general linear model was constructed to assess the independence of the
various criteria. The error model has the form:
Error = a0 + a1*Criterion1 + a2*Criterion2 + ... + aN*CriterionN
The a(i) are the scalar coefficients.