It is easiest to understand the various ensemble types by starting with the ideal – the Boltzmann ensemble.
Boltzmann Ensemble. Statistical physics (a.k.a. statistical mechanics) tells us that protein conformations in solution will be distributed according to a suitable Boltzmann weighting. That is, a conformation (including protein plus solvent) denoted by x will occur with probability proportional to exp[-E(x) / (k_B T)], where T is the temperature and k_B is Boltzmann’s constant. E(x) is the energy (or enthalpy as appropriate).
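As a minimal sketch of the Boltzmann weighting, the following Python snippet converts a handful of energies into normalized probabilities. The function name, the use of kcal/mol units, and the three example energies are assumptions made for illustration; a real ensemble spans a continuous, very high-dimensional conformation space rather than a short list of states.

```python
import math

# Boltzmann constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
K_B_KCAL_PER_MOL_K = 0.0019872041

def boltzmann_probabilities(energies, temperature=298.0):
    """Return normalized Boltzmann probabilities for a list of energies in kcal/mol."""
    beta = 1.0 / (K_B_KCAL_PER_MOL_K * temperature)
    e_min = min(energies)  # shifting by the minimum avoids overflow and leaves ratios unchanged
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # Hypothetical energies for three conformations, in kcal/mol.
    example_energies = [0.0, 0.5, 2.0]
    for e, p in zip(example_energies, boltzmann_probabilities(example_energies)):
        print(f"E = {e:4.1f} kcal/mol  ->  p = {p:.3f}")
```

Only energy differences matter here: adding a constant to every energy leaves the probabilities unchanged, which is why the minimum can be subtracted safely.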
But how many structures should be in a Boltzmann ensemble? This seems ambiguous at first glance. And the statistically correct answer also seems strange: any number you like! That is, you can have a Boltzmann ensemble of 10 structures or 10,000; both are equally valid. To see why, think of a simple scalar variable distributed according to a Gaussian distribution. It seems clear that one could generate 10 Gaussian-distributed numbers or 10,000. With only 10 numbers, of course, more often than not none of the ten would lie more than two standard deviations from the mean (this is the case in roughly 63% of sets of 10 Gaussian-distributed numbers). With 10,000 numbers, on the other hand, one would expect a few dozen beyond even three standard deviations. In other words, the size of the ensemble just tells you how far out in the “tails” you are likely to probe. For the Boltzmann ensemble of a protein, this corresponds to the high-energy tail of the distribution.
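To put rough numbers on the Gaussian analogy, the sketch below uses the standard two-sided tail probability of a normal distribution, P(|z| > k) = erfc(k / sqrt(2)), to ask how likely a set of n independent samples is to reach beyond k standard deviations. The function names are ours and purely illustrative.

```python
import math

def prob_beyond(k):
    """Probability that one Gaussian sample lies more than k standard deviations from the mean."""
    return math.erfc(k / math.sqrt(2.0))

def prob_at_least_one_beyond(k, n):
    """Probability that at least one of n independent Gaussian samples lies beyond k sigma."""
    return 1.0 - (1.0 - prob_beyond(k)) ** n

def expected_count_beyond(k, n):
    """Expected number of samples, out of n, that lie beyond k sigma."""
    return n * prob_beyond(k)

if __name__ == "__main__":
    for n in (10, 10_000):
        for k in (2, 3):
            print(
                f"n = {n:6d}, k = {k}: "
                f"P(at least one) = {prob_at_least_one_beyond(k, n):.3f}, "
                f"expected count = {expected_count_beyond(k, n):.2f}"
            )
```

Both ensemble sizes are drawn from the same distribution; the larger one simply contains more of the rare, far-from-the-mean members.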
Since we do not know the true value of the energy or enthalpy for a given conformation (i.e., what nature would measure, if she could!), we use a “model.” In particular, we use commonly available forcefields, along with implicit/continuum solvent models. Implicit solvent models do not fully account for the molecular structure of water, but are much easier to work with than fully explicit models. A key point is that it is a considerable computational challenge (never met, to our knowledge) to generate well-sampled, Boltzmann-distributed protein structures even with an implicit solvent model. We will always specify the forcefield and solvent model used to generate an ensemble.
Approximate Ensembles. By “approximate” we mean that an ensemble is not Boltzmann-distributed. However, as described below, the ensemble is generated in a physically reasonable way, so it should resemble a Boltzmann ensemble. We cannot, however, quantify the degree of resemblance. In other words, structures which might commonly occur in our ad hoc ensembles might not be common in a Boltzmann ensemble, and vice versa.
Our ad hoc ensembles are generated starting from coarse-grained models, with atomic detail added using existing software packages (developed by others). Our group has devoted considerable effort to developing rapid and physically reasonable alpha-carbon models of proteins (e.g., [J. Phys. Chem. B, 108:5127-5137, 2004]). For the initial ensembles in the EPDB, we have used two such models: one with rigid peptide planes and one with flexible planes. Both employ simple Gō-model interactions. These models possess a number of attractive features: (i) they can be simulated rapidly, with demonstrable convergence [publication in preparation]; (ii) they exhibit large fluctuations about the native state, including the expected partial unfolding events; and (iii) because of the realistic peptide-plane geometry, atomic detail is readily added. Atomic detail is added using one of the following: the RAPPER program [DePristo et al.], MaxSprout [Sander & Holm], or SCCOMP [Eyal et al.].
Even though our coarse-grained structures may be distributed according to some well-defined energy function, this does not mean the atomically detailed structures will be. To see this, imagine a protein with two basic states. Perhaps in the coarse model, these states are occupied 80% and 20%. These percentages will be roughly maintained when atomic detail is added to the coarse structures, so the resulting atomistic ensemble represents only a rough guess at the true populations. (Perhaps the correct atomic percentages are 40% and 60%.) Really, the states need to be properly reweighted using the atomic forcefield. One rigorous means for doing so is our ‘Resolution Exchange’ approach [Phys. Rev. Lett., 96:028105, 2006; J. Chem. Theory Comp., 2:656-666, 2006]. Another is "Black-box Reweighting."
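The idea of reweighting can be illustrated with plain importance reweighting, which is a generic textbook technique and is not the Resolution Exchange or reweighting method cited above. The sketch assumes each structure was drawn from the Boltzmann distribution of the coarse model and that both a coarse energy and an atomic energy are available for it; the function name, the tuple layout, and the example energies are hypothetical. Each structure receives a weight proportional to exp[-(E_atomic - E_coarse) / (k_B T)].

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption

def reweighted_populations(samples, temperature=298.0):
    """Estimate state populations under the atomic model by importance reweighting.

    samples: list of (state_label, e_coarse, e_atomic) tuples, energies in kcal/mol,
    assumed to be drawn from the coarse model's Boltzmann distribution.
    """
    beta = 1.0 / (K_B * temperature)
    log_weights = [-beta * (e_atomic - e_coarse) for _, e_coarse, e_atomic in samples]
    shift = max(log_weights)  # subtract the largest log-weight for numerical stability
    totals = {}
    for (state, _, _), lw in zip(samples, log_weights):
        totals[state] = totals.get(state, 0.0) + math.exp(lw - shift)
    norm = sum(totals.values())
    return {state: total / norm for state, total in totals.items()}

if __name__ == "__main__":
    kT = K_B * 298.0
    # Hypothetical ensemble: 8 structures in state A and 2 in state B (80% / 20%).
    # The atomic energy penalty on A is chosen so the reweighted split is 40% / 60%.
    samples = [("A", 0.0, kT * math.log(6.0))] * 8 + [("B", 0.0, 0.0)] * 2
    for state, pop in reweighted_populations(samples).items():
        print(f"state {state}: {pop:.2f}")
```

For real proteins this naive scheme tends to fail because a handful of structures dominate the weights, which is one reason more elaborate methods are needed.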
Approximate Path Ensembles. Approximate path ensembles consist of all-atom configurations connecting the two indicated PDB structures. Paths were generated based on coarse-grained simulations. In the case of lymphotactin, the coarse-grained simulations consisted of in-house library-based Monte Carlo with a fully atomistic backbone, whereas in the case of calmodulin, the coarse-grained simulations included only backbone alpha carbons. In both cases, the MaxSprout software package was used to build full atomic detail from the alpha-carbon trace.
In the case of coarse-grained simulations containing a fully atomistic backbone (lymphotactin), the alpha-carbon positions were used as input to MaxSprout.
Why not molecular dynamics trajectories? Molecular dynamics (MD) trajectories are notorious for their undersampling. That is, MD simulations (typically < 100 nsec) fall orders of magnitude short of sampling the most biologically important motions, which occur on timescales ranging from μsec to seconds. Moreover, it is highly questionable whether a typical MD simulation will correctly ascertain the relative populations of the states it does manage to visit [Biophys. J. 91:164-172 (2006)]. For this purpose, conventional MD is hopeless. Even more advanced sampling approaches, like replica exchange, are of uncertain value for large biomolecular systems [Lyman & Zuckerman, J. Chem. Theory Comp., 2:1200-1202 (2006); see also correction at www.ccbb.pitt.edu/Zuckerman/].
Statistical Rotamer Libraries. Our side-chain libraries may be characterized as statistical rotamer libraries. We include more than 1000 configurations of every side chain, and these are distributed according to the Boltzmann factor of the indicated forcefield. These are not energy-minimized configurations, but structures representative of an equilibrium ensemble. Initial libraries will not distinguish among traditional rotamers (i.e., among torsional states defined by the chi angles). Future libraries will indeed include rotamer classification to permit the user to employ any chosen distribution of rotameric states.
The ensembles presented here are of high statistical quality. We have checked that all configurations are statistically independent – i.e., without correlations. Further details on statistical analysis may be found in papers by Lyman and Zuckerman [Biophys. J., 2006 and J. Phys. Chem. B, 2007] as well as in forthcoming work by Zhang, Mamonov, and Zuckerman [expected 2008].
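One simple, generic way to look for correlations between successive configurations is to compute the autocorrelation of some scalar observable (for example, a chi angle) along the sequence of saved structures. The sketch below is only an illustration of that idea and is not the analysis described in the cited papers; the function name and the test sequences are hypothetical. Values near zero at every lag are consistent with independent samples, although a single observable cannot prove independence.

```python
def autocorrelation(values, lag):
    """Normalized autocorrelation of a sequence of scalar observables at a given lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0.0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance

if __name__ == "__main__":
    # A slowly drifting sequence is strongly correlated at lag 1.
    drifting = [float(i) for i in range(100)]
    print(f"drifting sequence, lag 1: {autocorrelation(drifting, 1):.3f}")
    # A strictly alternating sequence is perfectly anti-correlated at lag 1.
    alternating = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
    print(f"alternating sequence, lag 1: {autocorrelation(alternating, 1):.3f}")
```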
Technical details. As noted, the configurations are distributed according to the Boltzmann factor of the indicated forcefield, in conjunction with the solvent model. The ensemble is sampled independently of the protein backbone, and hence no correlations with the backbone are accounted for. Calpha and Cbeta atoms were included as anchors for calculating internal coordinates of side-chain atoms. The valence of Calpha was set to one and its charge was set to 0. The distance between the Calpha and Cbeta atoms was constrained to the equilibrium value found in the force field. The analytical GB/SA model of Still [J. Phys. Chem. A, 101:3005-3014, 1997] was used as implemented in the Tinker package.
Molecular Fragment Libraries. These are all-atom Boltzmann-factor ensembles, distributed according to the indicated forcefield and solvent model at a temperature of 298 K. Such fragments are useful both for equilibrium sampling [to be described in a forthcoming paper] and for free energy calculations. The necessary degrees of freedom for linking one fragment to another (i.e., specifying the relative distance and orientation) are embodied in the coordinates of dummy atoms. All configurations are statistically independent.
