[Figure: ARB predictions, Peters et al. 2006]

When I was talking about Microsoft’s epitope prediction software, and when I discussed Kotturi’s update on LCMV epitopes, I made the point that predicting MHC class I epitopes is hard. How come it’s so hard?

First let’s define the question. MHC class I, the target ligand for cytotoxic T lymphocyte recognition, binds peptides of about 9 amino acids. These peptides are generated during proteolysis within the cytosol of the target cells.[1] CTL recognize those peptides that are derived from abnormal proteins (viral or tumour, for example), while ignoring those that come from normal cellular proteins (“self”). An average virus might encode, let’s say, 10,000 amino acids,[2] so there are 10,000 or so overlapping 9mers. Out of that potential ocean of peptides, there might be 10 or 20 that CTL see at all, and of those only two or three are going to be good (“immunodominant”) epitopes.
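To make the "10,000 or so overlapping 9mers" concrete, here is a minimal sketch of the sliding-window count. The function name and the 12-residue fragment are made up for illustration; the only real point is that a sequence of length N yields N - 9 + 1 windows.

```python
def overlapping_kmers(sequence: str, k: int = 9) -> list[str]:
    """Return every overlapping window of length k, in sequence order."""
    # range() is empty when the sequence is shorter than k, so we get [].
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 12-residue fragment, for illustration only.
fragment = "MKTAYIAKQRQI"
peptides = overlapping_kmers(fragment)

print(len(peptides))              # 12 - 9 + 1 = 4 windows
print(peptides[0], peptides[-1])  # first and last 9mer

# A single 10,000-residue sequence would give 10,000 - 9 + 1 windows.
print(10_000 - 9 + 1)
```

A real virus encodes several separate proteins, and each one loses 8 windows at its end, so the true count is a little under the total number of residues. That doesn't change the order of magnitude.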

So the question is: Given the sequence of amino acids encoded by a virus, can we point to the particular 9mers that CTL will react to?

To get an accurate answer, you’d need to do exhaustive scanning of all possible viral epitopes. This hasn’t been done much, but Kotturi et al.[3] did it and compared their findings to epitope prediction. Twenty-five of 160 predicted epitopes were real (16%), and their predictions missed three of the 28 real epitopes altogether (11%).[4]
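The two percentages are just ratios of the counts quoted above. A quick check, with variable names that are mine and not the paper's:

```python
predicted = 160    # predictions credited to the method (see footnote 4)
confirmed = 25     # predicted peptides that turned out to be real epitopes
total_real = 28    # real epitopes found by the exhaustive scan
missed = 3         # real epitopes that the predictions did not include

hit_rate = confirmed / predicted   # fraction of predictions that were real
miss_rate = missed / total_real    # fraction of real epitopes that were missed

print(f"{hit_rate:.0%} of predictions were real epitopes")
print(f"{miss_rate:.0%} of real epitopes were missed")
```

Put the other way around, roughly five out of every six predicted peptides were not epitopes.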

The two granddaddies of epitope prediction are BIMAS and SYFPEITHI. Kotturi used, I am pretty sure, either ARB MATRIX[5] or something very close to it. (The figure at the top here is from Peters et al., Figure 2A: ARB Predictions for HLA-A*0201.) A more recent paper[6] claims that pooling together multiple predictive methods gives higher accuracy than individual methods alone, but this isn’t available online:

The authors have elected not to make the HBM available online, for two reasons: first, frequent server outages and other problems with individual web-based tools often prevent acquisition of all the requisite scores. Automatic operation is therefore not possible. Second, the querying of all the web-based tools can take a long time, making the tool inconvenient for real-time web-based access. Interested researchers may, however, contact the authors regarding obtaining the scripts implementing the HBM.

There’s also the Microsoft tool I mentioned previously, as well as a bunch of other tools — the Trost and Peters papers both compare many of them.
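To give a feel for what "pooling multiple predictive methods" can mean, here is a generic rank-averaging sketch. This is not the HBM from Trost et al. (their scripts aren't public, as quoted above); it is just one simple way to combine tools whose scores are on different scales. The tool names, peptide labels, and scores are all hypothetical, and the sketch assumes a higher score always means a better predicted binder.

```python
from statistics import mean


def rank_peptides(scores: dict[str, float]) -> dict[str, int]:
    """Rank peptides under one tool (1 = best), assuming higher score = better."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {peptide: rank for rank, peptide in enumerate(ordered, start=1)}


def pooled_ranking(tool_scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average each peptide's rank across all tools and sort best-first."""
    per_tool_ranks = [rank_peptides(scores) for scores in tool_scores.values()]
    peptides = per_tool_ranks[0].keys()
    averaged = {p: mean(ranks[p] for ranks in per_tool_ranks) for p in peptides}
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical tools and scores; each tool uses its own arbitrary scale.
tool_scores = {
    "tool_a": {"PEPTIDE_1": 0.9, "PEPTIDE_2": 0.4, "PEPTIDE_3": 0.7},
    "tool_b": {"PEPTIDE_1": 12.0, "PEPTIDE_2": 30.0, "PEPTIDE_3": 25.0},
    "tool_c": {"PEPTIDE_1": 0.8, "PEPTIDE_2": 0.2, "PEPTIDE_3": 0.6},
}

for peptide, average_rank in pooled_ranking(tool_scores):
    print(peptide, round(average_rank, 2))
```

Working with ranks rather than raw scores sidesteps the problem that each tool reports something different (half-lives, log-odds, arbitrary units), at the cost of throwing away how confident each tool was.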

I haven’t tested these myself, even to the extent of comparing predictions to database results (a crude measure).[7] So as far as I know (with the caveat that I haven’t followed this with rabid attention), the result that Kotturi et al. got (16% of predictions confirmed, 11% of real epitopes missed) is just about as good as anyone has done. The ranking of tools in Trost et al. shows ARB MATRIX, as used by Kotturi et al., to be only slightly worse than the pooled prediction tool they describe, so I wouldn’t expect much better results from other technologies. But still, some 15 years after MHC class I motifs were described, and with the pathway at least reasonably well understood, 16% and 11% isn’t all that great. Why can’t we just point to the epitopes?

Here are the components of the pathway that need to be taken into account to successfully predict a CTL epitope:

  1. Protein expression. Is there enough of the precursor protein available to yield enough epitope?
  2. Proteasome cleavage. The proteasome has to cut precisely at the carboxy terminus of the epitope, though there’s a little room for error at the amino terminus. Also, the proteasome must not cleave in the middle of the potential epitope.
  3. Peptidase destruction. The epitope has to survive destruction by a bunch of very active peptidases in the cytosol.
  4. Transport into the ER. The TAP peptide transporter that carries peptides across the ER membrane has clear sequence preferences.
  5. Trimming and destruction by ER peptidases. If the TAP-transported peptide is too long, can it be converted into the right form? If the mature epitope is there, will it be destroyed?
  6. Transport out of the ER. There’s a system that pumps peptides out of the ER, but little if anything is known about it. Perhaps it’s just diffusion out of the Sec61 channel, or maybe it’s ERAD-related, or who knows what else.
  7. Binding to the MHC complex in the ER.
  8. Stimulating CTL. There’s a whole complicated set of interactions in that, too, but I’ll summarize it as a single step.
  9. Mystery factors that we don’t understand.
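One way to see why the list above makes prediction hard is to treat the steps as a chain of filters. The sketch below assumes, purely for illustration, that the steps are independent and that each can be given a pass probability; neither assumption comes from the literature, and the numbers are invented.

```python
from math import prod


def overall_survival(step_probabilities: list[float]) -> float:
    """Chance that a peptide passes every step, assuming the steps are independent."""
    return prod(step_probabilities)


# Hypothetical: eight steps, each letting through half of the candidates.
even_odds = [0.5] * 8
surviving_fraction = overall_survival(even_odds)

print(surviving_fraction)            # 0.5 ** 8
print(10_000 * surviving_fraction)   # expected survivors out of 10,000 candidate 9mers
```

The flip side is that prediction errors compound in the same way: a predictor that is only moderately accurate at each step ends up poor over the whole chain, and a step with no usable model at all caps how good the final answer can be.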

Of those 9 steps, I’d say that only one (TAP transport) is reasonably well defined as far as sequence requirements. Peptide binding to MHC class I is the next-best understood, though it’s not as simple as some people think. Protein expression level should be relatively easy, but it’s still not clear whether we need to look at total expression or levels of defective ribosomal products, or what. Predicting cleavage by the proteasome has been the subject of a lot of work, but it’s turned out to be a really difficult task; even the best algorithms are not, I think, very accurate. And I think there’s very little clue about most of the other factors.
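For the MHC-binding step, matrix-style predictors score a peptide by adding up a per-position, per-residue contribution. The sketch below shows only that general idea. The matrix is invented and nearly empty: the favoured residues at positions 2 and 9 loosely follow the commonly described HLA-A*0201 anchor motif, but the numeric values are hypothetical and are not taken from ARB, BIMAS, or SYFPEITHI.

```python
# Hypothetical scores: position (1-based) -> residue -> contribution.
# Any residue or position not listed contributes 0.0.
TOY_MATRIX: dict[int, dict[str, float]] = {
    2: {"L": 2.0, "M": 1.5},   # anchor position near the amino terminus
    9: {"V": 2.0, "L": 1.5},   # anchor position at the carboxy terminus
}


def matrix_score(peptide: str, matrix: dict[int, dict[str, float]] = TOY_MATRIX) -> float:
    """Sum the per-position contributions for a 9mer."""
    if len(peptide) != 9:
        raise ValueError("this toy matrix only scores 9mers")
    return sum(
        matrix.get(position, {}).get(residue, 0.0)
        for position, residue in enumerate(peptide, start=1)
    )


# Made-up peptides: one with both anchors, one with neither.
print(matrix_score("ALAAAAAAV"))  # both anchors present
print(matrix_score("AAAAAAAAA"))  # no anchors
```

A real matrix has an entry for all 20 amino acids at every position, and the additive form assumes positions contribute independently, which is one reason binding is "not as simple as some people think".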

I’ll talk more about each of the steps in other posts.


  1. I’ve said and typed that phrase so often that I’m pretty much on autopilot with it.
  2. Divide by ten, multiply by ten, doesn’t much change the conclusion.
  3. Kotturi, M. F., Peters, B., Buendia-Laysa, F. J., Sidney, J., Oseroff, C., Botten, J., et al. (2007). The CD8+ T-cell response to lymphocytic choriomeningitis virus involves the L antigen: uncovering new tricks for an old virus. J Virol, 81(10), 4928-4940.
  4. The quality of the predictions was not good, either, in that many of the strongly predicted epitopes only stimulated a very few CTL. As well, I’m being a little generous in granting them just 160 predictions; that’s the number they came up with post hoc as what they would have needed. In fact they tested 400 predicted epitopes.
  5. Peters, B., Bui, H. H., Frankild, S., Nielson, M., Lundegaard, C., Kostem, E., et al. (2006). A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol, 2(6), e65.
  6. Trost, B., Bickis, M., & Kusalik, A. (2007). Strength in numbers: achieving greater accuracy in MHC-I binding prediction by combining the results from multiple prediction tools. Immunome Res, 3, 5.
  7. I’ll run some trials when I have time.