Microsoft has released some open-source code for analysis of antiviral immunity (http://atom.research.microsoft.com/bio/). They offer four tools: PhyloD, Epitope Predictor, HLA Completion, and HLA Assignment. The first two are particularly interesting to me.

PhyloD is “a statistical tool that can identify HIV mutations that defeat the function of the HLA proteins in certain patients, thereby allowing the virus to escape elimination by the immune system. By applying this tool to large studies of infected patients, researchers are now able to start decoding the complex rules that govern the HIV mutations, in the hope of one day creating a vaccine to which the virus is unable to develop resistance.”

The reference is to Bhattacharya et al., Science 16 March 2007: Vol. 315, no. 5818, pp. 1583–1586. It arises directly out of Bruce Walker’s (and others’, but mostly Walker’s) work on HIV immune escape variants, which dates back to the late 1990s. I want to talk about immune escape in HIV some time, but that’s going to be a long post and I have a grant due, so I’m just going to move on to the second interesting tool, the Epitope Predictor.

“This tool computes the probability that a given kmer is a T-cell epitope restricted to a given HLA allele”; the reference is Heckerman et al., RECOMB 2006, which I haven’t read yet.

This is interesting to me because it’s something I’m working on directly as well. Epitope prediction is a remarkably difficult job to do well — it’s easy to take a first pass and drastically narrow down your possibilities, but getting an accurate end product is hard.

Epitopes, in this case, are sequences of amino acids that are cut out of the full-length protein and recognized by the T cells. A full-length protein might be 500 or 1000 or more amino acids long, whereas epitopes are typically 9 amino acids long. A generic virus, say HIV, will have thousands, tens of thousands, of peptides of the appropriate length. There are moderate constraints on what can be turned into epitopes, because the peptides have to bind to HLA molecules. (HLA, human leukocyte antigen, is the species-specific term for MHC, major histocompatibility complex. I tend to use MHC, but to avoid, or at least reduce, confusion, I’ll stick to HLA here.) HLA molecules have binding rules: “Anchor” positions of the peptide must fit certain patterns. For example, a peptide that binds to one particular human MHC allele (HLA-A3) will usually have a leucine, valine, or methionine at position 2, a lysine, tyrosine, or phenylalanine at the last position, and is fairly likely to have one of two amino acids at position 3, one of five at position 6, and one of four at position 7. So still fairly broad, but much narrower than the 20 to the 9th possibilities with no restrictions at all.
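To make the anchor idea concrete, here is a minimal Python sketch that slides a 9-residue window along a protein and keeps only the peptides that fit the two HLA-A3 anchor rules named above (position 2 and the last position). The function name and the toy sequence are made up for illustration, and I am deliberately ignoring the weaker preferences at positions 3, 6, and 7. Real predictors score every position rather than applying a hard yes/no filter like this, so treat it as a sketch of the idea, not of any particular tool.

```python
# Hypothetical sketch: hard anchor filter for 9-mers, using only the HLA-A3
# anchor residues named in the text (position 2 and the C-terminal position).
P2_ANCHORS = set("LVM")      # leucine, valine, methionine
CTERM_ANCHORS = set("KYF")   # lysine, tyrosine, phenylalanine


def candidate_peptides(protein, length=9):
    """Return (start, peptide) pairs whose two anchors fit the motif.

    start is the 1-based position of the peptide within the protein.
    """
    hits = []
    for i in range(len(protein) - length + 1):
        peptide = protein[i:i + length]
        if peptide[1] in P2_ANCHORS and peptide[-1] in CTERM_ANCHORS:
            hits.append((i + 1, peptide))
    return hits


# Toy sequence invented for the example (not a real viral protein).
toy = "SAVLKDEGTYRFM"
print(candidate_peptides(toy))

# Fraction of all possible 9-mers that pass the two-anchor filter:
# 3 allowed residues out of 20 at each of the two anchor positions.
fraction = (3 / 20) * (3 / 20)
print(f"about {fraction:.4f} of the {20 ** 9:,} possible 9-mers pass")
```

Even this crude two-position filter throws away roughly 98% of random 9-mers, which is the "easy first pass" part of the job. The hard part is ranking what is left.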

Humans, like almost all vertebrates, are wildly complex at the MHC genes: you don’t have the same HLA type as your neighbour, and probably don’t even have exactly the same type as your sister. But let’s just focus for now on one HLA type, HLA-A2 (the most common HLA-A allele in North American Caucasians), because I want to see how good the Microsoft epitope prediction is.

There are several other on-line epitope prediction tools, and I haven’t tried all of them. One is at syfpeithi.de, another is at iedb.org. I’ve also written a couple of my own, just for fun, that are very simple-minded and crude. My own, which I’ve tested more extensively than any others, tend to catch “real” epitopes (i.e. those that occur naturally) as one of the top ten or twenty possibilities — rarely are my best scores the real epitopes, but it’s also rare to have a complete miss that doesn’t catch one in the top twenty or so.

A recent paper (Kotturi et al., Journal of Virology, May 2007, p. 4928–4940) looked at epitope prediction quite exhaustively — again this is something I want to talk about more extensively at a future date — and the bottom line was that epitope prediction was really helpful; it narrowed their search from thousands of peptides (that only caught two-thirds of the real epitopes) to a couple hundred (that caught more like 90% — but still missed a significant number of real epitopes, and still had around 90% false positives).

So, and this isn’t a careful test, let’s throw a few examples at the predictions and see how we do. I used an HIV nef protein that has at least 7 known epitopes that bind to HLA-A2 (if you’re playing along at home, the epitopes are ILKEPVHGV, VIYQYMDDL, VLDVGDAYFSV, ALQDSGLEV, IYQYMDDLYV, ELVNQIIEQL, and KYTAFTIPSI).

SYFPEITHI’s prediction does pretty well, catching 5 of the 7 in its top 25 scores; its first and third best were both true hits, and the other three were lower down in the ranking.

The IEDB tool did poorly, only finding one of the true epitopes in its top 25 (though it did give that one its highest score). To be fair, this prediction site needs a lot more fiddling than the others, and I didn’t spend much time tweaking it.

My own script catches 3 of the 7 in my top 25 scores, but none are in the top ten.

By comparison, the Epitope Predictor at http://atom.research.microsoft.com/bio/epipred.aspx (remember the Epitope Predictor? This here’s a post about the Epitope Predictor) catches 2 of the 7, ranking them number 1 and 3.
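For anyone who wants to repeat this kind of comparison, the bookkeeping is simple: take a tool's peptides sorted best score first, and see which known epitopes show up in the top 25 and at what rank. Here is a minimal Python sketch of that step. The function name is my own, and the ranking in the example is invented purely to show the mechanics; it is not the output of any of the tools above.

```python
# Known HLA-A2 epitopes listed in this post.
KNOWN = ["ILKEPVHGV", "VIYQYMDDL", "VLDVGDAYFSV", "ALQDSGLEV",
         "IYQYMDDLYV", "ELVNQIIEQL", "KYTAFTIPSI"]


def top_n_hits(ranked_peptides, known_epitopes, n=25):
    """Map each known epitope found in the top n to its 1-based rank.

    ranked_peptides must already be sorted best score first.
    Matching is by exact string, so a 10-mer epitope is only found
    if the tool actually ranked 10-mers.
    """
    known = set(known_epitopes)
    hits = {}
    for rank, peptide in enumerate(ranked_peptides[:n], start=1):
        if peptide in known and peptide not in hits:
            hits[peptide] = rank
    return hits


# Invented ranking for illustration only; NOT real output from any tool.
made_up_ranking = ["AAAAAAAAA", "ILKEPVHGV", "CCCCCCCCC", "ALQDSGLEV"]
hits = top_n_hits(made_up_ranking, KNOWN)
print(f"{len(hits)} of {len(KNOWN)} known epitopes in the top 25: {hits}")
```

Note that several of the known epitopes here are 10 or 11 residues long, so a tool that is only asked for 9-mers cannot catch them no matter how good its scoring is.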

So the bottom line, I think, is not that Microsoft sucks, but rather that epitope prediction is hard. There’s plenty of room for improvement (that’s part of the grant I’m working on). From this single example, SYFPEITHI — the granddaddy of epitope prediction — is pretty good, but even a very crude approach (mine) isn’t all that much worse.

Potentially, pooling approaches could be useful. Only one of the seven epitopes was missed by every system I tried; three were predicted by only one system (SYFPEITHI caught two, my script caught the other); and only one epitope was predicted by all four systems. On the other hand, there would be a lot more noise, too.
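One simple way to pool, sketched here in Python, is to take the top N from each predictor and count how many predictors "vote" for each peptide. This is just one possible scheme that I am assuming for illustration (there are plenty of others, such as averaging ranks), and the tool names, peptide names, and rankings below are placeholders, not real results.

```python
from collections import Counter


def pool_top_n(rankings, n=25):
    """Count, for each peptide, how many predictors place it in their top n.

    rankings maps a predictor name to its peptides, best score first.
    """
    votes = Counter()
    for ranked in rankings.values():
        votes.update(set(ranked[:n]))  # set(): at most one vote per predictor
    return votes


# Invented rankings and placeholder peptide names, for illustration only.
rankings = {
    "tool_a": ["PEP1", "PEP2", "PEP3"],
    "tool_b": ["PEP2", "PEP4", "PEP1"],
    "tool_c": ["PEP2", "PEP5", "PEP6"],
}
votes = pool_top_n(rankings, n=2)

# Most-voted peptides first; ties broken alphabetically.
pooled = sorted(votes, key=lambda p: (-votes[p], p))
print(pooled)
```

The noise problem is visible even in this toy case: each tool contributes 2 candidates, but the pooled list has 4, and only one of them has more than a single vote. With a top 25 from four tools the union could run to as many as 100 peptides.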

So how come epitope prediction is so hard?

More about that later.