Mystery Rays from Outer Space

Meddling with things mankind is not meant to understand. Also, pictures of my kids

October 4th, 2007

XPlasMap 0.96

I’ve released a new version of XPlasMap, version 0.96 (asymptotically approaching a non-beta release). XPlasMap 0.96 can be downloaded here, and the XPlasMap home page is here.

XPlasMap is a DNA drawing program for MacOSX (MacOS10.4 and up only for this release; a slightly older version runs on MacOS10.3 [download XPlasMap10.3 here]). It draws plasmid maps with all the features you’d expect (genes, multiple cloning sites, restriction sites, and so on), and editing is pretty much fully interactive. It also draws linear DNA maps and will draw maps by importing directly from GenBank files. It will also import from FastA files; for both FastA and GenBank sequences it will map out restriction sites (slowly! It’s no competition for specialized restriction mapping programs like EnzymeX or the venerable DNA Strider) and identify open reading frames (again, slowly). Maps can be saved as .xpmp files (which are simply XML; I wanted to make sure that the information in the maps would remain accessible and in a non-proprietary format), or exported to PNG or JPG.
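To give a flavour of what mapping restriction sites on a plasmid involves, here is a minimal Python sketch. It is not XPlasMap's actual code: the function names and the toy sequence are made up for illustration, and it assumes a palindromic recognition site (such as EcoRI's GAATTC), so only one strand needs scanning.

```python
def find_sites(sequence, recognition_site):
    """Return 0-based start positions of every occurrence of recognition_site."""
    seq = sequence.upper()
    site = recognition_site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping occurrences are also found.
        start = seq.find(site, start + 1)
    return positions


def find_sites_circular(sequence, recognition_site):
    """Like find_sites, but also catches sites that span the plasmid origin."""
    # Append the first (site length - 1) bases so a site that wraps around
    # the end of the sequence becomes visible to a linear scan.
    wrapped = sequence + sequence[: len(recognition_site) - 1]
    return [p for p in find_sites(wrapped, recognition_site) if p < len(sequence)]


# Toy sequence (hypothetical, not a real plasmid).
plasmid = "TTCGGATCCAAAGAATTCGGGGAA"
print(find_sites(plasmid, "GAATTC"))           # [12]
print(find_sites_circular(plasmid, "GAATTC"))  # [12, 21]
```

The second call finds an extra site at position 21 because the sequence ends in GAA and starts with TTC, which only forms a site when the map is circular.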

Here’s a sample plasmid map, for Invitrogen’s pTracerCMV2 (click on the image for a larger version):

pTracerCMV2 - XPlasMap

And here’s a sample of a linear DNA map (click for a larger version). This is the human genomic major histocompatibility region, imported directly from a GenBank file (3.7 million base pairs). The class I region is highlighted in orange, class III region in green, and class II region in blue.

HLA Genomic - XPlasMap

The 0.96 release is mainly a bug-fix release; there are preliminary versions of a couple of new features, with annotations being the main new feature.

New features:

  • Annotations
  • “Plasmid comment” is now free-form text (can be moved and edited)
  • (Preference option) Common actions on a toolbar

Bugfixes:

  • Improved print resolution
  • Fixed: Intermittent Clear Recent Files bug
  • Fixed: JPG and PNG exports use a large canvas with image only in one corner
  • Fixed: Error on copying in reverse
  • Fixed: Font preferences not always honored
  • Fixed: Going from linear to circular, genes disappear
  • Fixed: Contextual menus occasionally not responding
  • Fixed: Show/Hide enzyme lost after save
  • Fixed: Genes with no name disrupt drawing
  • Fixed: Freeze in Cut Plasmid
  • Fixed: Hiccup if no Preference file

Assorted other bugfixes and UI improvements

XPlasMap only runs on Macs (though it’s written in Python/wxPython, which means that it should be a straightforward recompile to run on other OSes — but I only use Macs so haven’t tried). I also wrote a much more primitive (but still rather attractive) on-line plasmid-mapping program that is OS-independent: Savage Plasmids draws SVG maps and exports to Postscript. Unfortunately browser support for SVG is still at best spotty and hasn’t improved much over the past couple of years, as far as I know.  SVG will do interactive, but I’ve never got around to making the program interactive (and likely never will, now), so XPlasMap really makes much nicer maps.


September 27th, 2007

Epitope prediction: The seven percent solution

How to catch flu (Wellcome Images)

I’ve talked several times (for example, here, here, and here) about predicting cytotoxic T lymphocyte (CTL) epitopes, and emphasized how hard it is (or, at least, how poor the tools are). Here’s an example of why it’s difficult.

(Quick review: CTL recognize virus-infected cells by screening small peptides that are bound to the class I major histocompatibility complex [MHC class I]. The peptides are created by destruction of proteins in the target cell. There’s a handy guide to antigen presentation here, if that helps put things into context.)

In my previous post on the subject, I listed a bunch of different factors that need to be incorporated in the predictions. Number 7 was “Binding to the MHC complex in the ER”, and I commented that peptide binding to MHC class I is probably the second-best understood step in the pathway (behind TAP transport, if you’re keeping score at home).

A paper from earlier this year1 tried to identify CTL epitopes in influenza viruses. Lots of papers do this, but most don’t follow up with actual, complete tests — too expensive and difficult. Wang et al did the follow through.

They started by looking simply at binding to MHC class I alleles. Without going into details (they were looking for conserved epitopes that matched HLA supertypes, if anyone cares) they identified 167 peptides that they predicted should bind to the various MHC class I alleles; and then they tested them to see if they actually did bind. (They used NetMHC 3.0 2 to predict binding.)

Of the 167 predicted binders, 39 failed to bind altogether, and another 39 only bound very weakly. That leaves 89 peptides (just 53% of their tested pool) that were authentic binders.

Influenza viruses infecting cells of the trachea

Then, they tested to see if their peptides actually reacted with CTL from healthy donors. (They assumed that their healthy donors were immune to influenza A, which is reasonable but not a guarantee, so this is a particularly conservative test, I think.) Just 13 of their peptides were positive by this test (7.8% of their total predicted pool). Unexpectedly, two peptides that were non-binders triggered a response. Wang et al speculated that the very low affinity binding was enough for the CTL, but I wonder if this represented a contamination issue: CTL are famously sensitive, and it’s well known that tiny amounts of contaminating peptides in a synthetic prep are enough to trigger CTL, even if they’re barely detectable by other means.

 
 

The paper I’ve thought of as the record-holder for accuracy (if I’m being generous with their denominator) is Kotturi et al,3 whose prediction was correct for 25 of 160 potential peptides, about twice as good as the influenza predictions here. But Kotturi et al were dealing with just two MHC class I alleles, H-2Db and H-2Kb, and those are very intensively studied alleles. Wang et al. were not only looking at multiple alleles, they were also using supertype approaches that allowed them to cover almost all (>99%) of the population, which is a much more difficult prediction. To me, then, their predictions are remarkably successful.

But still: Just over 7% of their predictions were correct. And even limiting the prediction to a single step in the complex pathway (just looking at MHC class I binding of the peptides), they’re barely above 50% accuracy.
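For anyone who wants to check the arithmetic, the funnel from predicted peptides to confirmed CTL epitopes can be laid out in a few lines of Python. The counts are the ones quoted above from Wang et al; the variable names are mine.

```python
# Counts quoted in the post for the influenza peptide screen.
predicted = 167      # peptides predicted to bind MHC class I
non_binders = 39     # failed to bind at all
weak_binders = 39    # bound only very weakly
ctl_positive = 13    # peptides that triggered a CTL response

# Authentic binders are whatever is left after removing both failure groups.
binders = predicted - non_binders - weak_binders

binding_rate = binders / predicted
ctl_rate = ctl_positive / predicted

print(f"Authentic binders: {binders} ({binding_rate:.1%} of predictions)")
print(f"CTL-positive: {ctl_positive} ({ctl_rate:.1%} of predictions)")
```

This reproduces the 89 binders (about 53%) and the 7.8% CTL-positive figure.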

It’s a hard job. But I have to say that the field is progressing with impressive speed; these predictions are much more accurate than I would have expected five years ago.



  1. Wang, M., Lamberth, K., Harndahl, M., Roder, G., Stryhn, A., Larsen, M. V., Nielsen, M., Lundegaard, C., Tang, S. T., Dziegiel, M. H., Rosenkvist, J., Pedersen, A. E., Buus, S., Claesson, M. H., and Lund, O. (2007). CTL epitopes for influenza A including the H5N1 bird flu; genome-, pathogen-, and HLA-wide screening. Vaccine 25, 2823-2831. []
  2. NetMHC is based on these three references — which I’m including as a note to myself: (1) Nielsen, M., Lundegaard, C., Worning, P., Hvid, C. S., Lamberth, K., Buus, S., Brunak, S., and Lund, O. (2004). Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 20, 1388-1397 . (2) Nielsen, M., Lundegaard, C., Worning, P., Lauemoller, S. L., Lamberth, K., Buus, S., Brunak, S., and Lund, O. (2003). Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci 12, 1007-1017 . (3) Buus, S., Lauemoller, S. L., Worning, P., Kesmir, C., Frimurer, T., Corbet, S., Fomsgaard, A., Hilden, J., Holm, A., and Brunak, S. (2003). Sensitive quantitative predictions of peptide-MHC binding by a ‘Query by Committee’ artificial neural network approach. Tissue Antigens 62, 378-384. []
  3. The CD8 T-Cell Response to Lymphocytic Choriomeningitis Virus Involves the L Antigen: Uncovering New Tricks for an Old Virus. Maya F. Kotturi, Bjoern Peters, Fernando Buendia-Laysa, Jr., John Sidney, Carla Oseroff, Jason Botten, Howard Grey, Michael J. Buchmeier, and Alessandro Sette. Journal of Virology, May 2007, p. 4928–4940 []
August 23rd, 2007

Epitope prediction: The bad and the ugly

ARB predictions, Peters et al 2006

When I was talking about Microsoft’s epitope prediction software, and when I discussed Kotturi’s update on LCMV epitopes, I made the point that predicting MHC class I epitopes is hard. How come it’s so hard?

First let’s define the question. MHC class I, the target ligand for cytotoxic T lymphocyte recognition, binds peptides of about 9 amino acids. These peptides are generated during proteolysis within the cytosol of the target cells1. CTL recognize those peptides that are derived from abnormal proteins (viral or tumour, for example), while ignoring those that come from normal cellular proteins (“self”). An average virus might encode, let’s say, 10,000 amino acids, 2 so there are 10,000 or so overlapping 9mers. Out of that potential ocean of peptides, there might be 10 or 20 that CTL see at all, and of those only two or three are going to be good (“immunodominant”) epitopes.

So the question is: Given the sequence of amino acids encoded by a virus, can we point to the particular 9mers that CTL will react to?
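The "ocean of peptides" is easy to generate; picking the right ones is the hard part. As a minimal sketch (the function name is my own, and it treats the viral proteome as a single sequence, which is a simplification), enumerating the overlapping 9mers looks like this:

```python
def overlapping_kmers(protein, k=9):
    """Return every overlapping k-residue window of a protein sequence."""
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


# A short made-up sequence, just to show the sliding window.
print(overlapping_kmers("ACDEFGHIKLM"))
# ['ACDEFGHIK', 'CDEFGHIKL', 'DEFGHIKLM']

# A 10,000-residue sequence gives 9,992 candidate 9mers.
print(len(overlapping_kmers("A" * 10000)))  # 9992
```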

To get an accurate answer, you’d need to do exhaustive scanning of all possible viral epitopes. This hasn’t been done much, but Kotturi et al3 did it and compared their findings to epitope prediction. Twenty-five of 160 predicted epitopes were real (16%) and their predictions missed three of 28 altogether (11%). 4
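Those two percentages are a hit rate among the predictions and a miss rate among the real epitopes. A quick Python check, using the counts quoted here plus the 400 peptides actually tested (mentioned in the footnote), with variable names of my own choosing:

```python
confirmed = 25             # predicted peptides that were real epitopes
predicted_post_hoc = 160   # predictions needed, as estimated after the fact
predicted_tested = 400     # predictions actually tested
real_epitopes = 28         # real epitopes found by exhaustive scanning
missed = 3                 # real epitopes the predictions did not include

# Fraction of predictions that were real (the generous denominator).
hit_rate = confirmed / predicted_post_hoc
# Same fraction using the number of peptides actually tested.
hit_rate_tested = confirmed / predicted_tested
# Fraction of real epitopes that the predictions missed.
miss_rate = missed / real_epitopes

print(f"{hit_rate:.1%} correct, {miss_rate:.1%} missed")
print(f"{hit_rate_tested:.2%} correct if all tested peptides are counted")
```

With the less generous denominator the hit rate falls from about 16% to about 6%.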

The two granddaddies of epitope prediction are BIMAS and SYFPEITHI. Kotturi used, I am pretty sure, either ARB MATRIX5 or something very close to it. (The figure at the top here is from Peters et al., Figure 2A: ARB Predictions for HLA-A*0201.) A more recent paper6 claims that pooling together multiple predictive methods gives higher accuracy than individual methods alone, but this isn’t available online:

The authors have elected not to make the HBM available online, for two reasons: first, frequent server outages and other problems with individual web-based tools often prevent acquisition of all the requisite scores. Automatic operation is therefore not possible. Second, the querying of all the web-based tools can take a long time, making the tool inconvenient for real-time web-based access. Interested researchers may, however, contact the authors regarding obtaining the scripts implementing the HBM.

There’s also the Microsoft tool I mentioned previously, as well as a bunch of other tools — the Trost and Peters papers both compare many of them.

I haven’t tested these myself, even to the extent of comparing predictions to database results (a crude measure). 7 So as far as I know (with the caveat that I haven’t followed this with rabid attention) the 16% positive/11% negative that Kotturi et al got is just about as good as anyone has done (and the ranking of tools in Trost et al shows ARB MATRIX as used by Kotturi et al. is only slightly worse than the pooled prediction tool they describe, so I wouldn’t expect much better results than that from other technologies). But still, some 15 years after MHC class I motifs were described — with the pathway at least reasonably well understood — 16% and 11% isn’t all that great. Why can’t we just point to the epitopes?

Here are the components of the pathway that need to be taken into account to successfully predict a CTL epitope:

  1. Protein expression. Is there enough of the precursor protein available to yield enough epitope?
  2. Proteasome cleavage. The proteasome has to cut precisely at the carboxy terminus of the epitope, though there’s a little room for error at the amino terminus. Also, the proteasome must not cleave in the middle of the potential epitope.
  3. Peptidase destruction. The epitope has to survive destruction by a bunch of very active peptidases in the cytosol.
  4. Transport into the ER. The TAP peptide transporter that carries peptides across the ER membrane has clear sequence preferences.
  5. Trimming and destruction by ER peptidases. If the TAP-transported peptide is too long, can it be converted into the right form? If the mature epitope is there, will it be destroyed?
  6. Transport out of the ER. There’s a system that pumps peptides out of the ER, but little if anything is known about it. Perhaps it’s just diffusion out of the Sec61 channel, or maybe it’s ERAD-related, or who knows what else.
  7. Binding to the MHC complex in the ER.
  8. Stimulating CTL. There’s a whole complicated set of interactions in that, too, but I’ll summarize it as a single step.
  9. Mystery factors that we don’t understand.

Of those 9 steps, I’d say that only one (TAP transport) is reasonably well defined as far as sequence requirements. Peptide binding to MHC class I is the next-best understood, though it’s not as simple as some people think. Protein expression level should be relatively easy, but it’s still not clear whether we need to look at total expression or levels of defective ribosomal products, or what. Predicting cleavage by the proteasome has been the subject of a lot of work, but it’s turned out to be a really difficult task; even the best algorithms are not, I think, very accurate. And I think there’s very little clue about most of the other factors.

I’ll talk more about each of the steps in other posts.



  1. I’ve said and typed that phrase so often that I’m pretty much on autopilot with it[]
  2. Divide by ten, multiply by ten, doesn’t much change the conclusion.[]
  3. Kotturi, M. F., Peters, B., Buendia-Laysa, F. J., Sidney, J., Oseroff, C., Botten, J., et al. (2007). The CD8+ T-cell response to lymphocytic choriomeningitis virus involves the L antigen: uncovering new tricks for an old virus. J Virol, 81(10), 4928-4940. []
  4. The quality of the predictions was not good, either, in that many of the strongly predicted epitopes only stimulated a very few CTL. As well, I’m being a little generous in granting them just 160 predictions; that’s the number they came up with post hoc as what they would have needed. In fact they tested 400 predicted epitopes. []
  5. Peters, B., Bui, H. H., Frankild, S., Nielson, M., Lundegaard, C., Kostem, E., et al. (2006). A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol, 2(6), e65.[]
  6. Trost, B., Bickis, M., & Kusalik, A. (2007). Strength in numbers: achieving greater accuracy in MHC-I binding prediction by combining the results from multiple prediction tools. Immunome Res, 3, 5.[]
  7. I’ll run some trials when I have time.[]
August 1st, 2007

Software special

My computer was having intermittent hard drive problems, and last week they got bad enough that I sent the Macbook Pro off for repair (free under warranty). The good news is they fixed it, as far as I can see; the bad news is they dropped in a new hard drive, so I’m starting from scratch.

I’m pretty paranoid about backups (four times a day, to external drives in two different locations) so I didn’t lose any significant data. The problem when this happens (the second time in a couple years) or when I get a new computer, is that I have to reload all the third-party applications I use; and that means figuring out what I need, finding them on the net, reinstalling, finding registration numbers in my saved mail, and so forth.

To save myself some time next time this happens, I’m noting down the apps I’ve reinstalled this time. And in case anyone else cares, here they are.

Commercial apps.

Commercial | Comments
Photoshop | I hardly ever use this, but it’s nice to have on hand
BookEnds | Reference and bibliography software. Much better than EndNote
DNA Strider | DNA/protein analysis and enzyme mapping. There are probably better free products out there, but 17 years of finger memory is hard to quit
Flowjo | Flow cytometry analysis. Registration is tied to the computer, need to update via the company
iWorks | Pages and Keynote from Apple. I don’t know if I’m actually going to reinstall this, I hardly ever use it
MS Office | Word, Excel, PowerPoint. Yes, I’ve tried the free alternatives. No, I don’t like them.

Freeware, donationware, and shareware.

Free- & shareware | Comments
4Peaks | DNA Sequencing reader
Adium | Instant messaging client for AIM, Jabber, MSN, Yahoo
Adium themes | “Old Phone” dock icon; “A Little Less Than Minimal” contact list; “Good Grey” message style
Big Cat Scripts | Contextual menu applescripts. Very useful and customizable
AppleWorks | I still have one legacy database in this that I haven’t transferred to SQLite
CyberDuck | FTP/sFTP
EnzymeX | DNA/Protein analysis. I should probably use this instead of Strider, but old habits die hard
FastScripts Lite | Menubar access to Applescripts. Not much difference from the builtin scripts folder
Firefox | Not my default browser right now, but useful for some things
Firefox extensions | Adblock Plus; Adblock Filterset; Quick Proxy; Download Statusbar
Gleam | Flickr uploader
Google Earth |
Google Hosted Mail Notifier | Menubar notice when I get new mail in my hosted gmail
Growl | System notifications, integrated with several other apps. Also remember to install growlnotify
Journler | I haven’t paid for this yet but probably will soon. Flawed, but better than the other organizers/notebooks I’ve tried.
JungleDiskMonitor | For my Amazon S3 account. Still in beta. I will probably pay for it when it’s out of beta, but haven’t decided for sure yet
Magic Number Machine | Stupid name and dock icon, but a good scientific calculator
MagiCal | Customize menu time and date display
MenuMeters | Menubar CPU use, also quick access to the Console
PandoraMan | For the Pandora Streaming Radio service, which kicks ass
Safari 3 | My default browser on Intel Macs, though it was less stable on PowerPC macs. Faster and smoother than Safari 2
ServiceScrubber | Eliminate crap from the Service menu, so you can find the few services that are actually useful
SlimBatteryMonitor | Uses less space in the menubar than the builtin battery monitor
Stuffit | Horrible company, but I still need the expander occasionally
Synergy | iTunes monitor and controller. I got a license for this years ago when it was by far the best controller, now competitors (including freeware) are catching up
Synk 5 | Backup software. I use it a lot. (The latest version is 6, but I wasn’t impressed by it and already have a licence for the previous version so stuck with v.5)
TextWrangler | Programming editor. Nothing else I’ve tried has been as versatile.
Vienna | RSS reader
VLC | Video viewer
WeatherMenu | Weather in the menubar. Now freeware, but I got a licence many years ago when it was by far the best available - now there are competitors that are close
Windows Media Player | I used to need this for baseball highlights and listening to games on MLB’s stream. Don’t know if I still need it, but hey.
X11 | Apple’s version of the X window system
XJournal | Livejournal blogging client
XCode | Apple Developer Tools
XMenu | Menubar access to apps. There are a million different apps for this, this is just as far as I got before I found one I was comfortable with

Programming in Python. Apple includes Python2.3 and wxPython2.6 (or so), which are a few versions back of the latest. I update to Python2.4 (the latest is 2.5, which I haven’t moved to yet) and the latest wxPython (2.8.4 as of now).

Python/modules | Comments
Python2.4 | Programming language. Packaged for OSX here
wxPython2.8.4 | For GUI integration
NumPy | Numeric modules. OSX package
PySQLite | SQLite database API. OSX package
Python Imaging Library (PIL) | OSX package
ElementTree | XML parsing
mxBase tools | Required for BioPython
py2app | Turn Python scripts into standalone apps
PyObjC | Python/Objective C bridge
appscript | Manage apple events with Python (just like Applescripts)
BioPython | Many bioscience-related modules
DarwinPorts | Unix-type apps ported to Darwin. I don’t use this just for Python, but it fits here better than elsewhere, I guess.

Dashboard widgets. I used to use widgets a lot, but have gradually moved away. I still use some of these regularly, though.

Widgets | Comments
Reminder Widget | Minimalist reminders (uses growlnotify). I wrote this one
Translate Widget | DNA formatting, translation, analysis. I wrote this one too
Digital World Clock | Know the time for my sister in Switzerland and my brother in Beijing
Flip Clock Widget | A clock for local time
iCal Events Widget | Upcoming events from iCal
Scoreboard Widget | Red Sox game tracking. Like I don’t already have it on the radio
AlbumArt Widget | iTunes track indicator, with album art
PEMDAS Widget | Calculator widget
Package Tracker | Track FedEx, UPS, DHL packages

Applescripts for /Library/Scripts. These are just things I slapped together myself; here to remind myself to migrate them over. If anyone cares I can make them available, but they’d likely need tweaking for general use.

Applescripts | Comments
File from Safari URLs | Save the URLs from all open tabs into a text file
File to Safari URLs | Open URLs from a text file into Safari
URLs -> Journler | Save URLs from all open tabs into Journler
Journler URLS -> Safari | Open all URLs from a Journler page in Safari
Format Excel Chart | Why doesn’t Excel let you save a default chart format? This makes the half-dozen formatting changes I invariably set
Format sequence | Take a DNA or protein sequence, make a tidy numbered output
Images size adjust | Quickly adjust the size of selected images
Strip sequence | Remove all non-DNA characters (including spaces and newlines) from a DNA sequence
Track -> XJournal | XJournal reads iTunes tracks, but doesn’t format classical properly. This is a fix for classical tracks
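
The two sequence scripts are simple enough that a rough Python equivalent fits in a few lines. This is a sketch of the idea rather than a translation of the Applescripts: the function names, the decision to keep N as a valid base, and the 60-per-line, blocks-of-10 layout are all my assumptions here.

```python
import re


def strip_sequence(text):
    """Drop everything that is not A, C, G, T or N (digits, spaces, newlines...)."""
    return re.sub(r"[^ACGTN]", "", text.upper())


def format_sequence(seq, line_width=60, block=10):
    """Lay a sequence out in numbered lines, split into blocks for readability."""
    lines = []
    for start in range(0, len(seq), line_width):
        chunk = seq[start:start + line_width]
        blocks = " ".join(chunk[i:i + block] for i in range(0, len(chunk), block))
        # The number is the 1-based position of the first residue on the line.
        lines.append(f"{start + 1:>6} {blocks}")
    return "\n".join(lines)


raw = "1 acgt nnac\n 61 GG-TT"
clean = strip_sequence(raw)
print(clean)  # ACGTNNACGGTT
print(format_sequence(clean, line_width=8, block=4))
```

Because format_sequence never inspects the letters, it works for protein sequences as well; only strip_sequence is DNA-specific.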

Finally, remember to copy back:

Files | Comments
~/Library/Keychains/login.keychain | Don’t lose your passwords
~/.bash_profile | Terminal shortcuts and formats
~/.pythonstartup | Python preferences


July 30th, 2007

Artificial immune systems

The July issue of Nature Reviews Immunology has an intriguingly-titled opinion piece by I.R. Cohen, “Real and artificial immune systems: computing the state of the body.” 1 Maybe I’m missing something, but I find it quite disappointing. Part of that is probably that I’m not exactly the audience he’s after, but I think that’s not all of it.

The paper starts off in a humble and apologetic tone with the explanation that “I attempt to show that reframing our view of the immune system in computational terms is worth our while”. To me it’s self-evident that framing the immune system in this way could be worthwhile (so long as we don’t abandon other frames, of course). I can easily imagine that there are lots of immunologists out there who are skeptical of it, though, and presumably those are the intended audience for the introduction.

Organic computer (Woody Igou)

Most of the first half of the paper seems to be a pretty basic introduction to some computational concepts (they’re concepts I’m familiar with, which means they must be basic). A couple of mildly interesting comments arise here, particularly the notion that “the immune system effectively computes the immunogenic state of the body”. I don’t think this is a deeply profound thought, just a re-framing of an overview of the immune system (as Cohen says; I’m not criticizing here, that’s what he said he was doing). Another interesting point is Cohen’s contrast of the immune system and a Turing system, in that the former is self-organizing: “It may therefore be said that the immune system creates and modifies its own program as it goes”.

At this point I was thinking that these were some interesting turns of phrase, but was wondering whether it was just semantics. A new approach to a field is interesting as far as it generates new questions or answers that can be tested experimentally, so I was looking, in the second half of the paper, for some examples of this; at least some of the kinds of questions that could be addressed. This was where I was really disappointed. The examples he offers as powerful outcomes of “reframing immune-system behavior in computational terms” don’t seem to me to be particularly powerful, or particularly dependent on the reframing.

He talks about “Natural immune reactivity to self-antigens” as one example, and “Assessing states of stress” as another. He claims that “The computational view of the immune system sees natural autoimmunity as a physiological mechanism for detecting and responding to the states of body cells and tissues”, and contrasts this to the “mainstream” view which has “little tolerance … for the idea that natural autoimmunity could serve some useful purpose”.

First, at least as I remember it, this notion of autoimmunity as functional is one that has repeatedly popped up throughout the history of immunology (so it’s not something that computational immunology has a unique handle on), and the reason it’s not widely accepted is not that it’s been rejected mindlessly, but that there’s never been much good evidence put forward for it. Rephrasing an old idea in the sparkly new wrapping du jour isn’t helpful unless it offers a new way to test it, which I don’t see here.

The other problem is that this seems to come out of the blue. He claims that the computational view sees autoimmunity in this way but doesn’t really show the connections; to be honest it comes across to me as someone with a particular hobbyhorse, trying to drum up support for an old idea. In other words, I’m convinced neither that this concept is new to computational immunology, nor that it’s particularly strongly suggested by computational immunology. To be fair, this is a short section in a short review paper, and in a longer treatment I might be more convinced by the argument.

I have the same concerns about the “states of stress” argument, though I will say that this basic concept is one I find much more plausible — though again I’m not convinced it’s particularly unique to this approach, or that it’s particularly strongly suggested by this approach.

So I’m a sympathetic audience to the basic concept that a computational approach to immunology could have some useful outcomes, but I’m not blown away by the examples in this paper. I’d like to see more explanation of the kinds of questions and answers this approach could provide, with concrete examples.



  1. Real and artificial immune systems: computing the state of the body. Cohen IR. Nat Rev Immunol. 2007 Jul;7(7):569-74. []
June 14th, 2007

Epitopes and Microsoft Computational Biology

Microsoft has released as open-source some code for analysis of antiviral immunity (http://atom.research.microsoft.com/bio/). They offer 4 tools: PhyloD, Epitope Predictor, HLA Completion, and HLA Assignment. The first two are particularly interesting to me.

PhyloD is

a statistical tool that can identify HIV mutations that defeat the function of the HLA proteins in certain patients, thereby allowing the virus to escape elimination by the immune system. By applying this tool to large studies of infected patients, researchers are now able to start decoding the complex rules that govern the HIV mutations, in the hope of one day creating a vaccine to which the virus is unable to develop resistance.

The reference is to Bhattacharya et al., Science 16 March 2007: Vol. 315. no. 5818, pp. 1583 - 1586. It’s work that arises directly out of Bruce Walker’s (and others, but mostly Walker’s) work on HIV immune escape variants, which dates back to the late 1990s. I want to talk about immune escape in HIV some time, but that’s going to be a long post and I have a grant due, so I’m just going to move on to the second interesting tool, the Epitope Predictor. “This tool computes the probability that a given kmer is a T-cell epitope restricted to a given HLA allele”; the reference is Heckerman et al., RECOMB 2006, which I haven’t read yet.

This is interesting to me because it’s something I’m working on directly as well. Epitope prediction is a remarkably difficult job to do well — it’s easy to take a first pass and drastically narrow down your possibilities, but getting an accurate end product is hard.

Epitopes, in this case, are sequences of amino acids that are cut out of the full-length protein and recognized by the T cells. A full-length protein might be 500 or 1000 or more amino acids long, whereas epitopes are typically 9 amino acids long. A generic virus, say HIV, will have thousands, tens of thousands, of peptides of the appropriate length. There are moderate constraints on what can be turned into epitopes, because the peptides have to bind to HLA molecules. (HLA, human leukocyte antigen, is the species-specific term for MHC, major histocompatibility complex. I tend to use MHC, but to avoid, or at least reduce, confusion, I’ll stick to HLA here.) HLA molecules have binding rules: “Anchor” positions of the peptide must fit certain patterns. For example, a peptide that binds to one particular human MHC allele (HLA-A3) will usually have a leucine, valine, or methionine at position 2, a lysine, tyrosine, or phenylalanine at the last position, and is fairly likely to have one of two amino acids at position 3, one of five at position 6, and one of four at position 7. So still fairly broad, but much narrower than the 20 to the 9th possibilities with no restrictions at all.
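To make the anchor idea concrete, here is a minimal Python sketch that checks only the two main HLA-A3 anchors described above (position 2 and the last position) and works out how much that narrows the space of 9mers. It deliberately ignores the weaker preferences at positions 3, 6 and 7, and it is a toy filter, not how any of the real prediction tools score peptides; the names are my own.

```python
# One-letter codes for the anchor residues named in the text.
P2_ANCHORS = set("LVM")          # leucine, valine, methionine at position 2
C_TERMINAL_ANCHORS = set("KYF")  # lysine, tyrosine, phenylalanine at the end


def matches_a3_anchors(peptide):
    """True if the peptide has an allowed residue at position 2 and at the C terminus."""
    return (
        len(peptide) >= 2
        and peptide[1] in P2_ANCHORS
        and peptide[-1] in C_TERMINAL_ANCHORS
    )


# All possible 9mers with no restrictions: 20 choices at each of 9 positions.
unrestricted = 20 ** 9
# With both anchors fixed to 3 choices each, the other 7 positions stay free.
anchored = 3 * 20 ** 7 * 3

print(matches_a3_anchors("ALAAAAAAK"))  # True
print(matches_a3_anchors("AAAAAAAAK"))  # False
print(unrestricted)                     # 512000000000
print(anchored / unrestricted)          # 0.0225
```

So the two anchors alone cut the field to about 2% of all possible 9mers, which is a big reduction but still leaves over eleven billion sequences in principle.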

Humans, like almost all vertebrates, are wildly complex at the MHC genes — you don’t have the same HLA type as your neighbour, and probably don’t even have exactly the same type as your sister. But let’s just focus for now on one HLA type, HLA-A2 (the most common HLA-A allele in North American caucasians), because I want to see how good the Microsoft epitope prediction is.

There are several other on-line epitope prediction tools, and I haven’t tried all of them. One is at syfpeithi.de, another is at iedb.org. I’ve also written a couple of my own, just for fun, that are very simple-minded and crude. My own, which I’ve tested more extensively than any others, tend to catch “real” epitopes (i.e. those that occur naturally) as one of the top ten or twenty possibilities — rarely are my best scores the real epitopes, but it’s also rare to have a complete miss that doesn’t catch one in the top twenty or so.

A recent paper (Kotturi et al., Journal of Virology, May 2007, p. 4928–4940) looked at epitope prediction quite exhaustively — again this is something I want to talk about more extensively at a future date — and the bottom line was that epitope prediction was really helpful; it narrowed their search from thousands of peptides (that only caught two-thirds of the real epitopes) to a couple hundred (that caught more like 90% — but still missed a significant number of real epitopes, and still had around 90% false positives).

So, and this isn’t a careful test, let’s throw a few examples at the predictions and see how we do. I used an HIV nef protein that has at least 7 known epitopes that bind to HLA-A2 (if you’re playing along at home, the epitopes are ILKEPVHGV, VIYQYMDDL, VLDVGDAYFSV, ALQDSGLEV, IYQYMDDLYV, ELVNQIIEQL, and KYTAFTIPSI).

SYFPEITHI’s prediction does pretty well, catching 5 of the 7 in their top 25 scores; their first and third best were both true hits, and the other three were lower down in their ranking.

The IEDB tool did poorly, only finding one of the true epitopes in its top 25 (though it did give that one its highest score). To be fair, this prediction site needs a lot more fiddling than the others, and I didn’t spend much time tweaking it.

My own script catches 3 of the 7 in my top 25 scores, but none are in the top ten.

By comparison, the Epitope Predictor at http://atom.research.microsoft.com/bio/epipred.aspx (remember the Epitope Predictor? This here’s a post about the Epitope Predictor) catches 2 of the 7 correctly; ranking them number 1 and 3.

So the bottom line, I think, is not that Microsoft sucks, but rather that epitope prediction is hard. There’s plenty of room for improvement (that’s part of the grant I’m working on). From this single example, SYFPEITHI — the granddaddy of epitope prediction — is pretty good, but even a very crude approach (mine) isn’t all that much worse.

Potentially, pooling approaches could be useful. Only one of the seven epitopes here was not predicted by any of the systems I tried here; three were only predicted by one of the systems (SYFPEITHI caught two, I caught the other); and only one epitope was predicted by all four systems. On the other hand, there would be a lot more noise, too.
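The bookkeeping for that kind of pooling is straightforward. Here is a minimal Python sketch that counts how many predictors place each peptide in their top N; the tool names and peptide labels are placeholders rather than the actual rankings from the test above, and simple vote counting is only one of many ways the results could be combined.

```python
from collections import Counter


def pooled_votes(rankings, top_n=25):
    """Count, for each peptide, how many predictors rank it within their top_n."""
    votes = Counter()
    for ranked_peptides in rankings.values():
        # Use a set so a peptide counts at most once per predictor.
        for peptide in set(ranked_peptides[:top_n]):
            votes[peptide] += 1
    return votes


# Placeholder rankings (best first) from three hypothetical predictors.
rankings = {
    "tool_a": ["P1", "P2", "P3"],
    "tool_b": ["P2", "P4", "P1"],
    "tool_c": ["P2", "P5", "P6"],
}
known_epitopes = {"P1", "P2", "P6"}

votes = pooled_votes(rankings, top_n=2)
found_by_any = {p for p in known_epitopes if votes[p] > 0}
found_by_all = {p for p in known_epitopes if votes[p] == len(rankings)}
missed = known_epitopes - found_by_any

print(sorted(found_by_any))  # ['P1', 'P2']
print(sorted(found_by_all))  # ['P2']
print(sorted(missed))        # ['P6']
```

Taking the union of the top lists catches more real epitopes, but every false positive from every tool comes along too, which is the extra noise.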

So how come epitope prediction is so hard?

More about that later.

