Mystery Rays from Outer Space

Meddling with things mankind is not meant to understand. Also, pictures of my kids

June 24th, 2015

Are there any proteins that, when sequenced, have segments that spell English or colloquial words?

The question on Quora was:

Are there any proteins that, when sequenced, have segments that spell English or colloquial words?

My answer was:

A typical protein is about 350 amino acids long.  I am not aware of any English or colloquial words that are 350 letters long.  Very few, if any, functional proteins are less than 20 amino acids long, which is still very long for English words.

Many protein sequences contain within them English words and names. ELVIS can be found in many proteins, but ELVISISALIVE hasn’t turned up yet.  CRICK can be found in many; FRANKLIN appears once, in a hypothetical protein from Treponema primitia (WP_010253273); and of course WATSON is impossible, because no standard amino acid has the one-letter code O.

What’s the longest English word that can be found in the GenBank protein collection? Offhand, I don’t know (and it will change on a regular basis, at the rate the collection is growing).  I bet I can find it in a few lines of code, though, and if no one beats me to it I’ll take a shot at it tomorrow; it’s too late tonight.

Update: The longest more or less English word I can find in the human reference sequence protein database is “TARGETEER”, 9 letters long.  It’s found in several isoforms of “C12orf42”, e.g. uncharacterized protein C12orf42 isoform 1 [Homo sapiens].

I only looked in the human reference sequence library, not the complete NCBI protein database, which would have taken too long to download (too long for the mild curiosity I had, anyway).  This database has 72,204 protein sequences in it, with a total length of 46,315,661 amino acids; average protein length 636.4, median length 467.0, geometric mean length 468.5, distribution looking like this:

For words, I used the built-in Unix dictionary (on my computer, /usr/share/dict/words), which contains 235,886 more or less English words ranging from 1 through 24 letters long (THYROPARATHYROIDECTOMIZE, TETRAIODOPHENOLPHTHALEIN, SCIENTIFICOPHILOSOPHICAL, PATHOLOGICOPSYCHOLOGICAL, and FORMALDEHYDESULPHOXYLATE, if you’re playing Scrabble).

“TARGETEER” was the 119,925th-longest word in the dictionary, and since I started with the longest and worked down it was over halfway through the dictionary (50.8%) before I got the first hit.  All in all, it took close to an hour to run in the background, with no attempt whatsoever at optimizing the script.
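For anyone who wants to repeat the search, here is a minimal sketch in Python. It is not the script described above, which is not shown; instead of testing each dictionary word in turn, it slides a window along each protein and looks the fragment up in a set, which should be considerably faster. The names `read_fasta` and `longest_dictionary_hits` are made up for illustration, and the sketch assumes plain one-letter FASTA sequences and ignores rare codes such as U and X.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def longest_dictionary_hits(words, proteins):
    """Return (length, {word: set of protein headers}) for the longest words found.

    words    -- iterable of candidate words (any case)
    proteins -- iterable of (header, sequence) pairs
    """
    # Group words by length, skipping any word that uses a letter outside
    # the 20 standard amino acid codes (assumption: U, X, etc. are ignored).
    by_length = {}
    for word in words:
        word = word.strip().upper()
        if word and set(word) <= STANDARD_AMINO_ACIDS:
            by_length.setdefault(len(word), set()).add(word)

    proteins = list(proteins)
    # Work down from the longest words and stop at the first length with a hit.
    for length in sorted(by_length, reverse=True):
        candidates = by_length[length]
        hits = {}
        for header, seq in proteins:
            for start in range(len(seq) - length + 1):
                fragment = seq[start:start + length]
                if fragment in candidates:
                    hits.setdefault(fragment, set()).add(header)
        if hits:
            return length, hits
    return 0, {}
```

To use it, open the word list and a protein FASTA file, then call `longest_dictionary_hits(word_file, read_fasta(fasta_file))`. The result is the length of the longest word found and the proteins that contain each word of that length.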

March 20th, 2009

Software I like

Evernote example
This is searchable text in Evernote

It’s not like I have much influence, but I want to give a quick shout out to three pieces of software that I’m finding useful in the lab: DokuWiki, Evernote, and Dropbox.

DokuWiki I’ve mentioned before; I use it as my electronic lab notebook.  It has all the standard wiki features (easy links, images, and text), with the major advantage over several other wikis that the pages are essentially stored as plain text, making backups, searches, and futureproofing relatively easy. A little rsync magic means that the version on my laptop is auto-synced to a remote copy for backup.

Evernote is a note-storage service.  It has a web interface and desktop apps for Mac and Windows, as well as iPhone integration, and it has useful gimmicks like OCR that can turn handwriting into searchable text.  The iPhone integration is what turned this from a mildly useful service to one I use every day.  All the scribbled notes I make to myself in the lab, I now take snapshots of; they’re dumped into Evernote, and then when I write up the experiment, or when I’m replicating it, I have exactly what I did at my fingertips.  And for things that otherwise take a thousand words (like the number of colonies I get from a transformation of a particular ligation), a photo can be a better explanation.  Dump the photo into DokuWiki, and I don’t have to wonder if “LOTS OF COLONIES” means a hundred, or what.

Evernote example
More searchable text

Dropbox is file sharing.  Put a Dropbox folder on your computer, and anything you put in that folder is silently and promptly synced to any other computer you use (Mac, Windows, Linux).   Even more usefully, symlinks work; take any folders you’re working on, and put symlinks to them in Dropbox, and forget about anything else.  Any time you work on a file or add anything new, the changes are instantly synced and made available on all the other machines. There’s no iPhone client, yet, but iStorage and similar iPhone apps work beautifully with it; so I essentially have my entire computer in my pocket all the time.

But that’s not what makes it so useful in the lab.  It also allows folders to be shared between different people.  That means that for relatively large files and folders (flow cytometry runs in the 100 MB range, confocal experiments that are two or three times that) my students and tech don’t have to fuss with compression and emails and hunting me down with flash drives or whatever.  Just drop the experiment in the shared Lab Folder on any of the computers, and a moment later it silently appears on my laptop.  My collaborator in Greece and I are editing a grant application; it’s in our shared Dropbox folder, and whenever he makes changes they’re instantly reflected on my machine, and vice versa.

All three of these are freeware, though Evernote and Dropbox have paid versions with higher capacity.  I haven’t needed them yet, but probably will eventually, and they make my life easy enough that I’ll be happy to shell out for them.

February 19th, 2009

Google vs. influenza

It seems that influenza is a popular target for internet-based research; perhaps because it’s so common and well-known that population trends can be picked up accurately this way.

Five scientists from Google, and one from the CDC, have published evidence in Nature1 that Google search terms are accurate ways of measuring influenza epidemics.  Their influenza tool is available online, along with an explanation of the techniques involved.  Their accuracy seems pretty decent, as the figure below shows: red traces are CDC-recorded cases, and black traces are the cases as predicted from Google searches.

Google influenza searches
A comparison of model estimates for the mid-Atlantic region (black) against CDC-reported ILI percentages (red)1

There are a surprising number of on-line maps for influenza and avian influenza, although as far as I know they’re all much more descriptive (all based on reported cases) than Google’s version, which is (sort of) predictive.  For example, there’s the avian influenza outbreak map, various maps from the WHO, and the CDC’s set of maps.  (There’s also Bird Flu Breaking News, which occasionally links to my posts, but the site seems to be broken; too bad, because if I remember correctly, it had an interesting variant on maps that was conceptually related to Google’s, showing where new discussion on avian flu was located.)

  1. Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, Larry Brilliant (2009). Detecting influenza epidemics using search engine query data. Nature, 457 (7232), 1012-1014. DOI: 10.1038/nature07634
February 3rd, 2009

Virus vs tumor (vs ninja vs pirate): Computational oncolysis

Jennerex oncolytic virus
“Jennerex” virus (green) replicating within a tumor mass.

Viruses infect cells, and quite often (depending on the type of virus) destroy the cell they’re infecting. Usually having your cells destroyed is a bad thing for the host, because you need those cells. But there are some cells that you don’t want to have, and it might be convenient to have a controlled virus destroy those cells for you.

The obvious example would be cancer cells, and in fact there’s a flourishing research industry that is trying to harness the destructive power of viruses to eliminate tumors. The trick, of course, is to have the virus infect only the cancer cells, not the normal healthy cells you want to keep; and the general approach is to take advantage of some of the common features of viruses and cancers. Cancers have to mutate to overcome some of the same cellular controls that viruses do (both viruses and cancers need to overcome the regulation of uncontrolled genome replication, for example). If you cripple some of these viruses, then, they can’t replicate in normal cells, but can replicate in, and destroy, cancer cells that have mutated the appropriate pathway.

(I talk about oncolytic viruses in more detail here.)

As often happens with these intriguing cancer treatments, oncolytic viruses seem to work sometimes, and not to work sometimes, and it’s not always clear why not. In a paper in PLoS One the other day,1 Wodarz and Komarova attempt to come up with a way of predicting which tumors will and will not respond to oncolytic virus therapy, using a computational approach.

As I think I’ve said before, I’m excited by the concept of computational biology. (I’m not so much talking about bioinformatics as such here, but rather about attempts to model AND PREDICT complex biological processes.) But I’ve been kind of disappointed by some of the process. It’s seemed to me that when simple processes are modeled we haven’t really learned very much new, and when complex processes are modeled the assumptions are often too simplistic to make a reasonable prediction. I don’t think we’re at a point where we can usefully model an immune system, for example.

Wodarz oncolysis model
Growth of cancer in a mouse in the presence of  oncolytic virus 1

However, I do think there is a class of problems in the middle where the approach is more successful, and though I’m not really able to critically assess their results I think over the years Dominik Wodarz has done a good job of identifying these problems. Questions like the emergence of drug resistance in cancer,2 effects of vaccination on HIV,3 and so on seem like the kind of problem where mathematical analysis can actually get a handle on the issues and help guide research to some extent. Again, I don’t feel that I can really judge the results, but I like the approach. What’s more, Wodarz seems to at least consider experimental evidence in his analyses, which isn’t always the case in these computational things.

(I don’t think it’s a coincidence that many of the questions he’s asked are microcosmic versions of ecological issues.  My impression is that population biology has a much longer and more successful history of mathematical analysis than do cell and molecular biology.)

This particular paper leads to a conclusion that, once reached, seems fairly obvious in hindsight, but it’s one I haven’t seen explicitly made before. (I am not in the field, and it may be taken for granted.  That said, one hallmark of a successful prediction is that everyone immediately says they knew it all along.) Briefly, tumor growth rate per se turns out to be relatively unimportant, whereas growth patterns are important. If the cancerous cells are relatively spread out in the tumor, then an oncolytic virus has a good chance of eliminating the tumor; whereas if the cancer cells are in clumps, the virus is much less effective. This is simply because in the clumpy masses most of the infected cells are contacting already-infected cells, and the only route to reach new targets is from the surface of the clump, so spread is inefficient.

In one group, virus growth is relatively fast because the infected cells are dispersed among the uninfected cells rather than being clustered together. In this case most infected cells contribute to virus spread. In these models, there is a clear viral replication rate threshold beyond which the number of cancer cells drops to levels of the order of one or less, corresponding to extinction in practical terms. … In the other category, infected cells are assumed to be clustered together to some degree in a mass, which might be realistic for solid tumors. In this case, only the infected cells located at the surface of the cluster contribute to virus spread because they are in the vicinity of uninfected cells. … In this scenario, virus therapy is more difficult. 1
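To get a feel for how quickly the surface effect bites, here is a back-of-the-envelope sketch in Python. It is not the model from the paper; it simply assumes (my simplification) that a cluster of infected cells is a solid cube on a lattice and counts how many of its cells sit on the outside, where they could touch uninfected neighbours. The function name `surface_fraction` is hypothetical.

```python
def surface_fraction(side):
    """Fraction of cells on the surface of a solid cube of side**3 cells."""
    if side < 1:
        raise ValueError("side must be at least 1")
    total = side ** 3
    # Interior cells form a smaller cube with one layer stripped from each face.
    interior = max(side - 2, 0) ** 3
    return (total - interior) / total


for side in (2, 10, 100):
    percent = round(100 * surface_fraction(side), 1)
    print(side ** 3, "cells:", percent, "% on the surface")
```

Under this assumption, roughly half the cells in a 1,000-cell cluster are on the surface, but only about 6% in a million-cell cluster, so the share of infected cells that can contribute to spread shrinks as the clump grows.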

  1. Dominik Wodarz, Natalia Komarova (2009). Towards Predictive Computational Models of Oncolytic Virus Therapy: Basis for Experimental Validation and Model Selection. PLoS ONE, 4 (1). DOI: 10.1371/journal.pone.0004271
  2. Drug resistance in cancer: principles of emergence and prevention. Komarova NL, Wodarz D. Proc Natl Acad Sci U S A. 2005 Jul 5;102(27):9714-9.
  3. Immunity and protection by live attenuated HIV/SIV vaccines. Wodarz D. Virology. 2008 Sep 1;378(2):299-305.
July 24th, 2008

HIV and immunodominance, again

HIV model

One of the reasons HIV can persist in infected people, in spite of a powerful and effective cytotoxic T cell immune response against the virus, is that the virus mutates rapidly. Because CTL each only target a short stretch of the genome (say, 9 amino acids) and a single amino acid change may allow the virus to escape recognition by a particular CTL clone, it may not take long for a viral mutant to arise that is invisible to the dominant CTL population in a particular individual.

It’s been suggested that immunodominance is one of the factors that determines the rate at which HIV can escape from a particular immune response. In a highly immunodominant response, most of the CTL specific for the virus all target a single peptide epitope. If the virus manages to mutate this peptide, it has escaped the bulk of the immune response, and the new mutant virus can explode unchecked (until a new CTL response arises).

On the other hand, if the CTL response isn’t dominated by a single epitope (that is, if the response is broad, targeting many peptides), the virus has to simultaneously mutate several regions of its genome, which is exponentially less probable than a single mutation. Then again, a broad CTL response would typically have fewer cells attacking each individual epitope, so perhaps the overall control might not be as good during the peak response.
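The "exponentially less probable" part can be made concrete with a tiny Python sketch. It assumes (a simplification of mine, not something from the paper discussed below) that each targeted epitope escapes independently and with the same probability; the value 1e-4 is purely illustrative and the function name is hypothetical.

```python
def simultaneous_escape_probability(p_single, n_epitopes):
    """Probability that all n independently targeted epitopes mutate at once.

    Assumes independent escape mutations with the same probability
    p_single at every epitope (an illustrative simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n_epitopes < 0:
        raise ValueError("n_epitopes must be non-negative")
    return p_single ** n_epitopes


# Illustrative only: 1e-4 is a made-up per-epitope escape probability.
for n in (1, 2, 3):
    print(n, "epitope(s):", simultaneous_escape_probability(1e-4, n))
```

Each additional targeted epitope multiplies the probability by the same small number, which is why a broad response is so much harder to escape in one step than a narrow one.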

Directly analyzing these questions is a huge task. Identifying CTL epitopes isn’t easy even when there are a few of them; looking at HIV changes isn’t easy even when there’s a concrete starting point; and in an infected patient you would need to track CTL recognition and HIV changes at short intervals, and over a long period; a task even more complicated by all the variables of a massively diverse starting population, replication and fitness issues … just an overwhelming problem.

A paper in PLoS Computational Biology1 tries to model these possibilities.

Organic computer

I don’t feel competent to assess the model here, in any technical way. As with most bench scientists, I suspect, I’m at best cautious, and more often outright skeptical, about computer modeling of biological problems, especially when they’re as complex as these ones. For example, the authors list a dozen parameters they took from various sources: maximal CTL proliferation rate, natural death rate of CD4 cells, and so on. (Not to mention assumptions that aren’t explicit.) Lots of these parameters are offered as single numbers: 0.01 d⁻¹ as the death rate of CD4 target cells. Naturally, each of those numbers would have error bars in the original, and they probably weren’t all measured in comparable ways, and so on. I doubt anyone would be much surprised if any of those parameters was off by 50% or more; perhaps much more. Cumulatively, how much error is in there? Or do we count on having all the errors more or less cancel out?

Still (again, probably typical of bench scientists) I’m always intrigued by computer modeling, and I’m willing to accept that modeling might well open up a problem enough to suggest new approaches. Encouragingly, the model here fits observation reasonably well; escape variants pop up intermittently over a couple of years, CTL clones decline as their targets mutate away. The model looks rather similar, in some ways, to the study a couple of years ago on a pair of identical twins infected with HIV. 2

One interesting observation from the model is that escape variants are mostly all present within a couple of years of infection, though they may later reappear as if they are new as CTL pressure varies:

After about two years, the virus population stabilizes as the ‘easy’ escapes have been done, the replicative capacity is partially restored and only few escapes are expected to appear later during infection. … If an escape is found to happen late it does not necessarily mean that it had not been selected earlier during infection

An observation and prediction arising from this is that CTL may actually become more effective later in infection (all other things being equal, of course), as further attempts by the virus to escape bump up against more severe fitness costs for the virus.

Another observation is the effect of immunodominance. A highly immunodominant CTL response results in more escape variants, as predicted by other studies. However, since escape variants are usually less fit than the Platonic essence of HIV, even though there are more cells infected with virus, that virus is less fit; so even a highly immunodominant response may be surprisingly (to me) effective, by forcing the virus into an unfit state.

A higher degree of immunodominance leads to more frequent escape with a reduced control of viral replication but a substantially impaired replicative capacity of the virus.

Presumably (I don’t think the authors of this model addressed this directly) the effectiveness (quantitatively) of an immunodominant response would depend on the fitness cost — in other words, an immunodominant response that could be escaped with only a small loss in fitness would be ineffective, whereas one that forces a big hit in fitness to escape would be effective. That would reflect what we know about the connection between elite suppressors and particular MHC class I alleles associated with immunodominant epitopes.

I’ve been rather unimpressed by highly immunodominant responses to HIV, but if this model is accurate, such responses may not be as bad as I thought; though broad responses are probably still more desirable.

  1. Althaus CL, De Boer RJ (2008) Dynamics of Immune Escape during HIV/SIV Infection. PLoS Comput Biol 4(7): e1000103. doi:10.1371/journal.pcbi.1000103
  2. Draenert R, Allen T, Liu Y, Wrin T, Chappey C, et al. (2006) Constraints on HIV-1 evolution and immunodominance revealed in monozygotic adult twins infected with the same virus. J Exp Med 203: 529-39
October 4th, 2007

XPlasMap 0.96

I’ve released a new version of XPlasMap, version 0.96 (asymptotically approaching a non-beta release). XPlasMap 0.96 can be downloaded here, and the XPlasMap home page is here.

XPlasMap is a DNA drawing program for MacOSX (MacOS10.4 and up only for this release; a slightly older version runs on MacOS10.3 [download XPlasMap10.3 here]). It draws plasmid maps with all the features you’d expect (genes, multiple cloning sites, restriction sites, and so on), pretty much interactively. It also draws linear DNA maps and will draw maps by importing directly from GenBank files. It will also import from FastA files; for both FastA and GenBank sequence it will map out restriction sites (slowly! It’s no competition for specialized restriction mapping programs like EnzymeX or the venerable DNA Strider) and identify open reading frames (again, slowly). Maps can be saved as .xpmp files (which is simply an XML format; I wanted to make sure that the information in the maps would remain accessible and in a non-proprietary format), or exported to PNG or JPG.
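As an illustration of what restriction-site mapping involves, here is a minimal Python sketch. It is not XPlasMap’s code; the function names are hypothetical, and only three well-known enzymes are included. All three recognition sites are palindromic, so scanning one strand is enough.

```python
# Recognition sequences for three common enzymes (all palindromic).
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, site, circular=False):
    """Return 0-based start positions of `site` in `sequence`."""
    sequence = sequence.upper()
    site = site.upper()
    # For a circular plasmid, append the first few bases to the end so
    # that sites spanning the origin are also found.
    if circular:
        search = sequence + sequence[:len(site) - 1]
    else:
        search = sequence
    positions = []
    start = search.find(site)
    while start != -1:
        positions.append(start)
        start = search.find(site, start + 1)
    return positions


def map_sites(sequence, circular=False):
    """Return {enzyme name: list of site positions} for the known enzymes."""
    return {
        name: find_sites(sequence, site, circular)
        for name, site in RECOGNITION_SITES.items()
    }
```

Calling `map_sites(plasmid_sequence, circular=True)` on a plasmid returns the start position of every site for each enzyme, including any site that wraps around the origin. A real mapper would also need the cut position within each site and a much larger enzyme table.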

Here’s a sample plasmid map, for Invitrogen’s pTracerCMV2 (click on the image for a larger version):

pTracerCMV2 - XPlasMap

And here’s a sample of a linear DNA map (click for a larger version). This is the human genomic major histocompatibility region, imported directly from a GenBank file (3.7 million base pairs). The class I region is highlighted in orange, class III region in green, and class II region in blue.

HLA Genomic - XPlasMap

The 0.96 release is mainly a bug-fix release; there are preliminary versions of a couple of new features, with annotations being the main new feature.

New features:

  • Annotations
  • “Plasmid comment” is now free-form text (can be moved and edited)
  • (Preference option) Common actions on a toolbar


Fixes and improvements:

  • Improved print resolution
  • Fixed: Intermittent Clear Recent Files bug
  • Fixed: JPG and PNG exports use a large canvas with image only in one corner
  • Fixed: Error on copying in reverse
  • Fixed: Font preferences not always honored
  • Fixed: Going from linear to circular, genes disappear
  • Fixed: Contextual menus occasionally not responding
  • Fixed: Show/Hide enzyme lost after save
  • Fixed: Genes with no name disrupt drawing
  • Fixed: Freeze in Cut Plasmid
  • Fixed: Hiccup if no Preference file

Assorted other bugfixes and UI improvements

XPlasMap only runs on Macs (though it’s written in Python/wxPython, which means that it should be a straightforward recompile to run on other OSes — but I only use Macs so haven’t tried). I also wrote a much more primitive (but still rather attractive) on-line plasmid-mapping program that is OS-independent: Savage Plasmids draws SVG maps and exports to Postscript. Unfortunately browser support for SVG is still at best spotty and hasn’t improved much over the past couple of years, as far as I know.  SVG will do interactive, but I’ve never got around to making the program interactive (and likely never will, now), so XPlasMap really makes much nicer maps.

September 27th, 2007

Epitope prediction: The seven percent solution

How to catch flu (Wellcome Images)

I’ve talked several times (for example, here, here, and here) about predicting cytotoxic T lymphocyte (CTL) epitopes, and emphasized how hard it is (or, at least, how poor the tools are). Here’s an example of why it’s difficult.

(Quick review: CTL recognize virus-infected cells by screening small peptides that are bound to the class I major histocompatibility complex [MHC class I]. The peptides are created by destruction of proteins in the target cell. There’s a handy guide to antigen presentation here, if that helps put things into context.)

In my previous post on the subject, I listed a bunch of different factors that need to be incorporated in the predictions. Number 7 was “Binding to the MHC complex in the ER”, and I commented that peptide binding to MHC class I is probably the second-best understood step in the pathway (behind TAP transport, if you’re keeping score at home).

A paper from earlier this year1 tried to identify CTL epitopes in influenza viruses. Lots of papers do this, but most don’t follow up with actual, complete tests (too expensive and difficult). Wang et al. did the follow-through.

They started by looking simply at binding to MHC class I alleles. Without going into details (they were looking for conserved epitopes that matched HLA supertypes, if anyone cares) they identified 167 peptides that they predicted should bind to the various MHC class I alleles; and then they tested them to see if they actually did bind. (They used NetMHC 3.0 2 to predict binding.)

Of the 167 predicted binders, 39 failed to bind altogether, and another 39 only bound very weakly. That leaves 89 peptides (just 53% of their tested pool) that were authentic binders.

Influenza viruses infecting cells of the trachea

Then, they tested to see if their peptides actually reacted with CTL from healthy donors. (They assumed that their healthy donors were immune to influenza A, which is reasonable but not a guarantee, so this is a particularly conservative test, I think.) Just 13 of their peptides were positive by this test (7.8% of their total predicted pool). Unexpectedly, two peptides that were non-binders triggered a response. Wang et al. speculated that the very low affinity binding was enough for the CTL, but I wonder if this represented a contamination issue: CTL are famously sensitive, and it’s well known that tiny contaminating peptides in a synthetic prep are enough to trigger CTL, even if they’re barely detectable by other means.


The paper I’ve thought of as the record-holder for accuracy (if I’m being generous with their denominator) is Kotturi et al,3 whose prediction was correct for 25 of 160 potential peptides, about twice as good as the influenza predictions here. But Kotturi et al were dealing with just two MHC class I alleles, H-2Db and H-2Kb, and those are very intensively studied alleles. Wang et al. were not only looking at multiple alleles, they were using supertype approaches that allow them to cover almost all (>99%) of the population, which is a much more difficult prediction. To me, then, their predictions are remarkably successful.

But still: Just over 7% of their predictions were correct. And even limiting the prediction to a single step in the complex pathway (just looking at MHC class I binding of the peptides) they’re barely above 50% accuracy.
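The percentages above are easy to recompute. This short Python sketch just lays out the funnel from the counts quoted in this post; the function name `funnel` and the stage labels are my own.

```python
def funnel(total, stages):
    """Return {stage name: (count, percent of total)} for a list of stages."""
    return {
        name: (count, round(100.0 * count / total, 1))
        for name, count in stages
    }


# Counts quoted in this post for Wang et al.
predicted = 167
non_binders = 39
weak_binders = 39
binders = predicted - non_binders - weak_binders
ctl_positive = 13

summary = funnel(
    predicted,
    [
        ("predicted", predicted),
        ("bound MHC class I", binders),
        ("recognized by CTL", ctl_positive),
    ],
)
for name, (count, percent) in summary.items():
    print(name, count, percent)
```

Each stage is expressed as a share of the original 167 predictions, which is the denominator used throughout this post.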

It’s a hard job. But I have to say that the field is progressing with impressive speed; these predictions are much more accurate than I would have expected five years ago.

  1. Wang, M., Lamberth, K., Harndahl, M., Roder, G., Stryhn, A., Larsen, M. V., Nielsen, M., Lundegaard, C., Tang, S. T., Dziegiel, M. H., Rosenkvist, J., Pedersen, A. E., Buus, S., Claesson, M. H., and Lund, O. (2007). CTL epitopes for influenza A including the H5N1 bird flu; genome-, pathogen-, and HLA-wide screening. Vaccine 25, 2823-2831.
  2. NetMHC is based on these three references, which I’m including as a note to myself: (1) Nielsen, M., Lundegaard, C., Worning, P., Hvid, C. S., Lamberth, K., Buus, S., Brunak, S., and Lund, O. (2004). Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 20, 1388-1397. (2) Nielsen, M., Lundegaard, C., Worning, P., Lauemoller, S. L., Lamberth, K., Buus, S., Brunak, S., and Lund, O. (2003). Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci 12, 1007-1017. (3) Buus, S., Lauemoller, S. L., Worning, P., Kesmir, C., Frimurer, T., Corbet, S., Fomsgaard, A., Hilden, J., Holm, A., and Brunak, S. (2003). Sensitive quantitative predictions of peptide-MHC binding by a ‘Query by Committee’ artificial neural network approach. Tissue Antigens 62, 378-384.
  3. The CD8 T-Cell Response to Lymphocytic Choriomeningitis Virus Involves the L Antigen: Uncovering New Tricks for an Old Virus. Maya F. Kotturi, Bjoern Peters, Fernando Buendia-Laysa, Jr., John Sidney, Carla Oseroff, Jason Botten, Howard Grey, Michael J. Buchmeier, and Alessandro Sette. Journal of Virology, May 2007, p. 4928–4940
August 23rd, 2007

Epitope prediction: The bad and the ugly

ARB predictions, Peters et al 2006

When I was talking about Microsoft’s epitope prediction software, and when I discussed Kotturi’s update on LCMV epitopes, I made the point that predicting MHC class I epitopes is hard. How come it’s so hard?

First let’s define the question. MHC class I, the target ligand for cytotoxic T lymphocyte recognition, binds peptides of about 9 amino acids. These peptides are generated during proteolysis within the cytosol of the target cells1. CTL recognize those peptides that are derived from abnormal proteins (viral or tumour, for example), while ignoring those that come from normal cellular proteins (“self”). An average virus might encode, let’s say, 10,000 amino acids,2 so there are 10,000 or so overlapping 9mers. Out of that potential ocean of peptides, there might be 10 or 20 that CTL see at all, and of those only two or three are going to be good (“immunodominant”) epitopes.

So the question is: Given the sequence of amino acids encoded by a virus, can we point to the particular 9mers that CTL will react to?
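The size of the search space is easy to see with a short Python sketch. It simply enumerates every overlapping window of 9 residues; the function names are hypothetical, and treating the virus as a single 10,000-residue sequence is a simplification for counting purposes.

```python
def overlapping_peptides(protein, length=9):
    """Yield (start, peptide) for every window of `length` residues."""
    for start in range(len(protein) - length + 1):
        yield start, protein[start:start + length]


def count_candidates(protein_lengths, length=9):
    """Number of candidate peptides across proteins of the given lengths."""
    return sum(max(n - length + 1, 0) for n in protein_lengths)


# A single 10,000-residue sequence yields 10,000 - 9 + 1 candidate 9mers.
print(count_candidates([10000]))
```

Split the same 10,000 residues across several proteins and the count drops slightly, because each protein loses eight windows at its end; either way it stays close to 10,000 candidates for a couple of dozen real epitopes.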

To get an accurate answer, you’d need to do exhaustive scanning of all possible viral epitopes. This hasn’t been done much, but Kotturi et al3 did it and compared their findings to epitope prediction. Twenty-five of 160 predicted epitopes were real (16%) and their predictions missed three of 28 altogether (11%). 4

The two granddaddies of epitope prediction are BIMAS and SYFPEITHI. Kotturi used, I am pretty sure, either ARB MATRIX5 or something very close to it. (The figure at the top here is from Peters et al., Figure 2A: ARB Predictions for HLA-A*0201.) A more recent paper6 claims that pooling together multiple predictive methods gives higher accuracy than individual methods alone, but this isn’t available online:

The authors have elected not to make the HBM available online, for two reasons: first, frequent server outages and other problems with individual web-based tools often prevent acquisition of all the requisite scores. Automatic operation is therefore not possible. Second, the querying of all the web-based tools can take a long time, making the tool inconvenient for real-time web-based access. Interested researchers may, however, contact the authors regarding obtaining the scripts implementing the HBM.

There’s also the Microsoft tool I mentioned previously, as well as a bunch of other tools — the Trost and Peters papers both compare many of them.

I haven’t tested these myself, even to the extent of comparing predictions to database results (a crude measure). 7 So as far as I know (with the caveat that I haven’t followed this with rabid attention) the 16% positive/11% negative that Kotturi et al got is just about as good as anyone has done (and the ranking of tools in Trost et al shows ARB MATRIX as used by Kotturi et al. is only slightly worse than the pooled prediction tool they describe, so I wouldn’t expect much better results than that from other technologies). But still, some 15 years after MHC class I motifs were described — with the pathway at least reasonably well understood — 16% and 11% isn’t all that great. Why can’t we just point to the epitopes?

Here are the components of the pathway that need to be taken into account to successfully predict a CTL epitope:

  1. Protein expression. Is there enough of the precursor protein available to yield enough epitope?
  2. Proteasome cleavage. The proteasome has to cut precisely at the carboxy terminus of the epitope, though there’s a little room for error at the amino terminus. Also, the proteasome must not cleave in the middle of the potential epitope.
  3. Peptidase destruction. The epitope has to survive destruction by a bunch of very active peptidases in the cytosol.
  4. Transport into the ER. The TAP peptide transporter that carries peptides across the ER membrane has clear sequence preferences.
  5. Trimming and destruction by ER peptidases. If the TAP-transported peptide is too long, can it be converted into the right form? If the mature epitope is there, will it be destroyed?
  6. Transport out of the ER. There’s a system that pumps peptides out of the ER, but little if anything is known about it. Perhaps it’s just diffusion out of the Sec61 channel, or maybe it’s ERAD-related, or who knows what else.
  7. Binding to the MHC complex in the ER.
  8. Stimulating CTL. There’s a whole complicated set of interactions in that, too, but I’ll summarize it as a single step.
  9. Mystery factors that we don’t understand.

Of those 9 steps, I’d say that only one (TAP transport) is reasonably well defined as far as sequence requirements. Peptide binding to MHC class I is the next-best understood, though it’s not as simple as some people think. Protein expression level should be relatively easy, but it’s still not clear whether we need to look at total expression or levels of defective ribosomal products, or what. Predicting cleavage by the proteasome has been the subject of a lot of work, but it’s turned out to be a really difficult task; even the best algorithms are not, I think, very accurate. And I think there’s very little clue about most of the other factors.

I’ll talk more about each of the steps in other posts.

  1. I’ve said and typed that phrase so often that I’m pretty much on autopilot with it
  2. Divide by ten, multiply by ten, doesn’t much change the conclusion.
  3. Kotturi, M. F., Peters, B., Buendia-Laysa, F. J., Sidney, J., Oseroff, C., Botten, J., et al. (2007). The CD8+ T-cell response to lymphocytic choriomeningitis virus involves the L antigen: uncovering new tricks for an old virus. J Virol, 81(10), 4928-4940.
  4. The quality of the predictions was not good, either, in that many of the strongly predicted epitopes only stimulated a very few CTL. As well, I’m being a little generous in granting them just 160 predictions; that’s the number they came up with post hoc as what they would have needed; in fact they tested 400 predicted epitopes.
  5. Peters, B., Bui, H. H., Frankild, S., Nielsen, M., Lundegaard, C., Kostem, E., et al. (2006). A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol, 2(6), e65.
  6. Trost, B., Bickis, M., & Kusalik, A. (2007). Strength in numbers: achieving greater accuracy in MHC-I binding prediction by combining the results from multiple prediction tools. Immunome Res, 3, 5.
  7. I’ll run some trials when I have time.

August 1st, 2007

Software special

My computer was having intermittent hard drive problems, and last week they got bad enough that I sent the Macbook Pro off for repair (free under warranty). The good news is they fixed it, as far as I can see; the bad news is they dropped in a new hard drive, so I’m starting from scratch.

I’m pretty paranoid about backups (four times a day, to external drives in two different locations) so I didn’t lose any significant data. The problem when this happens (the second time in a couple of years) or when I get a new computer, is that I have to reload all the third-party applications I use; and that means figuring out what I need, finding them on the net, reinstalling, finding registration numbers in my saved mail, and so forth.

To save myself some time next time this happens, I’m noting down the apps I’ve reinstalled this time. And in case anyone else cares, here they are.

Commercial apps.

Photoshop I hardly ever use this, but it’s nice to have on hand
BookEnds Reference and bibliography software. Much better than EndNote
DNA Strider DNA/protein analysis and enzyme mapping. There are probably better free products out there, but 17 years of finger memory is hard to quit
Flowjo Flow cytometry analysis. Registration is tied to the computer, need to update via the company
iWorks Pages and Keynote from Apple. I don’t know if I’m actually going to reinstall this, I hardly ever use it
MS Office Word, Excel, PowerPoint. Yes, I’ve tried the free alternatives. No, I don’t like them.

Freeware, donationware, and shareware.

Free- & shareware
4Peaks DNA Sequencing reader
Adium Instant messaging client for AIM, Jabber, MSN, Yahoo
Adium themes “Old Phone” dock icon
“A Little Less Than Minimal” contact list
“Good Grey” message style
Big Cat Scripts Contextual menu applescripts. Very useful and customizable
AppleWorks I still have one legacy database in this that I haven’t transferred to SQLite
CyberDuck FTP/sFTP
EnzymeX DNA/Protein analysis. I should probably use this instead of Strider, but old habits die hard
FastScripts Lite Menubar access to Applescripts. Not much difference from the builtin scripts folder
Firefox Not my default browser right now, but useful for some things
Firefox extensions Adblock Plus
Adblock Filterset
Quick Proxy
Download Statusbar
Gleam Flickr uploader
Google Earth  
Google Hosted Mail Notifier Menubar notice when I get new mail in my hosted gmail
Growl System notifications, integrated with several other apps. Also remember to install growlnotify
Journler I haven’t paid for this yet but probably will soon. Flawed, but better than the other organizers/notebooks I’ve tried.
JungleDiskMonitor For my Amazon S3 account. Still in beta. I will probably pay for it when it’s out of beta, but haven’t decided for sure yet
Magic Number Machine Stupid name and dock icon, but a good scientific calculator
MagiCal Customize menu time and date display
MenuMeters Menubar CPU use, also quick access to the Console
PandoraMan For the Pandora Streaming Radio service, which kicks ass
Safari 3 My default browser on Intel Macs, though it was less stable on PowerPC macs. Faster and smoother than Safari 2
ServiceScrubber Eliminate crap from the Service menu, so you can find the few services that are actually useful
SlimBatteryMonitor Uses less space in the menubar than the builtin battery monitor
Stuffit Horrible company, but I still need the expander occasionally
Synergy iTunes monitor and controller. I got a license for this years ago when it was by far the best controller, now competitors (including freeware) are catching up
Synk 5 Backup software. I use it a lot. (The latest version is 6, but I wasn’t impressed by it and already have a licence for the previous version so stuck with v.5)
TextWrangler Programming editor. Nothing else I’ve tried has been as versatile.
Vienna RSS reader
VLC Video viewer
WeatherMenu Weather in the menubar. Now freeware, but I got a licence many years ago when it was by far the best available – now there are competitors that are close
Windows Media Player I used to need this for baseball highlights and listening to games on MLB’s stream. Don’t know if I still need it, but hey.
X11 Apple’s version of the X window system
XJournal Livejournal blogging client
XCode Apple Developer Tools
XMenu Menubar access to apps. There are a million different apps for this, this is just as far as I got before I found one I was comfortable with

Programming in Python. Apple includes Python2.3 and wxPython2.6 (or so), which are a few versions back of the latest. I update to Python2.4 (the latest is 2.5, which I haven’t moved to yet) and the latest wxPython (2.8.4 as of now).

Python/modules Comments
Python2.4 Programming language. Packaged for OSX here
wxPython2.8.4 For GUI integration
NumPy Numeric modules. OSX package
PySQLite SQLite database API. OSX package
Python Imaging Library (PIL) OSX package
ElementTree XML parsing
mxBase tools Required for BioPython
py2app Turn Python scripts into standalone apps
PyObjC Python/Objective C bridge
appscript Manage apple events with Python (just like Applescripts)
BioPython Many bioscience-related modules
DarwinPorts Unix-type apps ported to Darwin. I don’t use this just for Python, but it fits here better than elsewhere, I guess.

Dashboard widgets. I used to use widgets a lot, but have gradually moved away. I still use some of these regularly, though.

Widgets Comments
Reminder Widget Minimalist reminders (uses growlnotify). I wrote this one
Translate Widget DNA formatting, translation, analysis. I wrote this one too
Digital World Clock Know the time for my sister in Switzerland and my brother in Beijing
Flip Clock Widget A clock for local time
iCal Events Widget Upcoming events from iCal
Scoreboard Widget Red Sox game tracking. Like I don’t already have it on the radio
AlbumArt Widget iTunes track indicator, with album art
PEMDAS Widget Calculator widget
Package Tracker Track FedEx, UPS, DHL packages

Applescripts for /Library/Scripts. These are just things I slapped together myself; here to remind myself to migrate them over. If anyone cares I can make them available, but they’d likely need tweaking for general use.

Applescripts Comments
File from Safari URLs Save the URLs from all open tabs into a text file
File to Safari URLs Open URLs from a text file into Safari
URLs -> Journler Save URLs from all open tabs into Journler
Journler URLS -> Safari Open all URLs from a Journler page in Safari
Format Excel Chart Why doesn’t Excel let you save a default chart format? This makes the half-dozen formatting changes I invariably set
Format sequence Take a DNA or protein sequence, make a tidy numbered output
Images size adjust Quickly adjust the size of selected images
Strip sequence Remove all non-DNA characters (including spaces and newlines) from a DNA sequence
Track -> XJournal XJournal reads iTunes tracks, but doesn’t format classical properly. This is a fix for classical tracks
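For anyone who would rather not wait for the Applescripts, the two sequence scripts are easy to approximate in Python. This is a minimal sketch of the same idea, not a port of my Applescripts; the function names, the default alphabet (ACGTN), and the 60-per-line, blocks-of-10 layout are all assumptions made for the example.

```python
def strip_sequence(raw, alphabet="ACGTN"):
    """Upper-case `raw` and drop every character not in `alphabet`.

    Removes spaces, newlines, digits, and anything else that is not
    a sequence character. The default alphabet is an assumption;
    pass a different one for protein or IUPAC-coded sequences.
    """
    allowed = set(alphabet.upper())
    return "".join(ch for ch in raw.upper() if ch in allowed)


def format_sequence(seq, line_width=60, block=10):
    """Return `seq` as numbered lines of space-separated blocks.

    Each line starts with the 1-based position of its first residue,
    right-aligned to the width of the largest position number.
    """
    number_width = len(str(len(seq)))
    lines = []
    for start in range(0, len(seq), line_width):
        chunk = seq[start:start + line_width]
        blocks = " ".join(
            chunk[i:i + block] for i in range(0, len(chunk), block)
        )
        lines.append(f"{start + 1:>{number_width}} {blocks}")
    return "\n".join(lines)


if __name__ == "__main__":
    messy = "1 acgtacgtac gtacg\n"
    print(format_sequence(strip_sequence(messy), line_width=10, block=5))
```

Stripping first and formatting second means the formatter never has to care about whatever numbering or spacing the sequence arrived with.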

Finally, remember to copy back:

Files Comments
~/Library/Keychains/login.keychain Don’t lose your passwords
~/.bash_profile Terminal shortcuts and formats
~/.pythonstartup Python preferences

July 30th, 2007

Artificial immune systems

The July issue of Nature Reviews Immunology has an intriguingly-titled opinion piece by I.R. Cohen, “Real and artificial immune systems: computing the state of the body.” 1 Maybe I’m missing something, but I find it quite disappointing. Part of that is probably that I’m not exactly the audience he’s after, but I think that’s not all of it.

The paper starts off in a humble and apologetic tone with the explanation that “I attempt to show that reframing our view of the immune system in computational terms is worth our while”. To me it’s self-evident that framing the immune system in this way could be worthwhile (so long as we don’t abandon other frames, of course). I can easily imagine that there are lots of immunologists out there who are skeptical of it, though, and presumably those are the intended audience for the introduction.

[Image: Organic computer (Woody Igou)]

Most of the first half of the paper seems to be a pretty basic introduction to some computational concepts (they’re concepts I’m familiar with, which means they must be basic). A couple of mildly interesting comments arise here, particularly the notion that “the immune system effectively computes the immunogenic state of the body”. I don’t think this is a deeply profound thought, just a re-framing (as Cohen says; I’m not criticizing here, that’s what he said he was doing) of an overview of the immune system. Another interesting point is Cohen’s contrast of the immune system and a Turing system, in that the former is self-organizing: “It may therefore be said that the immune system creates and modifies its own program as it goes”.

At this point I was thinking that these were some interesting turns of phrase, but was wondering whether it was just semantics. A new approach to a field is interesting as far as it generates new questions or answers that can be tested experimentally, so I was looking, in the second half of the paper, for some examples of this; at least some of the kinds of questions that could be addressed. This was where I was really disappointed. The examples he offers as powerful outcomes of “reframing immune-system behavior in computational terms” don’t seem to me to be particularly powerful, or particularly dependent on the reframing.

He talks about “Natural immune reactivity to self-antigens” as one example, and “Assessing states of stress” as another. He claims that “The computational view of the immune system sees natural autoimmunity as a physiological mechanism for detecting and responding to the states of body cells and tissues”, and contrasts this to the “mainstream” view which has “little tolerance … for the idea that natural autoimmunity could serve some useful purpose”.

First, at least as I remember it, this notion of autoimmunity as functional is one that has repeatedly popped up throughout the history of immunology (so it’s not something that computational immunology has a unique handle on), and the reason it’s not widely accepted is not that it’s been rejected mindlessly, but that there’s never been much good evidence put forward for it. Rephrasing an old idea in the sparkly new wrapping du jour isn’t helpful unless it offers a new way to test it, which I don’t see here.

The other problem is that this seems to come out of the blue. He claims that the computational view sees autoimmunity in this way but doesn’t really show the connections; to be honest it comes across to me as someone with a particular hobbyhorse, trying to drum up support for an old idea. In other words, I’m convinced neither that this concept is new to computational immunology, nor that it’s particularly strongly suggested by computational immunology. To be fair, this is a short section in a short review paper, and in a longer treatment I might be more convinced by the argument.

I have the same concerns about the “states of stress” argument, though I will say that this basic concept is one I find much more plausible — though again I’m not convinced it’s particularly unique to this approach, or that it’s particularly strongly suggested by this approach.

So I’m a sympathetic audience to the basic concept that a computational approach to immunology could have some useful outcomes, but I’m not blown away by the examples in this paper. I’d like to see more explanation of the kinds of questions and answers this approach could provide, with concrete examples.

  1. Real and artificial immune systems: computing the state of the body. Cohen IR. Nat Rev Immunol. 2007 Jul;7(7):569-74.