Lots of people have been analyzing the new H1N1 influenza virus by comparing its sequence to the influenza sequences in various databases (I used the NCBI’s). How reliable are these databases?

Our observations show that a fraction of the sequences in the database exhibit anomalous properties that point to either radically new biology or, more likely, problems with the data. … We speculate that perhaps the most likely explanation for both of the anomalies reported here is stock contamination in the sequencing laboratories … 1

IVDB: Influenza Virus Sequence Distribution

(My emphasis) The data are worst for older viruses and for non-human, especially swine, influenza viruses.

As of 2008, the authors had identified around 100 of the 3300 genomes in the influenza databases (about 3%) as problematic, but they noted that, because of the way they identified the probable mixups, this is likely a significant underestimate of the problems: “If stock contamination is indeed to blame for these anomalies, the results reported here could represent just the tip of the iceberg.”

Thanks to Vincent Racaniello of The Virology Blog and This Week in Virology for pointing me to the paper. 1 I don’t know if the databases have been cleaned up in the year since the problems were noted, but I doubt it. That said, I think this type of error shouldn’t have a huge impact on tracking the evolution and origins of the new H1N1 virus; it would probably have the same effect as if no sequence had been deposited for a particular strain. (In other words, it leaves gaps, but for the most part doesn’t actively steer research in the wrong direction.)

