The question on Quora was:
My answer was:
A typical protein is about 350 amino acids long. I am not aware of any English or colloquial words that are 350 letters long. Very few, if any, functional proteins are fewer than 20 amino acids long, which is still very long for an English word.
Many protein sequences contain English words and names. ELVIS can be found in many proteins, but ELVISISALIVE hasn’t turned up yet. CRICK can be found in many, FRANKLIN appears once in a hypothetical protein from Treponema primitia, and of course WATSON is impossible, because O is not the one-letter code for any of the 20 standard amino acids.
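Whether a word can appear in a protein at all comes down to its letters. Here is a minimal sketch of that check, assuming only the 20 standard one-letter amino acid codes count. The names STANDARD_AMINO_ACIDS, missing_letters and spellable are made up for illustration, and real database entries can occasionally contain other letters, such as X for an unknown residue or U for selenocysteine.

```python
# The 20 standard amino acid one-letter codes (B, J, O, U, X, Z are not among them).
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def missing_letters(word):
    """Return the letters of `word` that are not standard amino acid codes."""
    return sorted(set(word.upper()) - STANDARD_AMINO_ACIDS)


def spellable(word):
    """True if `word` is purely alphabetic and uses only standard codes."""
    return word.isalpha() and not missing_letters(word)


if __name__ == "__main__":
    for name in ["ELVIS", "CRICK", "FRANKLIN", "WATSON"]:
        print(name, spellable(name), missing_letters(name))
```

Filtering the word list this way before searching also saves time, since any word containing B, J, O, U, X or Z can be skipped without looking at a single sequence.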
What’s the longest English word that can be found in the GenBank protein collection? Offhand, I don’t know (and it will change on a regular basis, at the rate the collection is growing). I bet I can find it in a few lines of code, though, and if no one beats me to it I’ll take a shot at it tomorrow; it’s too late tonight.
Update: The longest more or less English word I can find in the human reference sequence protein database is “TARGETEER”, 9 letters long. It’s found in several isoforms of “C12orf42”.
I only looked in the human reference sequence library, not the complete NCBI protein database, which would have taken too long to download (too long for the mild curiosity I had, anyway). This database has 72,204 protein sequences in it, with a total length of 46,315,661 amino acids; average protein length 636.4, median length 467.0, geometric mean length 468.5, distribution looking like this:
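Summary statistics like these can be computed in a few lines. The sketch below is not the original script: it assumes the sequences are in FASTA format (header lines starting with ">", sequence lines following), and the function names fasta_lengths and summarize are invented for the example. It needs Python 3.8 or later for statistics.geometric_mean.

```python
import statistics


def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths):
    """Return count, total, mean, median and geometric mean of the lengths."""
    lengths = list(lengths)
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "geometric_mean": statistics.geometric_mean(lengths),
    }


if __name__ == "__main__":
    # Toy input; a real run would pass an open FASTA file instead.
    toy = [">protein_1", "MKTARGETEER", ">protein_2", "AAELVIS", "AACRICK"]
    print(summarize(fasta_lengths(toy)))
```

Because an open file is itself an iterable of lines, the same functions work on a downloaded FASTA file without reading it all into memory first.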
For words, I used the built-in Unix dict (on my computer, /usr/share/dict/words), which contains 235,886 more or less English words ranging from 1 through 24 letters long (THYROPARATHYROIDECTOMIZE).
“TARGETEER” was the 119,925th-longest word in the dictionary, and since I started with the longest and worked down it was over halfway through the dictionary (50.8%) before I got the first hit. All in all, it took close to an hour to run in the background, with no attempt whatsoever at optimizing the script.
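For anyone who wants to try something similar, here is a minimal sketch of the longest-first search described above. It is not the original script. It assumes the words and sequences are already loaded as lists of strings, and the name longest_word_hit is made up for the example. Joining all sequences into one string with a separator that cannot occur in a word means each word needs only a single substring test.

```python
def longest_word_hit(words, sequences):
    """Return the longest word found inside any sequence, or None.

    Words are tried from longest to shortest, so the first hit is the answer.
    """
    # "|" never appears in a word, so no match can span two sequences.
    joined = "|".join(seq.upper() for seq in sequences)

    # Keep purely alphabetic words, upper-cased and de-duplicated.
    candidates = {w.strip().upper() for w in words if w.strip().isalpha()}

    # Longest first; alphabetical order breaks ties deterministically.
    for word in sorted(candidates, key=lambda w: (-len(w), w)):
        if word in joined:
            return word
    return None


if __name__ == "__main__":
    words = ["watson", "targeteer", "elvis", "a"]
    sequences = ["MKTARGETEERLL", "AAELVISAA"]
    print(longest_word_hit(words, sequences))  # TARGETEER
```

On a real run, the word list would come from a file such as /usr/share/dict/words and the sequences from a protein FASTA file. Skipping words that contain letters no standard amino acid uses (B, J, O, U, X, Z) before searching would cut the work further.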