Wesley R. Elsberry
Joined: May 2002
How many codes could there be?
The simple answer is "lots". The canonical genetic code has 64 entries coding for 21 different things (20 amino acids plus a "stop" signal). It is called "degenerate", which is a fancy way of saying that there are more codons than there are things coded for, which results in some redundancy in the canonical code. If you look closely at how codons are matched with amino acids, you will likely notice that in many cases a change in the third base of the codon results in no change in the coded-for amino acid. This results in a typical clustering of codons, so that a change of one base has roughly a one-in-four chance of causing no change at all in what is coded for (138 of the 576 possible single-base substitutions are silent, counting every codon and every substitution equally). In other words, the canonical genetic code is not as "brittle" as it could be. I hope to explore that more thoroughly later.
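As a rough check on the redundancy described above, here is a minimal sketch that counts how often a single-base change leaves the coded-for symbol unchanged. It assumes the standard codon table (bases in T, C, A, G order, with "*" standing for "stop") and weights every codon and every substitution equally, which real mutation processes do not. The names BASES, AMINO, CODE and synonymous_fraction are just illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard codon table, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction(code):
    """Fraction of single-base substitutions that leave the coded symbol unchanged."""
    same = total = 0
    for codon, aa in code.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a substitution
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                same += code[mutant] == aa
    return same / total

print(f"{synonymous_fraction(CODE):.3f}")
```

Under those assumptions the fraction works out to 138/576. Weighting codons by how often they are actually used in a genome would shift the figure.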
But back to the question of interest. How many "genetic codes" could there be? Let me be clear here. The phrase "genetic code" is sometimes sloppily used to refer to the specific sequence of bases observed in the genome of an organism. That's not the way I am using it here. The "genetic code" is used here to mean the way in which triplets of nucleotide bases are mapped to corresponding amino acids for the purpose of protein synthesis. Figuring out how many different ways such a code can be instantiated can be approached through combinatorial "counting rules".
The first "counting rule" of interest is the factorial function. Given some positive integer number n of items, the factorial is defined as the product of all the integers from 1 through n. The number of different ways 64 distinct symbols can be arranged in a sequence is factorial(64) (or 64!), or about 1.268869e89. Since a degenerate genetic code doesn't have 64 different symbols, but rather 64 positions filled by only 21 symbols, this represents an upper bound on the number of possible genetic codes using triplet codons.
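The upper bound is easy to reproduce. A minimal sketch using Python's standard library (the variable name upper_bound is just illustrative):

```python
from math import factorial

# Number of orderings of 64 distinct symbols.
upper_bound = factorial(64)
print(f"{upper_bound:.6e}")  # prints 1.268869e+89
```

Python integers are exact, so the rounding happens only when the value is formatted for display.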
So what counting rule gives us what we want? The answer is the "partition rule". It tells us the number of distinct ways that k different symbols can be arranged to fill n spaces when we know how many of each of the k symbols there are. The rule is
n! / (m_1! * m_2! * ... * m_k!)
where m_i is the number of instances of the i-th symbol and the sum of m_1 through m_k = n.
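The partition rule can be sketched as a small function. This is an illustration rather than anything canonical; the name partition_count is made up here, and it takes the list of per-symbol counts m_1 through m_k.

```python
from math import factorial, prod

def partition_count(counts):
    """n! / (m_1! * m_2! * ... * m_k!), where n is the sum of the counts."""
    n = sum(counts)
    # The quotient is always a whole number, so integer division is exact.
    return factorial(n) // prod(factorial(m) for m in counts)

# Toy case: two A's and one B can be arranged as AAB, ABA, BAA.
print(partition_count([2, 1]))  # prints 3
```

When every count is 1 the rule reduces to the plain factorial, which is why 64! serves as the upper bound.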
For 21 symbols, the worst case situation (the fewest possible codes) would be if most of the code specified a single amino acid. This occurs if one symbol is repeated 44 times and the remaining 20 symbols have 1 instance each. In this case, application of the partition rule tells us that there are about 4.8e34 possible codes of that sort.
The best case situation (the most possible codes) is where all the symbols are as nearly evenly represented as possible. This is the case when one symbol has 4 instances and the remaining 20 symbols each have 3 instances. In this case, there are about 1.4e72 possible different codes of that sort.
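Both extremes can be checked by plugging the two distributions into the partition rule. A minimal sketch (partition_count, worst and best are illustrative names):

```python
from math import factorial, prod

def partition_count(counts):
    """n! / (m_1! * ... * m_k!) for the given per-symbol counts."""
    return factorial(sum(counts)) // prod(factorial(m) for m in counts)

# Worst case: one symbol fills 44 positions, the other 20 symbols get one each.
worst = partition_count([44] + [1] * 20)
# Best case: one symbol gets 4 positions, the other 20 symbols get 3 each.
best = partition_count([4] + [3] * 20)

print(f"worst: {worst:.1e}")
print(f"best:  {best:.1e}")
```

Each list of counts sums to 64, one entry per symbol, so the two cases differ only in how evenly the 64 positions are shared out.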
If we take the distribution of symbols in the canonical code, we have 64! / (4!6!2!2!2!2!2!4!2!3!6!2!1!2!4!3!6!4!1!2!4!), or about 2.3e69 possible different codes of that sort.
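The canonical distribution does not have to be typed in by hand; it can be tallied from the codon table itself. The sketch below assumes the standard table written as a 64-character string (codons in T, C, A, G order, "*" for stop) and applies the partition rule to the resulting counts. The names AMINO, counts and canonical are illustrative.

```python
from collections import Counter
from math import factorial, prod

# Standard codon table, one letter per codon ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# How many codons map to each of the 21 symbols.
counts = Counter(AMINO)

canonical = factorial(len(AMINO)) // prod(factorial(m) for m in counts.values())

print(sorted(counts.values(), reverse=True))
print(f"{canonical:.1e}")
```

The tally gives three symbols with 6 codons, five with 4, two with 3, nine with 2, and two with 1, matching the factors in the denominator above.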
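It is interesting that the actual canonical genetic code has a distribution that would permit almost as many variants as the very best case situation: it falls short of the best case by a factor of roughly 600, while exceeding the worst case by more than 34 orders of magnitude.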
Next up will be considering what the numbers mean for evolutionary biology.
"You can't teach an old dogma new tricks." - Dorothy Parker