franky172
Posts: 158 Joined: Jan. 2007

Apologies if someone has already discussed this; I was out of town and am just catching up on things. That said, I'd appreciate comments if anyone disagrees with my analysis; like I said, I don't claim to be an expert.
In a discussion about the information content of the genome, I think that DaveScot has subtly misunderstood some fundamentals of electrical engineering and information theory. I am not an expert in these fields, so I might be wrong, and I'm not trying to be snooty here, but I think I am correct.
Let's start at the beginning: Quote  This is a very simple calculation. DNA is encoded in base4 (4 possible bases ACTG at each locus corresponds to digits 0123). Converting from one number base to another is basic arithmetic learned in prealgebra which IIRC was 7th grade for me but I was in an accelerated math program so it might be 8th grade for most students. 
OK. Maybe we should have started in the middle:
Quote  Proteins (coding genes) are encoded in base64 (20 amino acids plus start/stop codes and much redundancy) in triplets of base4 numerals called codons. 
This is all well and good as far as I know.
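To make the base4/base64 bookkeeping concrete, here is a minimal sketch that reads each codon as a three-digit base-4 number. The digit assignment (A, C, T, G to 0, 1, 2, 3) follows the quote; the function names are my own, and the resulting 0..63 values are just numbers, not a real codon-to-amino-acid table.

```python
# Digit assignment taken from the quoted text: A, C, T, G -> 0, 1, 2, 3.
BASE4_DIGITS = {"A": 0, "C": 1, "T": 2, "G": 3}


def codon_to_base64_digit(codon: str) -> int:
    """Read a three-letter codon as a 3-digit base-4 number (0..63)."""
    if len(codon) != 3:
        raise ValueError("a codon has exactly three bases")
    value = 0
    for base in codon:
        value = value * 4 + BASE4_DIGITS[base]
    return value


def sequence_to_codon_digits(seq: str) -> list:
    """Split a sequence into codons and convert each; trailing bases are dropped."""
    usable = len(seq) - len(seq) % 3
    return [codon_to_base64_digit(seq[i:i + 3]) for i in range(0, usable, 3)]


print(sequence_to_codon_digits("ATGGGGA"))
```

Three base-4 digits and one base-64 digit name exactly the same 64 values, so regrouping the symbols changes the notation and nothing else.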
Quote  This paradigm is not entirely accurate though as frameshifts are often used to encode additional functional proteins using the same sequence and sometimes reading a sequence in reverse (frameshifted or not) yields yet another different biologically active protein. Similar things are done in electronic engineering with regard to multiple methods of data encoding. The nucleic acid sequence can be likened to what’s called a carrier wave. 
Here is where things get weird. I believe that DaveScot is attempting to make the claim that the string of base64 symbols may be multiply encoded with "messages" and that this will increase the amount of "information" in the genome past the 6 gig their calculations show. Let me be explicit here: I believe DaveScot is making the following claim:
(1) "The genome should be understood to be a string of discrete symbols pulled from an alphabet of given size. In general understanding this amounts to 6 or so gig of data. However there may be other *interlaid messages* in this stream of symbols that increase the true amount of information content of the genome past 6 gig." This interpretation agrees with his "the gene can be read backwards" comment above, as well as his description of the code as a "carrier wave" and the subsequent discussion of modulation techniques.
If this is the case (which I believe it is), then DaveScot has subtly confused two very fundamental concepts: "information content" and "channel capacity". Now, DaveScot is of course correct that multiple streams of information can be encoded upon either analog or digital streams of symbols, but he is incorrect in stating that the net information content of a digital stream can exceed the bound set by the number of symbols transmitted. Dave has confused "channel capacity", which is the maximum rate at which information can be transmitted reliably over a noisy channel, with "information content", which here means the maximum amount of information that can be present in any particular stream of symbols (by the way, if you think that your PhD dissertation was good, Shannon's limit and the foundations of information theory were all laid out in a single fucking paper, Shannon's 1948 "A Mathematical Theory of Communication", and his Master's thesis had already founded digital circuit design). Based on information theory it doesn't matter how many secret messages you put into a stream of symbols: the maximum amount of information in a digital stream is N_{symbols sent} \log_2{N_{possible symbols}} bits, which is the same as one base-N_{possible symbols} digit per symbol sent. Therefore the maximal amount of information encoded in the genome is indeed about 6 gig (or meg, or whatever they decided).
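The bound comes from counting: a string of n symbols over an alphabet of k symbols can take at most k^n distinct values, so it can carry at most n \log_2{k} bits. A quick sketch of the arithmetic, where the figure of roughly 3 billion bases is my own assumption for illustration and not taken from the quoted calculation:

```python
import math


def max_information_bits(num_symbols: int, alphabet_size: int) -> float:
    """Upper bound, in bits, on the information in a string of symbols."""
    return num_symbols * math.log2(alphabet_size)


# Assumed figure for illustration only: roughly 3 billion bases.
ASSUMED_BASES = 3_000_000_000

as_bases = max_information_bits(ASSUMED_BASES, 4)         # 2 bits per base
as_codons = max_information_bits(ASSUMED_BASES // 3, 64)  # 6 bits per codon

print(as_bases, as_codons)
```

Counting the string as bases or as codons gives the same ceiling, which is the point: regrouping or re-reading the symbols does not raise it.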
Now note this: I am not entirely sure that my understanding of DaveScot's claim is correct; perhaps he is making the following *different claim*:
(2) "The message in the genome that is encoded in discrete symbols amounts to 6 gig of data, however there may be other *independent streams of information* present in the genome that are not expressed in GATC etc., e.g. the rate of turning of the strands of DNA may contain other information."
This is a fundamentally different claim: it's the equivalent of me writing a message on a piece of paper where the interline spacing varies and contains information. In this case, you can count the number of letters on the paper and still not know the true potential information content of the medium. However, this interpretation does not fit well with DaveScot's remark that the "nucleic acid sequence can be likened to [...] a carrier wave", which to me implies that the information is encoded in the discrete language of the acids, nor with his talk of "reading forwards and backwards" or "at different speeds".
Quote  I suspect there are modulation methods on the DNA carrier that are yet to be discovered. 
But as far as the information content of the genome goes (if we understand it to be a digital code), this doesn't matter: the maximal information present in any digital stream is the number of symbols sent times the log of the number of possible symbols. Period. It is true that we can simultaneously encode multiple streams of data; different error correction codes do exactly that (or shifted versions of the same message, see: turbo codes, etc). But these codes are incapable of encoding more information per transmitted symbol than \log_2{N_{possible symbols}} bits (I am assuming that we are discussing typical measures of information here, and not some measure that is entirely subjective).
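One way to convince yourself: every "overlaid" reading is computed from the same string, so the readings taken together can never distinguish more cases than the string itself does. Here is a small brute-force check, where the three readings (forward, reversed, shifted by one symbol) are hypothetical stand-ins I picked for the frameshift and backwards-reading ideas:

```python
from itertools import product
import math

ALPHABET = "ACGT"


def readings(seq: str) -> tuple:
    """Three hypothetical overlaid readings of one string."""
    return (seq, seq[::-1], seq[1:])


def distinct_reading_sets(length: int) -> int:
    """Count distinct reading-triples over all strings of the given length."""
    strings = ("".join(p) for p in product(ALPHABET, repeat=length))
    return len({readings(s) for s in strings})


n = 4
count = distinct_reading_sets(n)
# The count equals 4**n, so the readings together carry at most 2*n bits.
print(count, 4 ** n, math.log2(count))
```

Adding more readings to the tuple cannot push the count past 4**n, because each triple is fully determined by its underlying string.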
Quote  As the stream of grease emerges it folds. By varying the rate or speed at which the grease comes out it folds differently. The codons are not all translated at the same speed thus even though the same acid is translated from as many as 6 different codons they each have the potential of producing a different fold due to different processing speeds. 
This is excellent and interesting and exciting information, but it does not alter the fact that 6 binary digits can carry at most 6 bits of information, by definition (and likewise for digits in whatever other base). That I can pour water over my staircase at different speeds and get different results doesn't change the amount of information that my stairs have (of course, this is still assuming we are dealing with standard information definitions). In the same way, the fact that I can encoDe A Very sEcret meSsage in a COncaTenation of symbols does not change the fact that the maximum amount of information in a string of letters, for a random example, is Message length \times \log_2{26} bits.
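For the curious, here is a sketch that pulls the capitals out of that phrase and compares the bounds. Treating upper and lower case as distinct symbols is my own assumption; it simply enlarges the alphabet from 26 to 52 letters, and the hidden capitals still sit inside the bound for that larger alphabet.

```python
import math

PHRASE = "encoDe A Very sEcret meSsage in a COncaTenation"


def hidden_capitals(text: str) -> str:
    """Collect the upper-case letters in order of appearance."""
    return "".join(ch for ch in text if ch.isupper())


def max_bits(text: str, alphabet_size: int) -> float:
    """Upper bound in bits for the letters of text, ignoring spaces."""
    letters = sum(ch.isalpha() for ch in text)
    return letters * math.log2(alphabet_size)


print(hidden_capitals(PHRASE))
print(max_bits(PHRASE, 26), max_bits(PHRASE, 52))
```

Counting case adds exactly one bit per letter to the ceiling, and that is all the room the "secret" channel ever has.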
Edit: fixed URL, formatting
