dogdidit
Posts: 315 Joined: Mar. 2008
|
Quote (slpage @ Aug. 27 2008,13:57) | Quote (dogdidit @ Aug. 27 2008,13:17) | Actually, it is very compressible if it is a DNA sequence, since codons (triplets of base pairs) code for only 22 possible states - start, stop, and twenty amino acids - even though the symbol set could accommodate 64. So the real measure of information in DNA is no more than 4.5 bits (log2 of 22) for every three base pairs, not 6 bits (log2 of 64). |
Yes, but isn't the compression you write of 'conceptual' (I can't think of a better word)? Sure, you can run a computer file through a compression algorithm and all that, but DNA is physical - more akin to trying to 'compress' a CD as opposed to the 'information' ON the CD, if my point is making any sense. |
The OP spoke about using bits to encode the nucleotides: Quote (goalpost @ Aug. 27 2008,12:21) | Both messages contain a human DNA sequence - ACGT etc etc, each letter coded as two bits, ie 00 = A, 01 = C, 10 = G, 11 = T. |
...so I was responding to that. I would agree that compressing functional DNA does not seem possible. Perhaps a very large steam press...
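For anyone who wants to check the arithmetic, here is a minimal sketch of the two-bit coding from the OP alongside the per-codon figures I quoted. The names `BASE_TO_BITS` and `encode` are made up for illustration, and the 22-state count is just the one from my earlier post, taken at face value.

```python
import math

# Two-bit coding from the OP: 00 = A, 01 = C, 10 = G, 11 = T.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq):
    """Return the bit string for a DNA sequence under the two-bit coding."""
    return "".join(BASE_TO_BITS[base] for base in seq)

# Raw capacity of one codon: three bases at two bits each, i.e. log2(64).
raw_bits_per_codon = math.log2(4 ** 3)

# Upper bound if a codon only distinguishes 22 equally likely states
# (the count used in the post above; an assumption, not a measurement).
bound_bits_per_codon = math.log2(22)

print(encode("TAC"))
print(raw_bits_per_codon, round(bound_bits_per_codon, 2))
```

The gap between the two numbers is the redundancy a compressor could in principle squeeze out of the bit file, which is all I meant by "compressible".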
Quote | Quote | Quote | Message one's sequence codes for a protein. Message two's contains junk DNA.
Does message 1 contain more information? |
Difficult question. What you're asking is how much entropy (uncertainty) there is in the sequence of amino acids (our message set) in the proteins that make up the human proteome. Are some amino acids rarer than others? Are some amino acid sequences more likely than others? If the answer is yes, then the entropy of the source will be less than that of a source whose symbols have equal probability. That would reduce the information content from 4.5 bits per codon to something less.
Junk DNA, assuming it is not under selection pressure (else why would it be "junk"?), would be likely to accumulate mutations more rapidly than DNA related to the proteome, yes? Those mutations should help to "shuffle the deck" and over time one would expect the symbol set to drift toward equiprobability. (But never quite get there - equally random sequences of base pairs do not code for equally random sequences of amino acids.) So my guess is that yes, the junk DNA has more information (as defined by information theory) than DNA that codes for proteins. |
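A rough sketch of the parenthetical point above, assuming the standard genetic code and perfectly equiprobable codons (both assumptions for illustration, not measured frequencies). The helper `shannon_entropy` and the table layout are my own. Note that in this table the start codon ATG is simply methionine, so it yields 21 distinct symbols rather than the 22 counted earlier.

```python
import math
from collections import Counter
from itertools import product

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# How many of the 64 codons map to each symbol (amino acid or stop).
counts = Counter(CODON_TABLE.values())

# Symbol distribution induced by equiprobable codons.
induced = [n / 64 for n in counts.values()]

h_codons = shannon_entropy([1 / 64] * 64)                   # uniform codons
h_equal_symbols = shannon_entropy([1 / len(counts)] * len(counts))
h_induced = shannon_entropy(induced)                        # what uniform codons give

print(len(counts), h_codons, round(h_equal_symbols, 3), round(h_induced, 3))
```

Because some symbols get six codons and others only one, the induced entropy comes out below the equal-probability figure even when the bases themselves are as random as they can be.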
OK, so while we are discussing hypotheticals, how about this one.
Two DNA sequences, both 1000 bps long, both identical with one exception - one sequence starts with TAA instead of TAC. The 'functional sequence' has a measured information content of (just tossing out a number here to make it simple) 1000. Would the non-functional sequence have a content of 999 or 0? |
1000. That assumes a C is as likely as an A. Functionality ("semantic content") is irrelevant.
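To make the "assumes a C is as likely as an A" caveat concrete, here is a small sketch under an independent, per-base probability model. The model, the short stand-in sequences, and the skewed frequencies are all hypothetical, chosen only to show the effect.

```python
import math

def self_information_bits(seq, base_probs):
    """Shannon self-information of seq, treating each base as an independent draw."""
    return -sum(math.log2(base_probs[base]) for base in seq)

# Short stand-ins for the two 1000 bp sequences: identical except TAC vs TAA.
functional = "TAC" + "ACGT" * 3
broken = "TAA" + "ACGT" * 3

# Equiprobable bases: every base costs exactly 2 bits, whatever it spells.
uniform = {base: 0.25 for base in "ACGT"}

# Hypothetical skewed source in which C is rarer than A.
skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

print(self_information_bits(functional, uniform),
      self_information_bits(broken, uniform))
print(self_information_bits(functional, skewed),
      self_information_bits(broken, skewed))
```

Under the uniform model the two totals are identical; under the skewed one they differ only because the swapped base has a different probability, not because one sequence "works" and the other does not.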
@Turncoat: yep, I am using Shannon's definition (and thanks for not mentioning my errors).
-------------- "Humans carry plants and animals all over the globe, thus introducing them to places they could never have reached on their own. That certainly increases biodiversity." - D'OL
|