Saturday, April 30, 2016

DNA 101

I recently posted a blog entry (see http://sigma5.blogspot.com/2016/04/positive-identification.html) that talked a good bit about DNA.  For this post I have decided to dive a bit deeper.  The role DNA plays in heredity has been understood for less than a hundred years but there were hints of something way before that.  For thousands of years plant and animal breeders knew that there were certain characteristics of the plants or animals they were interested in that "bred true".  The coat of a housecat, for instance, does not look exactly like the mother's or the father's.  But other attributes do seem to "take after" one parent or the other.  So there seemed to be some mechanism operating that passed some attributes down from parent to child.  The most obvious attribute, species, definitely behaved that way.

The first scientific investigation into what was going on was done by Gregor Mendel in the 1850s and 1860s.  He cheated a little by focusing on a few inheritable traits and ignoring others that did not seem to be inheritable.  But he was able to develop some rules of inheritance.  As an example he figured out how the height (tall or short) of a pea plant was influenced by the height of its parents.  He published his results and no one paid any attention at the time.

But later people did similar investigations and got similar results.  And when they did they also did literature searches that turned up Mendel's old paper.  And all this resulted in the invention of the word "gene" to describe what was going on.  No one knew what a gene was but they were pretty sure such a thing existed.  And that soon led to an investigation of something called chromosomes and that led to an interest in a chemical called DNA.  No one knew how DNA was connected to genes but it was pretty likely there was a connection.  So the next step was to figure out the structure of DNA.

The puzzle was solved in 1953, a little under ninety years after Mendel's paper was published.  Credit officially went to Watson and Crick but many others made critical contributions.  They limit the number of names on a Nobel Prize to three, and the 1962 prize went to Watson, Crick, and Maurice Wilkins.  Rosalind Franklin, whose X-ray images were critical, had died in 1958 and the prize is not awarded posthumously.  Even so, Watson and Crick are the people who are generally given all the credit.  So what is the structure of DNA?  The answer is, as they say, "very interesting".

The Twitter-length answer is "it's a double helix".  So let's see what a double helix is.  First of all, a helix is just a corkscrew shape.  A classic example, which turns out to be helpful in what follows, is a spiral staircase.  You have a central pole.  About six inches above ground on one side this pie shaped thing sticks out.  It is narrow at the pole end and wide at the other end.  And it is flat so you can step on it.  Then there is a similar pie shaped slice sticking out of the central pole about six inches above the first step.  The trick is that it is not exactly straight above the first step.  Instead it is set back a convenient amount.  This lets you step onto the first step then step onto the second step just like you were climbing a flight of stairs.

And it is the same idea with the third step.  It is about six inches above the second step and set back so you can step up to it.  And so we go.  The steps spiral around the central pole.  Eventually a step will end up straight above the original step.  But if things have been done right there is enough head room between these steps.  That means you can ascend the stairs as they spiral up around the central pole without bumping your head.  If you connect the outside ends of the steps with a smooth line you get a corkscrew shape when you look at it in three dimensions.  And there is only one "screw" in the corkscrew so this is a "single helix" design.  But this is not the only possible design.

Let's assume the center post is quite big, say ten feet across.  Then it is easy to arrange the steps so that they spiral up just like before.  But now there is room for two spirals.  Let's say that there is another step exactly across the central pole from each step in the original spiral.  Now you can have two independent spirals.  They each circle up the post.  But you can have people going up using one spiral and people going down using the other spiral.  And, since you have two spirals, you can have people going up and other people going down at exactly the same time.  That's a double helix.

Now let's make another change.  Let's get rid of the central post.  If we connect the outside ends of each step with the one above and the one below it and we make this outside structure strong enough, we don't need any central pillar.  The central pillar is the most common way spiral staircases are made but it is not the only way it can be done.  And now you have DNA.  It is a double helix (two staircases) with an outside support structure instead of the central post.  So that's nice but it is only the start of what makes DNA interesting.

I have never played with Lego blocks but I know what they are and I bet you do too.  The interesting thing about Lego blocks for the purposes of this discussion is that they snap together and they only snap together a few different ways.  There are little pegs and sockets in each piece.  The pegs fit snugly into the sockets if you snap them together correctly.  If there is something else where the socket needs to be then two Lego blocks won't snap together that way.  And if there aren't enough pegs in sockets the Lego construction isn't very strong.  DNA is the same.

With DNA there are four kinds of "blocks".  Each block is a specific chemical with a specific shape.  They snap together but only in a few specific ways.  Each of them is roughly pie shaped with a wider end and a narrower end.  When they are snapped together the wider end is toward the outside and the narrower end is toward the center, just like the steps in a spiral staircase.  And the outside of each block has a shape that can snap into the other blocks.  And the way they snap together is quite strong and requires that the blocks be offset just like the stairs in a spiral staircase.  This is just the structure we need for a "no center post" design.  And with respect to this outside part you can mix and match all four types of blocks any way you want.  As you move along the outside spiral any block type can be followed by any block type.  That's pretty cool.  But now for the really cool part.

The four blocks can also connect together on the inside end.  But with a twist, well actually a flip.  And only certain specific combinations are allowed.  The four building blocks have names (you can easily find them if you care what they are) but mostly they go by the first letter of each name.  The first letters are A, C, G, and T.  And if you turn T upside down it will connect up with A but not any other way.  Similarly, if you flip G upside down it will connect with C but not any other way.

Now one of the interesting attributes of a helix (or in this case a double helix) is that if you flip it upside down it looks the same.  So what you have with a double helix are two helixes spiraling around the same core.  If we call one of them the "up" spiral then the spiral would look just the same if we flipped it upside down.  And the "down" spiral could be flipped upside down and it would then look like the "up" spiral.  In one spiral the "top" of each step faces "up" and in the other spiral the "top" of each step faces "down" but that's the only real difference.

And in fact that's exactly what happens with DNA.  One spiral has all the building blocks right side up.  The backbone around the outside snaps together just fine.  We can have the building blocks occur in any order we want and it will all work fine.  The other spiral has all the building blocks upside down.  The backbone around the outside snaps together just fine.  And, if we ignore the first spiral, we can have the building blocks occur in any order we want.  But it turns out one spiral can't ignore the other one.

If at a certain spot on one spiral we have an A then we must have a T in the spot that is straight across.  The same is true for the other three building blocks.  We must always have a G across from the C, a C across from the G, and an A across from the T.  And this characteristic is magical.  It means that if we know the sequence of the building blocks on one spiral we can figure out with 100% certainty the sequence of the building blocks on the other spiral.  And this is how DNA can be duplicated.
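The "straight across" rule is simple enough to write down as a few lines of Python.  This is only a sketch of the pairing rule described above.  The names PAIR and partner_strand are made up for illustration, and the example sequence is arbitrary.

```python
# Pairing rule from the text: A sits across from T, and C sits across from G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_strand(strand: str) -> str:
    """Return the letters that sit straight across from each letter of `strand`."""
    return "".join(PAIR[letter] for letter in strand)


example = "ATGCCGTA"  # arbitrary made-up sequence
print(example)                  # ATGCCGTA
print(partner_strand(example))  # TACGGCAT
```

Applying the rule twice gets you back to where you started, which is another way of saying that each spiral fully determines the other.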

The process goes like this.  The two spirals are unzipped.  Then each spiral is processed separately.  You just run down the building blocks.  For an A you turn a T upside down and snap it in on the other side.  For a C you snap in an upside down G and so on.  For the "up" spiral the "down" spiral is rebuilt.  And for the "down" spiral the "up" spiral is rebuilt.  The result is two absolutely identical DNA molecules.  Each new DNA molecule has the same spirals with the same sequences of building blocks as the original DNA molecule had.  Once the structure of DNA was determined this other stuff was figured out pretty quickly.  But it turns out to be only the start of the fun.  So let's move on to the next cool thing.
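Here is the unzip-and-rebuild idea as a rough Python sketch.  It treats a DNA molecule as nothing more than a pair of strings, which leaves out all of the real chemistry.  The function names and the example sequence are invented for illustration.

```python
# Pairing rule: A sits across from T, and C sits across from G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def rebuild_partner(strand: str) -> str:
    """Snap in the matching block across from every block of `strand`."""
    return "".join(PAIR[letter] for letter in strand)


def replicate(molecule):
    """Unzip an (up, down) pair and rebuild the missing partner of each spiral."""
    up, down = molecule
    copy_one = (up, rebuild_partner(up))      # old "up" spiral, new "down" spiral
    copy_two = (rebuild_partner(down), down)  # new "up" spiral, old "down" spiral
    return copy_one, copy_two


original = ("ATGCCGTA", "TACGGCAT")  # made-up molecule that obeys the pairing rule
first, second = replicate(original)
print(first == original, second == original)  # True True
```

Each copy keeps one old spiral and gets one freshly built spiral, yet both copies carry exactly the same text as the original.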

DNA is a very efficient way to store data.  And in a lot of ways that's what it is, a string of text.  Now the text has an alphabet of four characters, A, C, G, and T.  So it's not binary.  It's quaternary (base 4).  Okay, how far can we get with a four character alphabet?  Not very far but there's a trick.  (There's always a trick.)  A byte is a group of 8 bits.  So, while a bit can only have two values, a byte can have two hundred and fifty six values.  That's a lot more interesting.  DNA doesn't do exactly the same thing but the idea is similar.  Let's group three of these characters together.  Each of the three positions can hold any of the four characters, so there are 4 x 4 x 4 = 64 different combinations.  And that's what the cellular mechanisms that process DNA do.
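If you want to check the counting, a few lines of Python will list every three-character group.  This is just the arithmetic from the paragraph above spelled out; the variable names are mine.

```python
from itertools import product

ALPHABET = "ACGT"

# Every possible group of three characters drawn from the four-letter alphabet.
triples = ["".join(group) for group in product(ALPHABET, repeat=3)]

print(len(triples))  # 64, which is 4 * 4 * 4
print(triples[:5])   # ['AAA', 'AAC', 'AAG', 'AAT', 'ACA']
print(2 ** 8)        # 256, the number of values a byte can hold
```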

The whole process is like reading a telegraph message.  A short click is a dot.  A longer click is a dash.  A longer than normal pause between clicks indicates that we have come to the end of a letter.  Originally Morse Code, the language of telegraphs, had no punctuation.  So the word STOP was used to indicate a period.  Telegraph operators were just supposed to figure out where the word breaks went.  And telegraph messages were in ALL CAPITAL LETTERS and contained no punctuation, hence the need for STOP.  It worked and after a little experience telegraph operators usually got the word breaks right.  Readers and writers of telegrams just had to deal with the lack of lower case letters and punctuation as best they could.

The process of processing DNA has the same kinds of problems but without an intelligent telegraph operator to help out.  So let's ignore some of the above issues for the moment and focus on the message.  We have 64 "letters" in our alphabet to play with.  And mostly what they are used for is to represent an amino acid.  There are only 20 amino acids that this system is normally used with so it should only take 20 of our 64 letters to uniquely represent them all.  What's the story with the rest of the letters?  It turns out that several letters translate to the same amino acid.  In the end, 61 of the 64 possible letters get used for this purpose.  And here's the cool part.  A whole lot of the chemicals our bodies use are actually just a bunch of amino acids strung together in a specific sequence.  So DNA literally "codes" for molecules that consist exclusively of a specific string of amino acids.  This "letters to amino acids" process goes by the name of the "genetic code".
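To see how the 64 letters get shared out, here is a sketch that builds the standard genetic code table in the compact form textbooks often use and then counts things.  Each amino acid is shown by its usual one-letter abbreviation and "*" marks a letter that codes for no amino acid.  The standard table lists 20 amino acids; a couple of rare special cases are left out.  The variable names are my own.

```python
from collections import Counter
from itertools import product

# Standard genetic code, written in the order T, C, A, G for each position.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(triple): amino
    for triple, amino in zip(product(BASES, repeat=3), AMINO)
}

usage = Counter(GENETIC_CODE.values())
no_amino_acid = usage.pop("*")  # letters that do not stand for any amino acid

print(len(GENETIC_CODE))       # 64 letters in total
print(no_amino_acid)           # 3
print(len(usage))              # 20 different amino acids
print(usage["L"], usage["W"])  # 6 1
```

The last line shows how uneven the sharing is: one amino acid (L, leucine) can be spelled six different ways while another (W, tryptophan) has only one spelling.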

There is a bunch of machinery in our cells that can read off a particular string of building blocks, translate that into a list of amino acids, and build a molecule that consists of that exact sequence of amino acids.  This process can be and is used to build literally tens of thousands of complex organic molecules that are used for a bewildering number of purposes by one cell or another in our bodies.  It can't be that simple can it?  Of course not.  Let's look at some of the simpler wrinkles first.

As I said above, telegraph operators are good at figuring out where the word breaks go in a message.  Lacking telegraph operators a different mechanism is employed.  These three letter groups are called "codons".  There is a special "start" codon and there are three special "stop" codons.  So the translation process scans the DNA spiral until it finds a start codon.  The start codon does double duty: it also codes for an amino acid, so that one goes first.  Then the process takes the next block of three letters, uses the genetic code to figure out which amino acid comes next, and hooks it up to the first one.  Step by step it keeps adding amino acids until it hits a stop codon.  Then it stops.  This is what a lot of the genetic machinery in the cell does.  But that's not the end of the list of problems.
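The scan, read, and stop routine can be sketched in Python.  This is a toy version under some loud assumptions: it reads DNA letters directly the way this post describes things, it uses ATG as the start codon (which in the standard code also stands for the amino acid M, methionine, so the sketch counts it as the first amino acid), and the input sequence is made up.

```python
from itertools import product

# Standard genetic code table; "*" marks the three stop codons.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(triple): amino
    for triple, amino in zip(product(BASES, repeat=3), AMINO)
}
START = "ATG"


def translate(dna: str) -> str:
    """Scan for the start codon, then read three letters at a time until a stop."""
    start = dna.find(START)
    if start == -1:
        return ""  # no start codon, so nothing gets built
    chain = []
    for i in range(start, len(dna) - 2, 3):
        amino = GENETIC_CODE[dna[i:i + 3]]
        if amino == "*":  # stop codon reached
            break
        chain.append(amino)
    return "".join(chain)


# Made-up sequence: two junk letters, then ATG GCT TGG TAA, then two more letters.
print(translate("CCATGGCTTGGTAAGG"))  # MAW
```

Notice that the letters before the start codon and after the stop codon are simply ignored, which is the job the telegraph operator did by judgment.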

I remember asking a molecular biology student many years ago which is the primary helix and which is the secondary one?  He didn't know the answer but I now do.  There is no primary or secondary helix.  This molecule building mechanism processes both helixes just the same.  Can this present problems?  It sure can.

Remember those gene things I mentioned above?  Well, we now know that a gene is just a string of DNA.  Things are a little more complicated than the "string together amino acids" process I outlined above.  But the idea is pretty much the same.  With genes there is a longer sequence called a "5'" (five-prime - Why is it called that?  I don't know) sequence.  It marks the beginning of a gene.  There is another longer sequence called a "3'" (three-prime - Why is it called that?  Again I don't know) that marks the end of the gene.  And lots of genes just code for a protein.  And a protein is just a specific string of amino acids.  And the above mechanism is used to translate the DNA information into the recipe for the protein.  So can't things get confused between one DNA spiral and the other?  Yes.

It is possible for a DNA sequence on one spiral to be part of a gene while the DNA sequence on the other side of the exact same piece of DNA is part of a completely different gene.  This generally works out.  Scientists really don't know why or how.  For one piece of DNA there is this sequence of letters on one spiral that translates to whatever it translates to.  Yet the other complementary spiral is also used in the exact same way to translate into whatever it translates to.  This all somehow works.  It shouldn't but it does and scientists only barely understand a very little bit about why it works.  Scientists know there is a vast amount they don't know about all this.  And the more they learn the more stuff pops up that they don't know about.  Moving on . . .
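To get a feel for how different the two readings of one piece of DNA can be, here is a small sketch.  It assumes the other spiral is read in the opposite direction, which is what the "up" and "down" staircases suggest.  The helper names and the nine letter sequence are made up.

```python
# Pairing rule: A sits across from T, and C sits across from G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def other_spiral_as_read(strand: str) -> str:
    """Letters of the opposite spiral, in the order they would be read (reversed)."""
    return "".join(PAIR[letter] for letter in reversed(strand))


def split_into_codons(strand: str):
    """Chop a strand into three letter groups, dropping any leftover letters."""
    return [strand[i:i + 3] for i in range(0, len(strand) - 2, 3)]


one_side = "ATGGCTTAA"  # made-up stretch of DNA
other_side = other_spiral_as_read(one_side)

print(split_into_codons(one_side))    # ['ATG', 'GCT', 'TAA']
print(split_into_codons(other_side))  # ['TTA', 'AGC', 'CAT']
```

Same piece of DNA, two completely different lists of codons, so each side would spell out a different chain of amino acids.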

Scientists have known about this whole "code for a protein" business for a while now.  And they have figured out how to tell which DNA parts are involved in this "code for a protein" business.  And it turns out to be about 2% of all the DNA we have.  So what's with the other 98%?  Scientists originally called this "junk" DNA.  It looked like it was just taking up space without doing anything useful.  But then there was this whole mutation thing.

The DNA replication business I explained above works miraculously well.  But it turns out that there are 3 billion letters worth of DNA in each of us.  It doesn't take a very high error rate to mess things up.  A "one in a million" error rate would result in 3,000 mutations.  If that sounds like a scary large number it's because it is.  And mutations can cause very bad things if they happen in an important sequence.  But junk DNA is unimportant, right?  That sounded like a good theory but it didn't pan out.
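The arithmetic is easy to check, and it is worth trying a few other error rates to see how good the copying has to be.  The 3 billion figure and the one in a million rate come from the paragraph above; the other rates are just ones I picked for comparison.

```python
GENOME_LETTERS = 3_000_000_000  # roughly 3 billion letters, as stated above


def expected_errors(one_error_per: int) -> int:
    """Copying errors per full copy if one letter in `one_error_per` goes wrong."""
    return GENOME_LETTERS // one_error_per


print(expected_errors(1_000_000))      # 3000 (the "one in a million" case)
print(expected_errors(1_000_000_000))  # 3 (one in a billion, for comparison)
print(expected_errors(1_000))          # 3000000 (one in a thousand, for comparison)
```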

You can compare the DNA of multiple individuals.  Or you can compare the DNA of people to the DNA of pigs or even plants.  Now I am NOT talking about actually inserting DNA from say a pig into a person.  I'm just talking about looking at pig DNA and people DNA to see what's different and what's the same.  And it turns out that there are large chunks of DNA that are the same from person to person or person to pig or even person to plant.  This is called "conserved" DNA.  Take a string of DNA that is the same in people and plants.  It must do something pretty damn important or it wouldn't be the same in both places.  Now what if this DNA was junk DNA?  If we understand junk DNA correctly it should be able to mutate like crazy without hurting a thing so it should not be the same in people and plants.  Scientists concluded that a lot of "junk" DNA was not junk because it was conserved.
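The comparison idea can be sketched in a few lines of Python.  This toy version assumes the two sequences have already been lined up position by position, which is a big assumption since real comparisons have to do that lining up first.  Both sequences are invented for illustration and do not come from any real organism.

```python
def conserved_runs(seq_a: str, seq_b: str, min_length: int = 5):
    """Return stretches of at least `min_length` letters that match in both."""
    runs = []
    run_start = None
    end = min(len(seq_a), len(seq_b))
    for i in range(end):
        if seq_a[i] == seq_b[i]:
            if run_start is None:
                run_start = i  # a matching stretch begins here
        else:
            if run_start is not None and i - run_start >= min_length:
                runs.append(seq_a[run_start:i])
            run_start = None
    # Handle a matching stretch that runs all the way to the end.
    if run_start is not None and end - run_start >= min_length:
        runs.append(seq_a[run_start:end])
    return runs


person = "ACGTTGCATGCCGTAAGT"  # made-up sequence
plant = "TACGTGCATGCCGTCCTA"   # made-up sequence
print(conserved_runs(person, plant))  # ['TGCATGCCGT']
```

The letters on either side of the shared stretch have drifted apart while the middle has stayed put, which is the pattern that marks a stretch as "conserved".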

So if it is important, how is it important?  Let me take an apparent digression that really isn't a digression.  The DNA in every cell of your body is the same.  So why is a muscle cell different than a brain cell different than a skin cell different than . . .  ?  It turns out there is a whole gene regulatory system.  Your muscle cell is a muscle cell because certain genes are turned on in that cell and many others are turned off.  The same is true for the other cell types.  Each has the same genome.  But each has a different pattern of which genes are turned on (activated) and which are turned off.  Scientists really don't understand how this gene regulation business works but they know it involves what is no longer called junk DNA but is now called "non coding" (because it doesn't code for a protein) DNA.  There is lots going on with this non coding DNA but scientists are just starting to figure it out.  They know they have a long ways to go.

But wait, there's more.  Theoretically, all you inherit from your parents is some DNA, right?  Wrong.  There are ways for your mother to influence some settings for this regulatory system that do not involve DNA.  Scientists have gotten to the point where they know this is happening but they don't know how it is happening.

Finally, there is reason to believe that cancer is a set of diseases whose common element is that the regulatory system for gene expression goes haywire somehow.  This often causes cells to reproduce like crazy (cancerous growth).  It also seems to drastically increase the mutation rate of affected cells.  This seems like bad news but it is actually good news.  A lot of cancers may turn out to actually be the same.  A large group of seemingly unrelated cancers could all be caused by the same problem with the gene regulatory system.  If this is true and if a fix for whatever the problem is can be developed then this one fix can be successfully applied to treat all the cancers in the group.

In the early '70s when the War on Cancer was first declared it was thought that there were something like 50 different cancers.  Now we can identify many thousands of cancers that are different in some large or small way from other cancers.  Developing many thousands of cures each applicable to only a small group of people because it is only known to be effective for one very specific cancer sounds depressingly difficult.  But if there are a relatively small number of underlying causes for most cancers then only a few cures could make a tremendous difference.  That is the best news cancer researchers and their patients have had in many decades.



1 comment:

  1. 5' and 3' are just the way scientists label the building blocks of the outside of the staircase. With naming in organic chemistry you need to be able to tell which carbon in the ring is connected to something else so we just count the carbon atoms in the ring. Wikipedia is pretty good with more info if you want a deeper understanding. :)
    https://en.wikipedia.org/wiki/Directionality_(molecular_biology)

    If you watch the movie Gattaca it has a staircase in the apartment because DNA is so very similar to a staircase. It's such a great metaphor for it.
