Thursday, September 06, 2012

The ENCODE Data Dump and the Responsibility of Science Journalists

ENCODE (ENcyclopedia Of DNA Elements) is a massive consortium of scientists dedicated to finding out what's in the human genome.

They published the results of a pilot study back in June 2007 (ENCODE, 2007) in which they analyzed a specific 1% of the human genome. That result suggested that much of our genome is transcribed at some time or another or in some cell type (pervasive transcription). The consortium also showed that the genome was littered with DNA binding sites that were frequently occupied by DNA binding proteins.

All of this suggested strongly that most of our genome has a function. However, in the actual paper the group was careful not to draw any firm conclusions.
... we also uncovered some surprises that challenge the current dogma on biological mechanisms. The generation of numerous intercalated transcripts spanning the majority of the genome has been repeatedly suggested, but this phenomenon has been met with mixed opinions about the biological importance of these transcripts. Our analyses of numerous orthogonal data sets firmly establish the presence of these transcripts, and thus the simple view of the genome as having a defined set of isolated loci transcribed independently does not seem to be accurate. Perhaps the genome encodes a network of transcripts, many of which are linked to protein-coding transcripts and to the majority of which we cannot (yet) assign a biological role. Our perspective of transcription and genes may have to evolve and also poses some interesting mechanistic questions. For example, how are splicing signals coordinated and used when there are so many overlapping primary transcripts? Similarly, to what extent does this reflect neutral turnover of reproducible transcripts with no biological role?
This didn't stop the hype. The results were widely interpreted as proof that most of our genome has a function and the result featured prominently in the creationist literature.

I don't blame science journalists for this. Lots of scientists also used the ENCODE result in 2007 to attack junk DNA. They honestly felt at the time that if a sequence was transcribed, no matter how rarely, it must have a function. They honestly felt that if a DNA binding protein bound to a piece of DNA then that site had a function.

Other scientists expressed skepticism over the interpretation of the ENCODE pilot project result. Some of them even disputed the data by showing that different techniques gave a different result on the pervasiveness of transcription. The most famous of these papers is the one from my colleagues here at the University of Toronto, Ben Blencowe and Tim Hughes (van Bakel et al. 2010). There was lots of activity in the blogosphere as well [Pervasive Transcription].

The bottom line is that after five years of debate and discussion it is well established that just because a fragment of DNA is transcribed does not mean that it has a function. Transcription could be accidental and the product could be junk RNA [Useful RNAs?] [What is a gene, post-ENCODE?] [Junk RNA]. We now know How to Evaluate Genome Level Transcription Papers.

I'm not saying the issue is settled, although I strongly favor the idea that most of our genome is junk. What I'm saying is that in spite of the hype in 2007 the supporters of junk DNA have made a good case and this is still a legitimate scientific controversy.

We have also pointed out that just because a site is occupied by a DNA binding protein does not mean that it is functional. In fact, once you understand how DNA binding proteins work you expect many of them to be sitting nonproductively at sites that resemble the actual functional binding site [DNA Binding Proteins] [Slip Slidin' Along - How DNA Binding Proteins Find Their Target]. It has been widely known since 1976 that the problem with large genomes is that they soak up DNA binding proteins that are binding nonspecifically to DNA (Yamamoto and Alberts, 1976). This is not controversial, if you know what you're talking about.
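
To see why nonproductive binding sites are expected in a large genome, here is a rough back-of-the-envelope sketch. It rests on simplifying assumptions that are not taken from any of the papers cited here: the four bases are treated as equally frequent and independent, the binding site is treated as one exact sequence of fixed length, and the genome size is rounded to three billion base pairs. The function name and the motif lengths are hypothetical, chosen only for illustration.

```python
# Rough estimate of how many times a short DNA motif is expected to occur
# by chance in a large genome. Assumptions (illustrative only, not from the
# cited papers): the four bases are equally frequent and independent, the
# motif is a single exact, non-palindromic sequence, and both strands count.

GENOME_SIZE_BP = 3_000_000_000  # rounded: "nearly three billion bases"


def expected_chance_sites(motif_length: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Expected number of exact matches to one motif of the given length."""
    if motif_length < 1:
        raise ValueError("motif_length must be at least 1")
    positions = genome_size - motif_length + 1  # possible start positions on one strand
    per_position = 0.25 ** motif_length         # chance of an exact match at one position
    return 2 * positions * per_position         # factor of 2 covers both DNA strands


if __name__ == "__main__":
    for n in (6, 8, 10, 12):
        print(f"{n:2d} bp motif: about {expected_chance_sites(n):,.0f} chance sites")
```

Even under these crude assumptions, an 8 bp motif is expected to turn up tens of thousands of times by chance alone, so a protein that recognizes a short sequence will find plenty of places to sit that have nothing to do with regulation.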

Now comes the followup ENCODE study extended to cover (almost) the entire genome. The results are published in 30 papers, several of them in a single issue of Nature (Sept. 6, 2012) [Nature ENCODE: Research Papers]. I haven't read all the papers but my first impression is that there's not much that's new except that the dataset is now more complete. Here's what the consortium members say in the abstract [An integrated encyclopedia of DNA elements in the human genome].
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
Naturally this was interpreted by science journalists as proof that most of our genome isn't junk. Examples include, unfortunately, Ed Yong [ENCODE: the rough guide to the human genome], Fergus Walsh of the BBC [Detailed map of genome function], and Gina Kolata of The New York Times [Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role].
UPDATE: Ryan Gregory has collected a bunch of articles in the popular press: The ENCODE media hype machine.
At least one science journalist has put his interpretation on video. Here's Ian Sample of The Guardian [What the Encode project tells us about the human genome and 'junk DNA' - video]. You really need to watch it to see the extent of the problem. I wonder how long this will stay up?

This is 2012. A simple Google search will reveal that the concept of junk DNA is still alive and well. A search like that will also reveal the problems with interpreting the ENCODE result since we've had years of debate over the initial pilot study. There's no excuse for this kind of sloppy journalism.

Science journalists have been badly burned several times in the past few years. Surely they should know by now that a single paper on a new fossil won't overthrow our understanding of human evolution [Good Science? Bad Science Journalism?] nor will a single paper on arsenic in DNA make me rewrite my textbook. Science doesn't work that way. A single study won't cause us to entirely re-think our concept of the genome even if it's in thirty papers in Nature.

Responsible science journalists should have dug deeper to find out whether the new ENCODE data was any better than the earlier data and whether their interpretation of the results is being widely accepted in the scientific community. They don't have an excuse this time.

[The scientists who wrote the paper and the scientists who reviewed it will get theirs in a separate post.]


The ENCODE Project Consortium (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799-816. [doi:10.1038/nature05874]

van Bakel, H., Nislow, C., Blencowe, B. and Hughes, T. (2010) Most "Dark Matter" Transcripts Are Associated With Known Genes. PLoS Biology 8: e1000371 [doi:10.1371/journal.pbio.1000371]

Yamamoto, K.R. and Alberts, B.M. (1976) Steroid Receptors: Elements for Modulation of Eukaryotic Transcription. Ann. Rev. Biochem. 45:721-746. [

34 comments:

  1. The Half Crazed Cleaning Lady from Sector &%@Thursday, September 06, 2012 12:53:00 PM

    Larry, the real world of information storage systems guarantees that there will be an accumulation of 'junk' information, be it DVDs, magnetic tape or DNA. Natural selection, however, will tend to impose a net deletional bias on trash DNA because of the energy/fitness costs in carrying it around and transcribing it. It would be surprising, therefore, if it was the norm for various life forms to carry around huge amounts of trash code in their DNA. Occasional examples as anomalies, maybe, but we would predict that it should not be the norm. Add to that what we are learning about the regulatory part of the genome. It is looking more and more like a gigantic piece of software that runs numerous sections in parallel, each module providing inputs to the other sections so that the whole output is guided by many constantly changing variables. I think Venter has started to move 19th century Darwinian biology out of the dark ages of the 20th century and into the 21st century when he recently stated, "All living cells that we know of on this planet are 'DNA software'-driven biological machines comprised of hundreds of thousands of protein robots, coded for by the DNA, that carry out precise functions," (New Scientist, July 13, 2012). My point is that in a complex software package like what is encoded into the human genome, it should not be at all surprising if the trend is to discover that more and more of it is actually used at some point, even if it is only a redundant back up system and the 'junk' portion is smaller than we thought, albeit there will always be errors and junk in real life. The trend in science is that there is a lot less junk than we thought and, maybe, more use for seemingly useless stretches of DNA than we had surmised. True, this is a prediction of those scientists who suspect that intelligence was involved in the programming of life, but diminishing 'junk' is also a prediction of natural selection. So .... why fight it?

    Replies
    1. Cleaning Lady: "It [the genome] is looking more and more like a gigantic piece of software that runs numerous sections in parallel, each module providing inputs to the other sections so that the whole output is guided by many constantly changing variables."

      That's begging the question. The genome certainly does not look like "a gigantic piece of software that runs numerous sections in parallel." That is exactly what you need to prove. You're assuming what you need to prove. The ENCODE results have not moved us much closer to that -- they found more regulatory elements, so that might be 8 or 9% of the genome.

      At no point have you addressed any of the POSITIVE arguments for junk DNA, which have been around for decades and which Larry has repeated over and over.

      Cleaning Lady: "The trend in science is that there is a lot less junk than we thought..."

      "The trend." Like the trend that North America is approaching China via plate tectonics, at about the same speed.

      "A lot less." Way to be quantitative. What next? "Much DNA"?

    2. It is looking more and more like a gigantic piece of software that runs numerous sections in parallel, each module providing inputs to the other sections so that the whole output is guided by many constantly changing variables.

      Not at all. In fact, the results point in exactly the opposite direction - the genome is a giant stochastic mess and it will be extremely difficult to understand it in terms of computer analogies.

    3. @Cleaning Lady,

      The presence or absence of junk DNA is not something that you can simply deduce because you want our genome to look like a manufactured storage system. Nor can you deduce what our genome should look like by expressing a faulty understanding of evolution.

      There's real science that you have to deal with. When would you like to start?

    4. She's a cleaning lady, we shouldn't be so hard on her.

    5. @Cleaning lady: you are correct in that this is not about science vs ID. It's about which of adaptation and drift is dominant at the molecular level. A key unknown quantity is what percentage of molecular changes that rise to fixation are adaptive - we expect the amount of junk DNA to be small if, and only if, this percentage is high.

  2. What is up where all comments are italicized?

  3. What is up where all comments are italicized?

    The first poster probably forgot to close his italics. The problem is it does not allow me to insert a second closing tag in my post.

  4. Dear Prof

    What are you trying to defend? From the paper;

    "The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is
    unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription,
    transcription factor association, chromatin structure and histone modification. These data enabled us to assign
    biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many
    discovered candidate regulatory elements are physically associated with one another and with expressed genes,
    providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical
    correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation.
    Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an
    expansive resource of functional annotations for biomedical research."

    http://www.nature.com/nature/journal/v489/n7414/pdf/nature11247.pdf

    The other week you called me a liar about this. If the data vindicates me, tell me, Prof, am I still the liar?

    Regards

    Replies
    1. The data doesn't vindicate you. That's the fucking point here. Please, PLEASE show the research you did that establishes what function these newly discovered transcribed regions have.

      Oh, you didn't do any such research, you say? Then pardon me, but how the fuck do you know they're functional?

      Let's get this one down, again: Transcribed =/= "functional".

    2. The other week you called me a liar about this. If the data vindicates me, tell me, Prof, am I still the liar?

      If the interpretation of the data is correct then the implications are enormous and I was totally wrong about the amount of junk DNA in our genome.

      We really would have to re-write the textbooks because the new interpretation tells us that most of the DEFECTIVE transposons in our genome have a function and most of the sequences in introns are functional. The interpretation also tells us that a typical gene requires about 10,000 bp of regulatory sequences to control its expression.

    3. Andre,

      Unfortunately for you, science is not done by quoting what the authors of some paper said, but rather by looking at the data and checking if such data supports the quoted stuff. Creationists don't know this. IDiots don't know this. Scientists, if they do not know, they should know.

      So, please, check the data and definitions in the paper.

    4. Andre,

      Are you prepared to weigh up the pros and cons of the various arguments regarding the use of 'function' in this paper? You seem mightily eager to accept this report unquestioningly, in a way that I'd be betting you would not for papers that pointed the other way.

      Many people are skeptical of the interpretation of these results - not because 'junk' buttresses a worldview, but because there are good reasons for supposing that
      a) junk - non-essential sequence - is a real category
      b) Particularly in eukaryotes, it is a substantial fraction.

      So ... victory-dances are premature.

      The problem remains of explaining the wide variation across the eukaryotes in the proportion of noncoding to coding DNA - even within the same genus or species. If it all has a 'function' (by virtue of showing up in an assay), why do some need so much more of it than others?

      Only about 5% of the human genome shows a lower-than-neutral substitution rate. This suggests that, whatever its 'function', the other 95% does not depend on sequence.

      Although the opening up of chromatin by transcription of whatever lies within may truly be a sequence-independent 'function', with genuine phenotypic consequences, one really has to explain why some species devote 95% of their genome to this function, others a mere 30% or less. It's like saying the 'function' of the 10,000 feet of cabling I have used to attach my headphones to the iPod in my pocket is to enable me to listen to the music!

  5. Larry,

    What if you were to start doing your job as a biologist and try to find out what DNA does, instead of declaring it junk?

    Replies
    1. Funny stuff.
      May I also suggest that people who have no idea whether it's junk stop declaring it's functional simply because it's transcribed?

    2. What if you were to start doing your job as a biologist and try to find out what DNA does, instead of declaring it junk?

      What if the authors of the ENCODE project started doing their jobs as biologists and tried to find out what all that transcribed RNA does, and what all those transcription factor binding sites do, instead of just ASSUMING they are non-junk?

  6. The Half Crazed Cleaning Lady from Sector &%@Thursday, September 06, 2012 2:11:00 PM

    The comments by Diogenes, Georgi Marinov, and Moran are classic responses from people who have their heads firmly planted in the sand and refuse to go where the science points. Time to pull those heads out of those dark holes and either retire, or move into the 21st century. Your Archie Bunker 'it's junk DNA so no need to worry about it' approach is a real science-stopper. Kinda like a stone-age tribesman finding a cellphone and, since it fits no function so far as hunting and fishing goes, he chucks it into the river and continues to forage for grubs.

    Replies
    1. I have read this blog many times. Not once has Larry declared "so no need to worry about it."

    2. I'm most definitely not due for retirement - I haven't even gotten my PhD yet :)

      P.S. If it serves some very useful function, then how exactly would you explain the elaborate mechanisms that organisms have for keeping transposons silent?

    3. Your Archie Bunker 'it's junk DNA so no need to worry about it' approach is a real science-stopper.

      Back that up with evidence, or shut your lie-hole. Idiot can't prove that with any evidence. Egomaniac asshole complimenting herself for her supposed superior intelligence. Fuck you, egomaniac narcissist.

      It's like the Inquisition of Galileo calling heliocentrism a "science-stopper" and complimenting themselves on their superior intelligence. Narcissistic egomaniacs.

      You compliment yourself on your superior intelligence. OK genius, please answer the simple, simple, simple, basic, undergrad level, simple questions below. Answer them, and grace us with your advanced intellect.

      Name ten nucleotides of non-coding DNA (out of 3 billion in the human genome) with a novel function discovered by creationists or ID proponents. Name just 10.

      Lots of functional regions have been discovered by evolutionists using evolutionary assumptions. Name ten nucleotides of non-coding DNA (out of 3 billion in the human genome) with a novel function discovered by anti-evolutionists using anti-evolutionary assumptions.

      So what's the real science-stopper, you stupid fuck?

      Cleaning Lady: "It [the genome] is looking more and more like a gigantic piece of software that runs numerous sections in parallel, each module providing inputs to the other sections so that the whole output is guided by many constantly changing variables."

      The genome certainly does not look like "a gigantic piece of software that runs numerous sections in parallel." That is exactly what you need to prove. You're assuming what you need to prove. The ENCODE results have not moved us much closer to that -- they found more regulatory elements, so that might be 8 or 9% of the genome.

      Cleaning Lady: "The trend in science is that there is a lot less junk than we thought..."

      "The trend." Like the trend that North America is approaching China via plate tectonics, at about the same speed.

      "A lot less." Way to be quantitative. What next? Casey Luskin's "Much DNA"?

      Can you quantify your "a lot less"? You're the advanced intellect here, right? Scientists are just cavemen compared to a "21st century" advanced intellect like you.

      So please, o great "21st century" advanced intellect, please enlighten us scientist-cavemen by telling us what fraction of nucleotides in the human genome are biochemically constrained in sequence by their function.

      Larry gave a specific number. Since you're a "21st century" advanced intellect and we scientists are just cavemen to your great 21st century brain, why don't you provide a number too, with references?

      Please provide a counter-argument for any of the POSITIVE arguments for junk DNA, which have been around for decades and which Larry has repeated over and over.

      If you don't answer the above simple, simple, simple, undergrad simple questions, if you evade these questions, then we have the right to call you an egomaniacal stupid fuck.

    4. @Half Crazed Cleaning Lady

      Let's start by you explaining how this new assumption fits into the grand scheme of things, considering the Genetic Load problem.

      http://sandwalk.blogspot.de/2012/09/the-encode-data-dump-and-responsibility.html#more

      -The Other Jim

    5. How much of this is really about semantics? I admit I haven't even skimmed all of the papers that were published yesterday (or read more than two thoroughly), but isn't this just a disagreement about the word "functional" and whether it should be used for all nucleotides that are transcribed, whether they appear to affect any function in a cell or not? Or are the authors really claiming that, by being biochemically active, this 80% of the genome affects phenotype? Could we as biologists agree to completely toss the term "junk DNA" (though I also thought we'd abandoned it) and rephrase it as something else that better describes its presence in a state that appears to have no current effect on phenotype? (But leave open the possibility that a function might exist but has not been identified yet, so as not to shut off avenues of research - though again I think the cleaning lady's argument on this point is spurious.) I have a suspicion that the phrase "God didn't make no junk" is at the root of some of this vitriol - that some people just can't stand the idea that there is "junk" in the genome. Maybe if it were called something else...? In the meantime, I'm still trying to think of relatively simple ways of explaining the presence of nonfunctional DNA to my students in the face of this media explosion.

    6. I don't get it - who is the stone age tribesman supposed to phone?

  7. That Guardian video is really quite embarrassing; amazing that they present this new work as if they discovered regulatory elements.

  8. @coco:

    No, it doesn't make any sense at all to use the word "functional" to mean "may or may not have a function". This is _not_ about semantics.

    The term "junk DNA" conveys exactly what it denotes - sequence that is there and may be co-opted for some function in future, but is not functional right now. Here's Sydney Brenner on junk vs trash: "Everyone knows that you throw away trash. But junk we keep in the attic until there may be some need for it." Why would we want to replace such a clear term with something else that's less widely used?

    Replies
    1. You have a point, but I guess I am still uncertain as to whether the ENCODE authors really are definitely claiming phenotype-affecting function for that 80%. I'm mostly bothered by the word functional and how nonbiologists will interpret it. As to the quote about junk/trash... junk might never be useful, it might just take too much energy to take it to the dump.

    2. Agreed, they do seem to be trying to redefine "functional" to mean something the general public will never pick up on. This is a Bad Thing (TM), and not an acceptable semantics game.

  9. The Half Crazed Cleaning Lady from Sector &%@Thursday, September 06, 2012 3:17:00 PM

    Diogenes, your statement, "Back that up with evidence, or shut your lie-hole. Idiot can't prove that with any evidence. Egomaniac asshole complimenting herself for her supposed superior intelligence. Fuck you, egomaniac narcissist", is truly awesome. I am impressed by how you express yourself when you find yourself over your head! (Primate researchers, take note of a possible research subject) Do you, perchance, have a part time job as a stevedore?

    Anonymous: Genetic Loading is a biological example of something that can be observed and modeled in computer science, the corruption and/or loss of information in a large computer program (say, a copy of MS Office). All aspects of genetic load can be modeled computationally, which, in so doing, gives us a better understanding of what is going on in biology and shows that our genome contains something a lot closer to software than what the knuckle-draggers are wont to admit. Furthermore, we can computationally model the future of an organism or population focusing on the effects of genetic loading.

    In general to all: if you take the entire sequenced DNA, and treat it as a giant software package which is becoming increasingly corrupted, then factor in the effects of genetic drift and natural selection on such a package, you will take a giant leap forward in understanding what is going on in biology in the long term .... and you might want to find yourself quietly leaving Moran's sinking ship.

    Replies
    1. Crazy Lady: "Kinda like a stone-age tribesman finding a cellphone and, since it fits no function so far as hunting and fishing goes, he chucks it into the river and continues to forage for grubs."

      Egomaniac shows up here pronouncing scientists are mere cavemen compared to her 21st-century advanced intellect.

      I asked this egomaniac simple, simple, simple, undergrad simple questions directly relevant to the fact-claims she raised.

      She weaseled out, evaded them, did not answer even one simple, simple, simple question, thus exposing herself as dumb as a box of hammers, but graced with a gigantic ego.

      Please provide evidence of your advanced 21st-century super-intellect, before which scientists are mere cavemen, by answering the simple questions, or else fuck off.

    2. All aspects of genetic load can be modeled computationally, which, in so doing, gives us a better understanding of what is going on in biology and shows that our genome contains something a lot closer to software than what the knuckle-draggers are wont to admit.

      Funny... my software doesn't mutate with each subsequent install.

      And re: your answer - based on your vague walk around, you have no idea what you are talking about, correct? You didn't even answer the question.

      One more try, then exam question 2 will be how do you explain the megabase deletion mouse phenotype.

      -The Other Jim

    3. Oops. The link did not work for the megabase deletion mouse.

      http://www.ncbi.nlm.nih.gov/pubmed/15496924

      -The Other Jim
