Scientific method's death a little premature

25 June 2008

Chris Anderson of Wired has declared the scientific method dead. And it's all thanks to Google, apparently, and the mass of data it is accumulating. Maybe Google really is making us stupid after all, because the reasoning behind Anderson's conclusion is built on some shaky foundations.

Did Peter Norvig, Google's research director, really say: "All models are wrong, and increasingly you can succeed without them"? Because, if so, he seems to have misinterpreted what his own company has been doing. Yes, search and its related technologies do not rely on language models. But the core of all that Google does right now is based on a statistical approach that makes some basic assumptions about how language works. You might call it a model.

Anderson postulates a world based on machine learning, where the computer crunches through the data to come up with predictions.

"This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear... With enough data, the numbers speak for themselves."

Yet, machine-learning algorithms depend on the construction of some kind of model. It is not necessarily a deterministic model in the way that classical mechanics is, but just because it invokes statistics does not make it any less a model-based technique. What are models for? They allow you to make predictions about what will happen given some inputs.
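The point can be made concrete with a toy sketch (my illustration, not Anderson's): even the plainest statistical fit is a model, because the fitted parameters are exactly what let you predict outputs for inputs you have never seen.

```python
# A "statistical" learner is still a model: fit y = a*x + b by least
# squares, then use the fitted parameters to predict unseen inputs.
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / \
            sum((x - mean_x) ** 2 for x, _ in points)
    intercept = mean_y - slope * mean_x
    return slope, intercept

data = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1)]
a, b = fit_line(data)

def predict(x):
    # The prediction only exists because of the fitted model parameters.
    return a * x + b

print(predict(5))
```

Swap the straight line for a neural network or a statistical machine-translation system and the shape of the argument is unchanged: the numbers never speak for themselves, they speak through the model.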

OK, some branches of science are terrifyingly complex. Biology is the poster child for complexity. If you just take how DNA gets transcribed into RNA in a simple bacterium, there are thousands of potential interactions that get you to an RNA that will ultimately produce a protein. You get proteins sitting on the DNA that either encourage transcription or slow it down. Others bend the DNA round in weird shapes to activate a gene, but only when the conditions are just right. Yes, building a model of all these interactions is tough. But it is probably the only way of making sense of the processes and it is the way that biologists are making sense of the deluge of data. This is what systems biology is about.

They are using machine-learning and data-mining techniques to uncover patterns in the data. They are dredging through the seemingly countless genome and other 'ome databases to find data that they can plug into — yes, you guessed it — models.

Professor Jaroslav Stark of Imperial College sees modelling as a key to understanding what goes on inside living systems precisely because models are often inaccurate. For him, the fact that a model diverges from reality provides important clues to interactions that need to be taken into account. And they can provide a way to probe interactions where it is simply not possible to use traditional methods such as turning genes off selectively because that introduces other interactions.

The problem with Anderson's argument on this point is the implication that, because what gets taught about biology at school has turned out to be inaccurate, models are taking us further away from understanding. But that is what science is like: it finds new information, assimilates it and moves on. The biologists aren't finished yet, and aren't likely to be for another 30 years or so, even if they're lucky.

Anderson cites the work by J Craig Venter to sequence bacterial life in the oceans. A yacht is sailing around the world with a bucket to collect samples that get progressively filtered down until all you have left is bacterial DNA. This then gets dumped into a massive array of gene sequencers that randomly chop up the DNA with enzymes to produce fragments that can be separated to reveal the bases they contain. Computers then attempt to crunch through that data to reassemble the sequences into individual genomes. In practice, it's not possible to do that final step. At least, not right now. But it is possible to see how much genes diverge among similar bacteria.
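For a flavour of what that reassembly step involves, here is a toy greedy assembler (my illustration only, nothing like Venter's actual pipeline): it repeatedly merges the two fragments with the longest suffix/prefix overlap. It copes with three clean fragments from one genome; it is easy to see why millions of fragments from thousands of mixed genomes defeat the approach.

```python
# Toy greedy shotgun assembly: keep merging the pair of fragments with
# the longest suffix/prefix overlap until one sequence remains.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)]
        frags.append(merged)
    return frags[0]

print(assemble(["ATTAGACC", "GACCTGCC", "TGCCAGTA"]))
```

With DNA from many similar species in the same bucket, overlapping fragments from different genomes look identical to this kind of matching, which is one reason the final assembly step is not currently possible for environmental samples.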

Venter has not really discovered unknown species of bacteria as Anderson writes because genetic sequence alone does not identify a species. Some of the putative genomes are very different to others, but Venter himself says that there is no percentage difference between genomes that will indicate a new species.

Basically, to identify a species, you have to go and look at how it lives and what it looks like. Maybe there is a shortcut to that process that involves the genome but until biologists fully understand the interplay between genes and the other bits of the genome, that is not going to be possible. It's probably easier with bacteria as they have comparatively little junk DNA, but it could still take some time. And the only way to build that model — even if it's a statistical one — is to assemble the genomes individually and examine the organisms. Not simply take a best guess as to how millions of fragments might match up in a genome.

What did Venter's team find? Based on predictions of the proteins that the assembled genomes produce, it seems that bacteria can have tuned versions of the light-sensitive protein proteorhodopsin. A single amino acid change in that sequence can alter the wavelength of light that the protein absorbs and helps convert to energy. But that did not come just from blind number-crunching of the kind that Anderson suggests is the future. It was based on having a model of how rhodopsin works and then matching the gene data to it. Statistics helps, but there's still a model in there.

Big computers can certainly help with the creation and execution of models. But it seems unlikely that unleashing petaflops and petaflops on a problem blind is going to do much for machine learning.

Update: Now Kevin Kelly has chipped in, citing Google's translation system as evidence for the "stick it all in the Overmind/OneMachine" approach. Statistical language models have been kicking the structural models around the park for close to 40 years, and the techniques that work for search have some features in common with those that work in translation. What's happened with the web is that researchers have access to a huge corpus of text on which to train the systems. People are still working on the algorithms and they have to carefully pick the training corpus so as not to pollute the learning algorithm: the computers are just doing the boring legwork.
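A statistical language model is still a model in exactly this sense. A minimal sketch (my own toy, not Google's system): a bigram model assumes each word depends only on its predecessor, and estimates those probabilities from whatever corpus you train it on, which is why picking the training corpus carefully matters so much.

```python
from collections import Counter, defaultdict

# Toy bigram language model: the modelling assumption is that each word
# depends only on the word before it; probabilities come from corpus counts.
def train_bigram(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def prob(counts, prev, word):
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

corpus = ["the cat sat", "the cat ran", "the dog ran"]
model = train_bigram(corpus)
print(prob(model, "the", "cat"))  # "cat" follows "the" in 2 of 3 sentences
```

A polluted corpus skews those counts directly, so the humans choosing the training data are doing real modelling work; the computers, as above, are just doing the boring legwork of counting.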

Kelly discounts the idea of the approach killing the scientific method but dreams up a new term for it: "correlative analytics". This is hardly new, and questionably useful. As Robin comments below on the original version of this post, the finance community has been there, done that. Momentum trading is one 'algorithm' at the simple end of the spectrum, but it's basically taking outputs from a system and trying to use them as inputs. Not surprisingly, the results aren't all that spectacular.
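Momentum trading really is that simple to state. A toy version (my sketch, not any real trading system): use yesterday's price move, an output of the market, as the input signal for today's position.

```python
import random

# Toy momentum 'algorithm': go long after an up move, short after a down
# move -- i.e. feed the system's own output (the last price change) back
# in as the input signal.
def momentum_pnl(prices):
    pnl = []
    for t in range(1, len(prices) - 1):
        signal = 1 if prices[t] > prices[t - 1] else -1  # yesterday's move
        pnl.append(signal * (prices[t + 1] - prices[t]))  # tomorrow's move
    return pnl

# On a pure random walk the signal carries no information about the next step.
random.seed(0)
prices = [100.0]
for _ in range(1000):
    prices.append(prices[-1] + random.gauss(0, 1))
print(sum(momentum_pnl(prices)))
```

If prices were a pure random walk, the correlation the strategy relies on would not exist at all; whatever edge real momentum traders find is small, which is the point about unspectacular results.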

However, if people want to believe that they can teach their computer biology by stuffing it full of all the genomics, proteomics, and other 'omics databases they can lay their hands on, I see no harm in letting them do it. But the people doing real work on this stuff will be asking themselves: how was the data collected; what were the conditions? In short, while they may not read the data, they will attempt to understand how it came into being and then try to fit it into a model. It will get easier to automate some of those steps as labs adopt more standardised ways of generating the data, but we're still a long way from just stuffing bytes into a machine and letting it figure it out for itself.


Nice to know that I'm not the only one who spat his coffee when he read that Wired headline. The maxim of crap in = crap out is as true as ever.

However, I think it accurately reflects how people are not thinking enough about the limits of computers. Theories are going untested, like the theory that machines can go and trade on the markets and make money in the long term for the investor chumps...

The original use of the term data mining was pejorative: if you have enough data and search long enough, you can always find some model that fits your data arbitrarily well. The recent bestsellers "Fooled by Randomness" and "Black Swan" point out that we often see patterns in randomness because of our proclivity for data mining in this pejorative sense.
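That pejorative sense is easy to demonstrate. In this sketch (mine, not from either book), two thousand series of pure noise are dredged for the one that best "fits" a noise target; a strong-looking correlation always turns up.

```python
import random

# Dredge enough random series and some will correlate well with a random
# target -- data mining in the pejorative sense described above.
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
target = [random.random() for _ in range(20)]
candidates = [[random.random() for _ in range(20)] for _ in range(2000)]
best = max(abs(correlation(c, target)) for c in candidates)
print(best)  # typically well above 0.5: a "pattern" found in pure noise
```

None of these series has any relationship to the target; the apparent pattern is purely an artefact of searching a large enough haystack.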

I don't know much about Chris Anderson's educational background, but Peter Norvig surely knows better than to think that "correlation supersedes causation". You can't get through an elementary course in statistics without seeing myriad examples where correlation misleadingly suggests causation (e.g., higher death rates in nicer climates because more people retire there).

Say what you will about the quality of our available scientific models, but the scientific method of hypothesis testing is here to stay.

You comment that it is not possible to do normal science on the genome, since it is not strictly accurate to say that things can be turned off one at a time when everything interacts closely with everything else.

A study published in the journal Circulation about the formal testing of new products found that:

1. Products hit many “unintended targets” as well as the “intended therapeutic target”. In the pharmaceutical industry there are no monitoring programmes for the effect on "unintended targets".

2. Pharmaceutical companies do not continue to study the long-term effects of a product after it has been released on the market.

3. Product trials only test products used on their own – but most products are used in combination with other products. There is no testing to see what happens when different products are used together.

How do you factor “unintended targets” into risk assessment? You can’t speculate about unknown risk. Does this make sense of GE industry assurances that they are unaware of any risk in genetic engineering?

Is it realistic to ask that industry makes product safety and risk its highest priority? Commercial survival has to be their highest priority or staff will lose their jobs. Research is expensive. A new product is only profitable for the first ten years of its life, so there is intense pressure to generate credible-looking scientific research and rush to market before the thing is fully tested or understood.

Where stock-market-listed organisations "own" the science, it is proving to be a public health and safety risk because companies are punished through share price if they publish negative results about that science. It is unrealistic to expect companies to publish negative data on their own products. It is like asking players to referee their own football match.

Use of patents in the genetic engineering arena will inhibit the sharing of knowledge and will act as a disincentive to safe product development. The stock market rewards dangerous practice. We need another way of rewarding invention.

Peer review is also past its sell-by-date. We need something more robust than a polite gentleman’s agreement to underwrite global public health and safety. Personal Injury law suits currently do the work of peer-review. How much sense does that make?


Lessons Learned From Recent Cardiovascular Clinical Trials:
Part I: DeMets and Califf, Circulation 106(7): 880
Part II: Circulation 106(6): 746

Principles From Clinical Trials Relevant to Clinical Practice:
Part I: Califf and DeMets, Circulation 106(8): 1015
Part II: Califf and DeMets, Circulation 106(9): 1172