Chris Anderson of Wired has declared the scientific method dead. And it's all thanks to Google, apparently, and the mass of data it is accumulating. Maybe Google really is making us stupid after all, because the reasoning behind Anderson's conclusion rests on some shaky foundations.
Did Peter Norvig, Google's research director, really say: "All models are wrong, and increasingly you can succeed without them"? Because, if so, he seems to have misinterpreted what his own company has been doing. Yes, search and its related technologies do not rely on language models. But the core of all that Google does right now is based on a statistical approach that makes some basic assumptions about how language works. You might call it a model.
Anderson postulates a world based on machine learning, where the computer crunches through the data to come up with predictions.
"This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear ... With enough data, the numbers speak for themselves."
Yet machine-learning algorithms depend on the construction of some kind of model. It is not necessarily a deterministic model in the way that classical mechanics is, but invoking statistics does not make a technique any less model-based. What are models for? They let you make predictions about what will happen given some inputs.
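To make the point concrete, here is a minimal sketch of the simplest "let the data speak" technique there is, least-squares regression. Even here, the data does not speak for itself: the code assumes up front that the relationship is a straight line, and the prediction comes from that assumed model, not from the raw numbers.

```python
# Even plain least-squares regression assumes a model: y = a*x + b.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var                # slope: only meaningful under the linear assumption
    b = mean_y - a * mean_x      # intercept
    return a, b

def predict(a, b, x):
    return a * x + b             # the model makes the prediction, not the data

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(predict(a, b, 10))  # 21.0
```

Swap in a fancier learning algorithm and the functional form gets more flexible, but there is always a form: some assumption about how inputs map to outputs.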
OK, some branches of science are terrifyingly complex. Biology is the poster child for complexity. If you just take how DNA gets transcribed into RNA in a simple bacterium, there are thousands of potential interactions on the way to an RNA that will ultimately produce a protein. You get proteins sitting on the DNA that either encourage transcription or slow it down. Others bend the DNA into weird shapes to activate a gene, but only when the conditions are just right. Yes, building a model of all these interactions is tough. But it is probably the only way of making sense of the processes, and it is how biologists are making sense of the deluge of data. This is what systems biology is about.
They are using machine-learning and data-mining techniques to uncover patterns in the data. They are dredging through the seemingly countless genome and other 'ome databases to find data that they can plug into — yes, you guessed it — models.
Professor Jaroslav Stark of Imperial College sees modelling as a key to understanding what goes on inside living systems precisely because models are often inaccurate. For him, the fact that a model diverges from reality provides important clues to interactions that need to be taken into account. And models provide a way to probe interactions where it is simply not possible to use traditional methods such as selectively turning genes off, because that introduces other interactions.
The problem with Anderson's argument on this point is the leap it makes: because what gets taught in school biology has turned out to be inaccurate, models must be taking us further from understanding. But that is what science is like: it finds new information, assimilates it and moves on. The biologists aren't finished yet, and aren't likely to be for another 30 years or so, even if they're lucky.
Anderson cites the work by J Craig Venter to sequence bacterial life in the oceans. A yacht is sailing around the world with a bucket to collect samples that get progressively filtered down until all you have left is bacterial DNA. This then gets dumped into a massive array of gene sequencers that randomly chop up the DNA with enzymes to produce fragments that can be separated to reveal the sequence of bases they contain. Computers then attempt to crunch through that data to reassemble the fragments into individual genomes. In practice, it's not possible to do that final step. At least, not right now. But it is possible to see how much genes diverge among similar bacteria.
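A toy sketch shows why that reassembly step is so hard. This is not Venter's actual pipeline, just the textbook greedy approach: keep merging the two reads with the longest suffix/prefix overlap. With three clean fragments it works; with millions of short, error-laden fragments from thousands of mixed genomes, overlaps become ambiguous and the greedy guesses go wrong.

```python
# Toy greedy shotgun assembly: merge reads by longest suffix/prefix overlap.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:           # no overlaps left: contigs cannot be joined further
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGCC", "GCCTTA", "TTACGT"]))  # ['ATGGCCTTACGT']
```

Real assemblers use far more sophisticated graph-based methods, but the underlying ambiguity is the same, which is why the final reassembly into complete genomes defeats them here.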
Venter has not really discovered unknown species of bacteria, as Anderson writes, because genetic sequence alone does not identify a species. Some of the putative genomes are very different to others, but Venter himself says that there is no percentage difference between genomes that will indicate a new species.
Basically, to identify a species, you have to go and look at how it lives and what it looks like. Maybe there is a shortcut to that process that involves the genome, but until biologists fully understand the interplay between genes and the other bits of the genome, that is not going to be possible. It's probably easier with bacteria as they have comparatively little junk DNA, but it could still take some time. And the only way to build that model — even if it's a statistical one — is to assemble the genomes individually and examine the organisms. Not simply take a best guess as to how millions of fragments might match up in a genome.
What did Venter's team find? Based on predictions of the proteins that the assembled genomes produce, it seems that bacteria can have tuned versions of the light-sensitive protein proteorhodopsin. A single amino acid change in that sequence can alter the wavelength of light that the protein absorbs and helps convert to energy. But that did not come just from blind number-crunching of the kind that Anderson suggests is the future. It was based on having a model of how rhodopsin works and then matching the gene data to it. Statistics helps, but there's still a model in there.
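The shape of that model-guided matching can be sketched in a few lines. Everything specific below is a placeholder — the tuning position and the residue-to-wavelength mapping are illustrative, not measured values — but it shows the point: the interesting question ("what does this residue do to the absorption spectrum?") only exists because a model of the protein says that position matters.

```python
# Illustrative only: classify proteorhodopsin-like variants by the residue at a
# single spectral-tuning position. TUNING_POSITION and TUNING_TABLE are
# placeholders supplied by a model of the protein, not derived from the data.
TUNING_POSITION = 104           # 0-based index into an aligned sequence (assumed)
TUNING_TABLE = {"L": "green-absorbing", "Q": "blue-absorbing"}

def classify_variant(aligned_seq):
    residue = aligned_seq[TUNING_POSITION]
    return TUNING_TABLE.get(residue, "unknown tuning")
```

Without the model supplying the position and the mapping, the sequence data is just letters; the statistics only become informative once the model says where to look.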
Big computers can certainly help with the creation and execution of models. But it seems unlikely that blindly unleashing petaflop after petaflop on a problem is going to do much for machine learning.
Update: Now Kevin Kelly has chipped in, citing Google's translation system as evidence for the "stick it all in the Overmind/OneMachine" approach. Statistical language models have been kicking the structural models around the park for close to 40 years, and the techniques that work for search have some features in common with those that work in translation. What's happened with the web is that researchers have access to a huge corpus of text on which to train the systems. People are still working on the algorithms, and they have to pick the training corpus carefully so as not to pollute the learning algorithm: the computers are just doing the boring legwork.
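And those statistical language models are, again, models. A minimal sketch: a bigram model trained on a tiny corpus. The "model" is the Markov assumption baked into the code — that each word depends only on its predecessor — which no amount of data ever chose for itself.

```python
# A minimal statistical language model: bigram counts over a training corpus.
# The modelling assumption is that each word depends only on the previous one.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def prob(counts, prev, word):
    """Estimated probability of `word` following `prev`."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

model = train_bigrams(["the cat sat", "the dog sat", "the cat ran"])
print(prob(model, "the", "cat"))  # probability of "cat" after "the": 2/3
```

More data sharpens the counts, which is why the web has been such a boon, but the structure those counts hang off was designed by a researcher, not discovered by the machine.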
Kelly discounts the idea of the approach killing scientific method. But dreams up a new term for it: "correlative analytics". This is hardly new. And questionably useful. As Robin comments below on the original version of this post, the finance community has been there, done that. Momentum trading is one 'algorithm' at the simple end of the spectrum. But it's basically taking outputs from a system and trying to use them as inputs. Not surprisingly, the results aren't all that spectacular.
If people want to believe that they can teach their computer biology by stuffing it full of all the genomics, proteomics, and other 'omics databases they can lay their hands on, I see no harm in letting them do it. However, the people doing real work on this stuff will be asking themselves: how was the data collected; what were the conditions? In short, while they may not read the data, they will attempt to understand how it came into being and then try to fit it into a model. It will get easier to automate some of those steps as labs adopt more standardised ways of generating the data, but we're still a long way from just stuffing bytes into a machine and letting it figure things out for itself.