"Words are there for themselves"

25 April 2008

Today's Graph of Nonsense Award goes to the one reproduced by Erick Schonfeld at TechCrunch to try to explain the chocolatey goodness that lies behind Radar Networks' plan to apply Semantic Web technologies to search. It's one of those classic startup-generated graphs that purport to show how existing, in their view clapped-out, technology is going to give way to the shiny new stuff.

[Image: semsearch.jpg - the graph reproduced at TechCrunch]

The argument relayed to Schonfeld by Radar's Nova Spivack is that today's search engines, which are based on keyword searching, are running out of steam. Spivack's contention is that the volume of data is going to overwhelm keyword-based technologies soon and that what you need is to add meaning to the underlying text to help the poor old search engines out. And, in startup style, the argument runs that the established search engines cannot deal with a root-and-branch reworking of their algorithms.

For some bizarre reason, Spivack puts down tagging and natural language search as points on the way to semantic search, rather than having semantic search before natural language, which seems to be his argument. And then we get to "reasoning": presumably at the point where we hit the Singularity or something.

Before everyone gets excited about Google going the way of AltaVista, we should take a step back and have a look at what goes on with keyword searching and then listen to what one of the technique's creators had to say on the subject of the Semantic Web.

One striking aspect of search technology is the resilience of that old-hat keyword technology. The core of most search engines is an algorithm developed in the early 1970s, primarily by the late Karen Spärck Jones and Stephen Robertson, who is now at Microsoft Research in Cambridge. At the time, people believed that computers would have to understand grammar to retrieve text. The breakthrough made by Spärck Jones and Robertson, among others, was realising that you didn't need to force the computer to process grammar: the world is just too complicated for a machine to handle more than simple phrases.

But statistical processing works amazingly well, and the algorithm that resulted from this work is remarkably simple. You calculate the frequency of each word in a document but discount the words that are found in most documents. Documents only score highly for words that are found in a small subset of files. It's why names and specific terms work so well in locating documents when you are using Google.
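
To make that concrete, here is a minimal sketch of the idea in Python; the three-document corpus, the tokeniser and the document names are invented for illustration and bear no relation to how a production engine stores its index.

```python
import math
from collections import Counter

# A toy corpus; the documents and tokeniser are invented for illustration.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "lattice quantum chromodynamics calculations",
}

def tokenise(text):
    return text.lower().split()

# Term frequencies per document, and document frequency per term.
tf = {name: Counter(tokenise(text)) for name, text in docs.items()}
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

def tf_idf_score(query, name):
    """Words found in most documents ('the') contribute almost nothing;
    rare, specific words dominate the score."""
    total = 0.0
    for term in tokenise(query):
        if df[term]:
            total += tf[name][term] * math.log(len(docs) / df[term])
    return total

# The rare, specific terms pull up exactly the document that contains them.
print(sorted(docs, key=lambda d: tf_idf_score("chromodynamics calculations", d), reverse=True))
```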

Spivack contends: "Keyword search engines return haystacks, but what we really are looking for are the needles."

No, the TF-IDF algorithm does indeed return needles, almost by definition, because you don't bother indexing most of the haystack. The failing of the technology is that you have to know what the needles are called. But there are ways around the problem that do not demand the introduction of an additional layer of data.

Spivack points to the role of citation in driving the results provided by Google as a problem of keyword searching. It has very little to do with keyword search - that is Google's PageRank at work, a technique that gets overlaid onto the other statistical mechanisms to give the search engine a bit more guidance as to which pages most people find useful. Search-engine researchers are sceptical as to how much Google even uses PageRank to rank pages these days. As spam and SEO techniques get more intrusive, I can see PageRank getting dumped entirely.
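
For the curious, the link-analysis idea can be sketched in a few lines; the three-page graph and the 0.85 damping factor below are invented for illustration, and the real system is vastly more elaborate.

```python
# An invented three-page link graph, purely for illustration.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

# Power iteration: each page shares its rank among the pages it links to.
for _ in range(50):
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(rank)  # pages that attract links from well-linked pages score higher
```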

The important thing to bear in mind about today's search is not that it works using keywords: it's that it uses statistics, not any form of what you might call deterministic processing. The mechanisms have been augmented over the years into what the head of search specialist Autonomy, Dr Mike Lynch, calls "keyword plus".

Lynch contends that the Autonomy engine has gone beyond the TF-IDF algorithm of Spärck Jones and Robertson. But it still, very much, uses statistics. Autonomy uses Bayesian inference, which dispenses with the concept of word frequency and replaces it with an approach based on the probability of finding certain words within a portion of the document. One of the problems with Bayesian inference is that, if you are not careful, the number of calculations you need to make is much higher than with keyword indexing. But there are some tricks you can pull to reduce the workload.
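
Autonomy's engine is proprietary, so the following is only a rough sketch of the general Bayesian idea: rank documents by the probability they assign to the query words under a smoothed word-probability model. The two-document corpus and the smoothing constant are assumptions of mine, not Autonomy's method.

```python
import math
from collections import Counter

# An invented two-document corpus, purely for illustration.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}

def log_probability(query, text, alpha=0.5):
    """Log-probability of the query words under a smoothed
    word-probability model of the document, rather than a raw
    frequency count."""
    words = text.lower().split()
    counts = Counter(words)
    denom = len(words) + alpha * (len(counts) + 1)
    total = 0.0
    for term in query.lower().split():
        total += math.log((counts[term] + alpha) / denom)
    return total

# The document that makes the query words most probable wins.
print(max(docs, key=lambda name: log_probability("dog chased", docs[name])))
```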

The important thing is that no successful search engine in use today has to understand language by attempting to deconstruct sentences using grammar rules or trying to extract meaning: it just has to use statistics. Spärck Jones' final lecture, recorded about a month before her death from cancer at the age of 71, describes how successful statistical processing has been in dealing with language. In fact, even language researchers have found statistical models more successful than deterministic grammar models in recent years.

As Robertson put it to me when I interviewed him a week ago, there is plenty of scope for statistical processing in search and similar applications.

Statistical methods that evolve from the ones in use today probably have a much brighter future than any kind of semantic search. The problem with the Semantic Web is that you have to add a new layer of information to the information you have already created, through the medium of XML tags. I have nothing against annotating certain types of information. Microformats should make the meaning of data such as names and addresses more apparent to the machine.

But the idea of annotating massive chunks of text with helpful tags is, frankly, unworkable. You have a bunch of words that, taken in context, mean something. With the Semantic Web, you are then adding another bunch of words to, hopefully, provide hints that disambiguate the text. And you are doing this at the same time that research is using statistics to let the context around those words provide clues to the meaning.
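
As a toy illustration of that statistical alternative, here is a sketch that simply counts which words appear near which others; the two sentences and the window size are invented, but the neighbours of an ambiguous word such as "bank" are enough to hint at its sense without a single added tag.

```python
from collections import Counter, defaultdict

# An invented pair of sentences and a window of two words, for illustration.
sentences = [
    "the bank approved the loan and the mortgage",
    "the river bank was muddy after the flood",
]
window = 2

contexts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                contexts[word][words[j]] += 1

# The neighbours of "bank" differ by sense: 'approved' in the financial
# sentence, 'river' and 'muddy' in the riverside one - context as the clue.
print(contexts["bank"].most_common())
```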

Or as Spärck Jones put it in her lecture: "Words are there for themselves. They are not being replaced by other codes."

One big problem with the Semantic Web is that, to make it useful to a computer, you have to produce a dictionary of meanings: an ontology. Building ontologies is hard. Very, very hard. People in fields such as medicine are still having trouble, and they've been doing it for a while.

Maybe tagging provides a halfway house. Some tags are useful but these turn out to be recommendations more than tags of meaning. Lynch explains: "Tagging has its uses but they are much more limited than people realise. It is fundamentally flawed as a retrieval technology because of specificity. There are not enough layers in the tagging."

The problem is that you need to add a lot of tags to provide the specificity you need, other than simply pointing to a key word itself and making it, well, a keyword. "People say they will add more tags," Lynch says. "But by the time you add them all, the probability that the person looking won’t go in the same order through the tag hierarchy is very high.

"The other thing is when you go to the 'right on' social ideas, such as folksonomy," Lynch adds. "People in different locales put on different tags. Then you get ‘metatoxins’: when they put the tag Britney Spears on a YouTube video it is actually them painting a wall."

"Where tagging is very useful is working out whether a document is helpful. Say, if you get people to tag stuff as tutorial documents or whether the document is useful, getting a human to make that comment is better than a computer."

If you talk to artificial intelligence researchers, they don't mind having the information provided by Semantic Web technologies or tags from keen Web 2.0 users. They will, as it were, take as much information about meaning as they can get. But, when it came to search and information retrieval, Spärck Jones had no illusions about the effectiveness of the Semantic Web:

"The idea of a semantic web as a universal characterisation of knowledge strikes me as misconceived. There can be no ontology that will work for everything and everyone. There may be many specific ontoslogcal horses for particular courses. But the universal means for getting from one to another cannot but be a sort of lightweight ropeway.

"Something lke that does establish links and make it possible to move around. This is what natural language tools like statistical association between words provide. People will get a start moving from simple words and phrases...Natural language is general but it leads to particulars."

Ripping out the guts of today's search engines and replacing them with a very different engine is not going to work. The search engine of choice in ten years' time may not be Google: it could have gone the way of AltaVista. But it is unlikely to be replaced by something that does not use a statistical or probabilistic engine at its core.

3 Comments

I couldn't agree more. The problem with search today is that we treat all needles equally, and while some of us are very good at creating a search that will generate what we want, most people aren't.

The scenario that I have been spinning for the past few days is what if Oprah got hit by the "L" in Chicago. Would you search for Oprah killed by train? Or Winfrey in L accident? Or Oprah dies on subway?

In an ideal world any of these would bring up the "best" page about the topic, but that isn't often the case.

And then there is always this important insight on the topic:

http://www.lyricsfreak.com/t/tom+tom+club/wordy+rappinghood_20138758.html

@Luke

And to think they said: "Words won't find no solution".