Your top 100 of blogger bloviation, or something

29 October 2007

The problem with some scientific research is not the research itself but the way people choose to use it. What better example than research on the blogosphere itself to show how you can twist a reasonably simple study for self-interested ends or just get it completely back-asswards? The reaction to the study itself is potentially the source of new research into blogger psychology: "I bloviate therefore I am".

A team from Carnegie Mellon University decided to look at how blogs link to each other as part of a wider study into where to put sensors to detect pollution or disease as quickly as possible without spending a shedload of money to put them everywhere. The slightly non-intuitive conclusion is that points with a high overall flow do not provide the best positions - it is the small channels with the largest effect on the whole network where you want those sensors placed.

The team picked blogs as a study area largely because blogs have some interesting parallels with the spread of contagion through a network. They also make it easy to study that spread. They are time-stamped; they link to other blogs. You can trace the flow of 'information' relatively easily.

The researchers picked a large subset of blogs – 45 000 from a possible total of 2.5 million – and crunched through their links, taking account of which links went outside the dataset and which remained inside. They monitored posts that pointed to largish information cascades – effectively blogger pile-ons. To qualify as a cascade, a subject had to accumulate at least 10 posts. That's big enough for a small pile-on in my book.
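As a rough sketch of that cascade filter – the 10-post threshold is from the study, but the data layout here is my own invention, not the CMU team's actual format – you could group time-stamped posts by the subject they link to and keep only the groups that clear the bar:

```python
from collections import defaultdict

CASCADE_THRESHOLD = 10  # a subject needs at least 10 posts to count as a cascade

def find_cascades(posts):
    """Group time-stamped posts by subject and keep only subjects that
    attracted enough posts to count as a pile-on.

    `posts` is a list of (timestamp, blog, subject) tuples -- an invented
    layout for illustration only.
    """
    by_subject = defaultdict(list)
    for timestamp, blog, subject in posts:
        by_subject[subject].append((timestamp, blog))

    cascades = {}
    for subject, entries in by_subject.items():
        if len(entries) >= CASCADE_THRESHOLD:
            # sort by time so we can later ask who piled on early
            cascades[subject] = sorted(entries)
    return cascades
```

Nothing clever going on: the time-stamps and links the post mentions are all you need to reconstruct who joined each pile-on, and when.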

The CMU team then computed which blogs – from the subset they picked – were most likely to take part in those pile-ons, as against blogs with a high proportion of posts outside any cascade. That gave them a cost function, which led to a final list of 100 'top' blogs.
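A crude stand-in for that cost function – the study's real objective is more sophisticated, and this scoring is my own simplification – might rate each blog by how many of its posts landed inside a cascade, relative to its overall posting volume:

```python
def rank_blogs(post_counts, cascade_counts, top_n=100):
    """Rank blogs by a toy cost function: cascade participation per post.

    `post_counts`    -- dict mapping blog -> total posts in the corpus
    `cascade_counts` -- dict mapping blog -> posts that were part of a cascade
    Both layouts are invented for this sketch, not the CMU team's.
    """
    scores = {
        blog: cascade_counts.get(blog, 0) / total
        for blog, total in post_counts.items()
        if total > 0
    }
    # highest cascade-per-post ratio first
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```

Note what this rewards: not size or quality, just the habit of posting into pile-ons – which is rather the point of the rest of this post.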

This is where the fun started. People on the list found that they were on some form of top 100 and started to brag about it. It's scientific so it must be true, was Neville Hobson's considered opinion. Then people started to wonder why a really weird bunch of blogs was considered to be the researchers' top 100. A commenter at Nick Carr's Rough Type wondered why a blog that had effectively been run off the farm by an angry mob was in the listing. Had they spent about ten seconds looking at the text at the top of the list, they might have realised that the corpus used by CMU came from 2006. That's right folks, this is not a list of current blogs - only those active up to about a year ago.

There is one other point that those crowing about being on the list might want to bear in mind. If it is any kind of ranking, this is a list of the pile-on addicts of 2006. If you wanted to know where to rubberneck at the biggest accidents on the blogiverse a year ago, these were your go-to guys.

Based on this, I think there is a strong argument for building a feedreader that uses this lot as a filter against your real list of RSS feeds: it would take out the mob rule and leave you with a lot more original information. (To be fair, there are some on the list I would want to keep in the feedreader).

The irony that this post itself is part of a pile-on is not lost on me.


Not sure if I can comment and thank you for bursting this bubble or if that makes me also guilty by association.

It's a tricky one, but you did the right thing ;-)

I can always count on you to do the research, filtering and critical thinking, Chris. Not to mention the unique and amusing interpretation that characterizes your posts. Thanks for that. (BTW, I checked and bloviation "isn't defined yet" in the Urban Dictionary: Consider submitting for the betterment of all-things-bloggy lexicon lovers.)

My question to you is wherein lies the greater fault: the communication of the study from Carnegie Mellon University or the interpretation of it by interested parties, whether or not their blog was cited on the list?

I don't think there was much at fault in the communication from CMU. The students made it pretty clear what they were after. Although a lot of people went "ooh it's too hard, it's all full of maths", the natural-language description they gave of what they were doing was hardly opaque.

However, in putting together a top 100, the team dangled a carrot too tempting for some to resist. Nothing works as link-bait like a top 10 or a top 100. I guess that was the trigger for a lot of people - why publish a ranking unless it was meant to be used as one?

But, if people had followed through the idea of this being a serious ranking of the blogs to read, they might have noticed a 'flaw' in the algorithm - it doesn't appear to correct for overlap among the ranked blogs. For example, in a real "high efficiency" reading list for posts most likely to spawn other posts, would you expect both Instapundit and Michelle Malkin to be there? The overlap there, I would expect to be very high (in terms of subject rather than treatment). In the context of an algorithm that you can generalise to water-monitoring or infection tracking, that's actually not a flaw. But it might have raised questions about the ranking if people had dug deeper believing it to be a true top 100.
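To make the overlap point concrete: a selection that did correct for it would pick blogs greedily by marginal gain – how many cascades each blog adds beyond those already covered by earlier picks – rather than by raw score. A toy sketch, with an invented data layout:

```python
def greedy_pick(coverage, k):
    """Pick k blogs greedily by marginal gain: each step takes the blog
    covering the most cascades not already covered by earlier picks.

    `coverage` maps blog -> set of cascade ids it took part in
    (an invented layout for illustration).
    """
    chosen, covered = [], set()
    for _ in range(k):
        best = max(coverage, key=lambda b: len(coverage[b] - covered))
        if not coverage[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= coverage[best]
    return chosen
```

Under this kind of selection, two blogs that always pile on to the same subjects would not both make the cut: once one is picked, the other's marginal gain is close to zero.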

The problem is, in the rush to post, people don't tend to look deeply enough. Also, when casting around for things to post about, people tend to discount the signals telling them not to, choosing to look only at the things that encourage them to post.