This continues the ongoing story that began here and continues here. This blather will be summarized to some extent in the introduction to “Looking at Liblogs: The Great Middle,” scheduled for the September 2006 Cites & Insights.
I actually did the same set of “reach” measures as last year for the full set of 282 potential candidates in the “great middle” category–well, more or less. Having a spreadsheet that includes name and URL for each candidate blog, prepared from the OPML output from Bloglines, made searching for links quite rapid–certainly much faster than whatever method I used last year.
This time around, I checked the link: count for Google (which is known to be somewhat meaningless–Google admits it’s only a partial result), MSN Search, and Yahoo! (in lieu of last year’s AllTheWeb, noting that Yahoo! owns AllTheWeb). But I also recorded one other figure, which I believe is much more meaningful than link: returns–that is, how many results Yahoo! would actually show me, using its default deduping “very similar” algorithms.
Before offering up some quick ratios for the 282 candidate blogs, remember that these blogs exclude around 90 librarian blogs with more than 196 Bloglines subscriptions as well as close to two hundred with fewer than 19 Bloglines subscriptions. Thus, most of the blogs likely to have the highest “reach” were excluded up front.
- Google: Results ranged as high as 5,370, with some having no Google links; no ratio is possible.
- MSN: Results ranged as high as 34,669–and again, some had no links, so no ratio is possible.
- Yahoo!: Every blog in the great middle had at least five Yahoo! links, with a high of 179,000 or a ratio of 35,800:1, within this “middle” group.
- Yahoo Results: This is, to be sure, artificially constrained (Yahoo always stops at or before 1,000), but only three candidate blogs reached the 1,000 limit. The smallest number was 2, for a 500:1 ratio. This number seems to be much less influenced by blogrolls and other factors that artificially inflate link results.
- “Reach” using 2005 formula: The highest was 13,497; the lowest, 84. That’s a ratio of 161:1, considerably smaller than last year’s 7,778:1.
- “Reach” using modified formula: When I adjusted deflators for the three link counts to match this year’s totals, the highest came down to 10,590, while the lowest only declined to 82, reducing the ratio to roughly 129:1. Note that the “top 60” last year had a ratio of 65:1 between highest and lowest reach.
- Plausible reach: I calculated a new ratio based on twice the Bloglines count plus the Yahoo Results count. That yielded a high of 1,387 and a low of 51, for a ratio of only 27.2:1.
I then trimmed the candidate set slightly, by dropping 9 blogs with “plausible reach” counts above 700 and 21 below 70, leaving 253 candidate blogs. Note that the ratio between highest and lowest “plausible reach” is only 10:1, a fairly narrow range–and the same as the ratio between most and fewest Bloglines subscribers.
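For anyone who wants to replicate the arithmetic, the “plausible reach” calculation and trimming step above can be sketched in a few lines of Python. The high (1,387) and low (51) figures are from the post; the individual Bloglines/Yahoo splits and blog names are made up for illustration.

```python
def plausible_reach(bloglines_subs, yahoo_results):
    """Plausible reach: twice the Bloglines subscription count plus the
    deduped Yahoo Results count (which Yahoo itself caps at 1,000)."""
    return 2 * bloglines_subs + yahoo_results

# Hypothetical candidates: (blog name, Bloglines subs, Yahoo Results).
# The splits are invented; only the resulting extremes match the post.
candidates = [
    ("high-reach-blog", 194, 999),  # reach 1,387: trimmed (above 700)
    ("middling-blog",    60, 300),  # reach   420: kept
    ("low-reach-blog",   20,  11),  # reach    51: trimmed (below 70)
]

# Trim: drop blogs with plausible reach above 700 or below 70.
kept = [
    name for name, subs, results in candidates
    if 70 <= plausible_reach(subs, results) <= 700
]
```

The surviving range, 70 through 700, is where the 10:1 ratio between highest and lowest plausible reach comes from.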
I’ll make the “reach spreadsheet” available when I publish the article, for those who want to play with the sorting and formulas.
I’m not using any reach factor, including Plausible Reach, as part of the metrics for individual blogs. These factors were only used to narrow the group of candidate blogs to a somewhat manageable number. I believe the new number is more, well, plausible than last year’s–but I don’t believe any “reach” numbers (specifically including Technorati, PubSub, et al) really tell you that much about a blog, particularly if the blogger isn’t aiming to be a superstar.
Now begins the interesting part: Looking at the blogs themselves, preparing brief comments, and preparing a set of metrics that highlights interesting aspects of individual blogs without attempting to rank them. I’ve dropped some metrics (Technorati, BlogPulse, link density). I’m dropping any comments about the “voice” of a blog, which makes particularly good sense since I’ve included blogs in languages other than English.
I’ve added a couple of “interesting items,” one of which is a metric of sorts: The topic of the first post within the test period (March through May 2006), and the topic and comment count for the post with the most comments during that period. The latter (which could disappear during the investigation) is based on a suggestion made on a WebJunction forum.
What order will blog notes appear in? Silly as it may be, the best choice turns out to be alphabetic: It’s not hierarchical and it makes sense to most people who use the Latin alphabet. (Hey, my blog–which won’t be part of the study–certainly isn’t advantaged by an alphabetic sequence!)
Two notes in closing, for now:
- I’m still soliciting feedback from bloggers who can ascertain the average daily number of sessions during May 2006 (or the total for the month) and the number of unique IP addresses during that month. Comment here or send me email; include the blog name. So far, frankly, there are no apparent correlations between these two factors and anything else–and maybe that’s the true result. Deadline: July 31.
- Is it feasible to investigate 253 blogs? I honestly have no idea. I’ve allowed six weeks, but that’s only evening and weekend time, and there’s at least one column to be written, probably one little issue of C&I to do, maybe some other C&I essays, and maybe even a little vacation within that time. If I can do 10 an hour, I’ve got time. If I can’t do at least 6 an hour… Of course, I don’t know what the final count will be. Of the first six, one turned out to be an official blog, one didn’t begin until April 2006, and one ended in February 2006 (so the candidate pool is already down to 250)–and of the other three, one took five minutes to review and one [in French] took half an hour. So your guess is as good as mine. I’m sure I won’t lose half the candidates across the board, but I wouldn’t be surprised if the total declined to 200, maybe fewer.