Bloglines 3: The saga continues

This continues the ongoing story that began here and continues here. This blather will be summarized to some extent in the introduction to “Looking at Liblogs: The Great Middle,” scheduled for the September 2006 Cites & Insights.

I actually did the same set of “reach” measures as last year for the full set of 282 potential candidates in the “great middle” category–well, more or less. Having a spreadsheet that includes name and URL for each candidate blog, prepared from the OPML output from Bloglines, made the process of searching for links quite rapid (much more so than however I did it last year).

This time around, I checked the link: count for Google (which is known to be somewhat meaningless–Google admits it’s only a partial result), MSN Search, and Yahoo! (in lieu of last year’s AllTheWeb, noting that Yahoo! owns AllTheWeb). But I also recorded one other figure, which I believe is much more meaningful than link: returns–that is, how many results Yahoo! would actually show me, using its default deduping “very similar” algorithms.

Before offering up some quick ratios for the 282 candidate blogs, remember that these blogs exclude around 90 librarian blogs with more than 196 Bloglines subscriptions as well as close to two hundred with fewer than 19 Bloglines subscriptions. Thus, most of the blogs likely to have the highest “reach” were excluded up front.

  • Google: Results ranged as high as 5,370, with some having no Google links; no ratio is possible.
  • MSN: Results ranged as high as 34,669–and again, some had no links, so no ratio is possible.
  • Yahoo!: Every blog in the great middle had at least five Yahoo! links, with a high of 179,000 or a ratio of 35,800:1, within this “middle” group.
  • Yahoo Results: This is, to be sure, artificially constrained (Yahoo always stops at or before 1,000), but only three candidate blogs reached the 1,000 limit. The smallest number was 2, for a 500:1 ratio. This number seems to be much less influenced by blogrolls and other factors that artificially inflate link results.
  • “Reach” using 2005 formula: The highest was 13,497; the lowest, 84. That’s a ratio of 161:1, considerably smaller than last year’s 7,778:1.
  • “Reach” using modified formula: When I adjusted deflators for the three link counts to match this year’s totals, the highest came down to 10,590, while the lowest only declined to 82, reducing the ratio to 128:1. Note that the “top 60” last year had a ratio of 65:1 between highest and lowest reach.
  • Plausible reach: I calculated a new measure based on twice the Bloglines count plus the Yahoo Results count. That yielded a high of 1,387 and a low of 51, for a ratio of only 27.2:1.

I then trimmed the candidate set slightly, by dropping 9 blogs with “plausible reach” counts above 700 and 21 below 70, leaving 253 candidate blogs. Note that the ratio between highest and lowest “plausible reach” is only 10:1, a fairly narrow range–and the same as the ratio between most and fewest Bloglines subscribers.
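
For those who like to see the arithmetic spelled out, here is a minimal Python sketch of the “plausible reach” calculation and the trimming step. Only the formula (twice the Bloglines count plus the Yahoo Results count) and the 70 and 700 cutoffs come from the text above. The blog names and the individual counts are made up for illustration; they were chosen so the extremes reproduce the reported high of 1,387 and low of 51, and they do not describe any real blog in the study.

```python
from dataclasses import dataclass


@dataclass
class Blog:
    name: str            # hypothetical blog name
    bloglines: int       # Bloglines subscription count
    yahoo_results: int   # results Yahoo! actually displays (never more than 1,000)


def plausible_reach(blog: Blog) -> int:
    """Twice the Bloglines count plus the Yahoo Results count."""
    return 2 * blog.bloglines + blog.yahoo_results


def high_low_ratio(values):
    """Ratio of the highest value to the lowest (the lowest must be nonzero)."""
    return max(values) / min(values)


def trim(blogs, low=70, high=700):
    """Drop blogs with plausible reach above `high` or below `low`."""
    return [b for b in blogs if low <= plausible_reach(b) <= high]


# Made-up sample data, not taken from the study's spreadsheet.
sample = [
    Blog("Example A", bloglines=196, yahoo_results=995),  # reach 1387
    Blog("Example B", bloglines=150, yahoo_results=400),  # reach 700
    Blog("Example C", bloglines=30, yahoo_results=10),    # reach 70
    Blog("Example D", bloglines=19, yahoo_results=13),    # reach 51
]

reaches = [plausible_reach(b) for b in sample]
kept = trim(sample)

print(round(high_low_ratio(reaches), 1))                    # ratio before trimming
print([b.name for b in kept])                               # blogs surviving the trim
print(high_low_ratio([plausible_reach(b) for b in kept]))   # ratio after trimming
```

With this toy data the ratio falls from about 27:1 to 10:1 once the outliers are dropped, which mirrors the narrowing described above.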

I’ll make the “reach spreadsheet” available when I publish the article, for those who want to play with the sorting and formulas.

But…

I’m not using any reach factor, including Plausible Reach, as part of the metrics for individual blogs. These factors were only used to narrow the group of candidate blogs to a somewhat manageable number. I believe the new number is more, well, plausible than last year’s–but I don’t believe any “reach” numbers (specifically including Technorati, PubSub, et al) really tell you that much about a blog, particularly if the blogger isn’t aiming to be a superstar.

Now begins the interesting part: Looking at the blogs themselves, preparing brief comments, and preparing a set of metrics that highlights interesting aspects of individual blogs without attempting to rank them. I’ve dropped some metrics (Technorati, BlogPulse, link density). I’m dropping any comments about the “voice” of a blog, which makes particularly good sense since I’ve included blogs in languages other than English.

I’ve added a couple of “interesting items,” one of which is a metric of sorts: The topic of the first post within the test period (March through May 2006), and the topic and comment count for the post with the most comments during that period. The latter (which could disappear during the investigation) is based on a suggestion made on a WebJunction forum.

What order will blog notes appear in? Silly as it may be, the best choice turns out to be alphabetic: It’s not hierarchical and it makes sense to most people who use the Latin alphabet. (Hey, my blog–which won’t be part of the study–certainly isn’t advantaged by an alphabetic sequence!)

Two notes in closing, for now:

  • I’m still soliciting feedback from bloggers who can ascertain the average daily number of sessions during May 2006 (or the total for the month) and the number of unique IP addresses during that month. Comment here or send me email; include the blog name. So far, frankly, there are no apparent correlations between these two factors and anything else–and maybe that’s the true result. Deadline: July 31.
  • Is it feasible to investigate 253 blogs? I honestly have no idea. I’ve allowed six weeks, but that’s only evening and weekend time, and there’s at least one column to be written, probably one little issue of C&I to do, maybe some other C&I essays, and maybe even a little vacation within that time. If I can do 10 an hour, I’ve got time. If I can’t do at least 6 an hour… Of course, I don’t know what the final count will be. Of the first six, one turned out to be an official blog, one didn’t begin until April 2006, and one ended in February 2006 (so the candidate pool is already down to 250)–and of the other three, one took five minutes to review and one [in French] took half an hour. So your guess is as good as mine. I’m sure I won’t lose half the candidates across the board, but I wouldn’t be surprised if the total declined to 200, maybe fewer.
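
The time question in that last note is simple division, sketched below in Python. The candidate counts (253 and a possible 200) and the rates (10 and 6 blogs an hour) are the ones mentioned above; the function name is just a label for this illustration, and the sketch says nothing about how many evening and weekend hours are actually available.

```python
def hours_needed(blog_count: int, blogs_per_hour: float) -> float:
    """Hours required to review blog_count blogs at a steady rate."""
    return blog_count / blogs_per_hour


# Candidate counts and review rates taken from the note above.
for count in (253, 200):
    for rate in (10, 6):
        print(f"{count} blogs at {rate} an hour: {hours_needed(count, rate):.1f} hours")
```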

6 Responses to “Bloglines 3: The saga continues”

  1. walt says:

    Ah, Thom, but I never went to library school.

    Which is probably why I’m not too strong on Mc vs. Mac…

    [Some time when I’m even older but no grayer, I might tell the story of how I got my first real library automation job, by being able to prove that the five call number systems used at UC Berkeley’s Doe Library could be keypunched with simple instructions to student staff, so that the cards would sort and interfile properly…based on observations during three+ years of paging, shelving, and other fun student library flunkie stuff. After professionals, not necessarily librarians, had tried three times and consistently given up. That was in 1968…]

  2. Walter Skold says:

    Hey Walt,

    Walter here.

    FYI. FREADOM http://4freadom.blogspot.com/

    is a library blog, in case the search didn’t grab us.

  3. walt says:

    WalterS: Well, you need to add it to the LISWiki blogs page.

    However…your total of Bloglines subscriptions is, shall we say, decidedly lower than the cutoff for this year’s study.

    Not that I’m lacking what could be considered right-of-center blogs; I’m not (and I don’t plan to apply political labels this year, after being flamed by NBruce last year).

    For some reason, Spam Karma 2 flagged your message as spam. So far, I can still check the list…although if the spammers get more active, I’ll give up.

  4. Laura says:

    Perhaps they’re explained elsewhere (or will be), or perhaps I’m just missing something obvious, but I’m somewhat baffled about what exactly the ratios you are measuring are. I get that it has something to do with how many links a blog is getting, as measured by various search engines, but beyond that I’m baffled.

    That said, I do look forward to seeing the finished study. There aren’t many people in the world who collect numbers, play with them, and then write thoughtfully about the process–I’m glad one of them is a member of library-land.

  5. walt says:

    Laura,

    The ratios mentioned in this post are the ratios between highest and lowest for any given measure, within the 282 “midrange” blogs for which I measured links.

    Thus, for example, the link: count (that is, the number returned at the top of the screen) on Yahoo! was 179,000 at its highest within this group of 282, and 5 at its lowest. That’s a ratio of 35,800:1.

    Since I’d constrained the “midrange” group beforehand, only considering blogs with 19 to 196 total Bloglines subscriptions (roughly a 10:1 ratio), I found it interesting that, for the narrowed 252 blogs, the “plausible reach” ratio was also 10:1.

    All of the ratios mentioned here involve Bloglines count or link: results, including the “real” Yahoo! link: result–that is, how many sites it actually shows rather than the wildly higher number it displays above those results.

    Here’s the thing: NONE of these ratios will be part of the final article, at least not on a blog-by-blog basis (although the spreadsheet with all the ratios will be available for numbers junkies). What I’m doing now is looking at more interesting metrics based on March-May 2006–number of posts, amount of text and words-per-post, number of comments and comments-per-post, and a couple of things I’m still trying out. None of these are “good” or “bad” metrics–but they are, I believe, interesting metrics.

    I’d like to say I’m not a numbers junkie (partly true, partly false), but I do try to be a sense-maker. Whether that will work out this time around…well, that’s not for me to say!

    My side project–asking for unique visitors and average visits per day during May–may come up with some “non-conclusions,” most probably that no publicly available metric really says all that much about how many readers a blog has. Fortunately, in 2006, the growing prevailing wisdom is that “how many readers” isn’t that important for most blogs.