First, try Excel

One negative consequence of using PCs for so long is a tendency to do something you know will work–without first trying a long shot that might be a whole lot simpler.

Case in point, via an unfortunately long post:

  • I’m working on an “investigative” piece for Cites & Insights on the “biblioblogosphere” (not wild about the term, but it’s convenient shorthand).
  • The first step in that long process was to gather together a list of all the candidate blogs for the study–and the second step was to acquire the first measure in the study, a crude estimate of blog readership based on Bloglines subscription count.
  • The “list of candidates” came from three obvious sources–well, four, actually:
  • I figured the easiest way to do the first two steps was to click to each weblog (printing out the lists to avoid duplication), click on “Sub with Bloglines” on the Firefox Bloglines toolbar, and subscribe to up to three of the most general feeds (if there are multiple feeds).
  • Then, I could reset Bloglines to show all listings, click on each feed, add up the numbers, and jot down the number on the printed list–then unsub all but one feed for each weblog.
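
The arithmetic in that last step is simple enough to sketch in a few lines of Python. This is only an illustration, not what was actually done (the counts were added up by hand): the weblog names and subscriber numbers are made up, and the function assumes the feeds for each weblog are already listed most-general first, so that "up to three" means the first three.

```python
# Hypothetical sketch: names and counts below are invented for illustration.
MAX_FEEDS_PER_WEBLOG = 3  # the post subscribes to at most three feeds per weblog


def total_subscribers(feed_counts, max_feeds=MAX_FEEDS_PER_WEBLOG):
    """Sum subscriber counts for at most `max_feeds` feeds of one weblog.

    Assumes `feed_counts` is ordered with the most general feeds first.
    """
    return sum(feed_counts[:max_feeds])


# Per-feed subscriber counts, keyed by weblog name (made-up data).
sample_counts = {
    "Example Blog One": [120, 45],
    "Example Blog Two": [300, 80, 12, 5],  # fourth feed is ignored
}

totals = {name: total_subscribers(counts) for name, counts in sample_counts.items()}
print(totals)
```

The cap matters mostly for weblogs that publish a feed per category; without it, a handful of narrow feeds could pad the total.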

So I did that, trimming the list as I went based on my baseline criteria for inclusion in the study:

Weblogs by one or a small group of self-identified library people (not “official library” weblogs and not large-group weblogs such as PLA Blog and LISNews), with at least one posting in 2005 (some of the lists don’t edit out dead blogs), and at least one feed (because it’s too hard to investigate otherwise).

I haven’t been too strict on any of the criteria, tending toward inclusion rather than exclusion.

So, after a few hours’ work, I wound up with a Bloglines library section containing 239 weblogs. I wanted an Excel spreadsheet with the names of the weblogs in column A, the feed counts in column B, and other information in other columns as I did the rest of the investigation. And I sure didn’t want to type in 239 names!

I knew you could export a Bloglines list in “OPML.” That turns out to be XML, which is just text. So I opened it in Word, used wildcard replaces to get rid of everything but the blog name in each line, and saved it as a .txt, figuring I could just import it into Excel and go from there.

OK, the knowledgeable readers out there are saying “You idiot…”, but bear with me.

I fired up Excel, went to import, and noticed that the string of “Excel file” extensions in the default option in the open-file box included .html; a little horizontal scrolling showed that it also included .xml. I hadn’t deleted the Bloglines XML output, so I figured, “What the heck?”

Clicked on the Bloglines XML file and, whadda you know? A neat multicolumn spreadsheet with the names in one column, the URLs in another column, and I think one or two other columns, using the XML tags to label each column. Very neat. So I wasted five minutes doing pointless Word edits… And realized at that point that having the URLs in the spreadsheet just might be convenient.
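
For readers who would rather script the same conversion than rely on Excel's XML import, here is a minimal Python sketch. It assumes the export follows the common OPML subscription-list layout, with nested outline elements carrying title, htmlUrl, and xmlUrl attributes; the sample data and the function names are invented for illustration, and a real Bloglines export may differ in detail.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample in the usual OPML subscription-list shape (an assumption,
# not a copy of any real export).
SAMPLE_OPML = """<opml version="1.0">
  <head><title>Sample subscriptions</title></head>
  <body>
    <outline title="Library">
      <outline title="Example Blog One" htmlUrl="http://example.org/one/"
               type="rss" xmlUrl="http://example.org/one/feed"/>
      <outline title="Example Blog Two" htmlUrl="http://example.org/two/"
               type="rss" xmlUrl="http://example.org/two/feed"/>
    </outline>
  </body>
</opml>"""


def opml_to_rows(opml_text):
    """Return (name, site URL, feed URL) tuples, one per feed outline."""
    root = ET.fromstring(opml_text)
    rows = []
    for outline in root.iter("outline"):
        feed_url = outline.get("xmlUrl")
        if not feed_url:
            continue  # folder outlines have no feed URL; skip them
        name = outline.get("title") or outline.get("text") or ""
        rows.append((name, outline.get("htmlUrl", ""), feed_url))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "site_url", "feed_url"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = opml_to_rows(SAMPLE_OPML)
print(rows_to_csv(rows))
```

Keeping the site URL alongside the name costs nothing here, and it is the column that turned out to be useful later.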

The project continues (and boy, were those URLs convenient–particularly in one column where I could use a block replace to change “http://” to “link:” as needed). The list is down to a mere 237 weblogs; I’ve completed the “reach” portion, to determine the somewhere-between-20-and-70 blogs that will get full treatment. (I’ll probably post the whole spreadsheet somewhere when I publish the Perspective.)
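
The block replace mentioned above is just a prefix swap. As a small Python sketch (the function name and sample URLs are hypothetical), it might look like this:

```python
def to_link_query(url):
    """Swap a leading "http://" for "link:", as in the block replace above."""
    prefix = "http://"
    if url.startswith(prefix):
        return "link:" + url[len(prefix):]
    return url  # leave anything without the expected prefix untouched


sample_urls = ["http://example.org/one/", "http://example.org/two/"]
queries = [to_link_query(u) for u in sample_urls]
print(queries)
```

Only the leading prefix is replaced, so a URL that happens to contain "http://" further along is not mangled.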

Lesson: First see whether your regular software just might be able to do what you’d like it to do, before assuming you need to massage the data first. Pain to learn the lesson: None, really–the five minutes is a tiny slice out of the time this project is taking.

PS: I’d love to just use the Bloglines Directory to add up the subscribers for all feeds for a given weblog. Just one little problem: That directory, which includes several million weblogs, is only accessible by the first letter of the weblog name and lots of paging down. I tried that for a couple of weblogs; it’s just not supportable.

And as for “reach” and using Technorati as a natural source–that assumes (a) that Technorati stays stable long enough to do more than half a dozen searches and (b) that the results make sense (e.g., that you don’t get “zero results” for a URL and, one minute later, “353 lists link to this…” for the same URL). Unfortunately, neither seems to be the case. So I’m using other measures. It will all be written up, in what’s already turning out to be an interesting project.

13 Responses to “First, try Excel”

  1. Charles Bailey Says:

    Other library blog directories can be found at: http://www.escholarlypub.com/digitalkoans/2005/06/20/navigating-the-library-blogosphere/.

  2. walt Says:

    Charles,

    Thanks for that–and, whew, it doesn’t look as though any of the other directories would add substantially to my target list, since most of the ones not already checked are library weblogs, rather than weblogs by library people.

  3. Meredith Says:

    Here’s something I thought you should know about Bloglines feed counts that I just discovered. I’ve counted 5 feeds for my site (there used to be only 2) since I upgraded to WordPress 1.5 a few weeks ago. Yet the only subscriptions that come up now when Bloglines looks up my blog are two of the most recent (which show about 9 subscribers). I don’t know if this has happened to everyone who upgraded WordPress, but I thought I’d let you know.

    Another interesting way to look at Blog stats that just came out is blogpulse profiles http://www.blogpulse.com/profile (which determines a blog’s rank by the number of citations to it in the past 30 days). Blogpulse profiles offers all sorts of interesting stats on each blog.

  4. walt Says:

    Yeah, I’m aware that Bloglines subscription counts are as iffy as any other metric for measuring blog readership (or much of anything else on the web, except–maybe–your own site). I’ll write the many paragraphs of caveats after I do the investigation portion.

    The problem with Blogpulse is that it’s so much “of the minute,” fine if that’s what you’re looking for–also a problem with Technorati. Of course, this is all nonsense, but I think the article will be interesting nonsense. Now, off to do more investigation…

  5. Dorothea Salo Says:

    Another thing about Bloglines — Meredith hinted at it — is that a blog with more than one feed has split subscription numbers. I’m guessing you already dealt with that, but just in case you didn’t…

    If you’re really gung-ho, you can try searching LiveJournal for feeds aggregated there. Good luck, though; LJ’s search is worse even than any OPAC I’ve ever used. Best you can do is ask people to tell you if they’re aggregated over there. (I am. 17 LJ subscribers.)

  6. Dorothea Salo Says:

    Duh, of course you figured it out. Never mind.

    Searching on the blog (not feed) URL instead of trying to use Bloglines’ list (what were they thinking?) probably works.

  7. walt Says:

    As far as I can tell, the search box in Bloglines always searches blog posts, even if it’s on the directory page. I just tried a couple of variations of searches for blogs that I know have substantial numbers of feeds, with no results.

    As for how gung-ho I am: Not that gung-ho. That’s also why I deliberately limited Bloglines feeds to a maximum of three for any weblog, trying to pick the most obvious candidates. I’ve already concluded that any single measure of a weblog is likely to be wildly misleading–and, for that matter, that any combination of measures provides indication, not validation.

    I’ve actually finished (almost all of) the research now and am writing up the results and observations along the way. The report, which will be the feature article in the September C&I, will not have a big chart purporting to show exact importance rankings or anything of the sort; it will have a variety of metrics describing the group of blogs fully studied (which total 60 in all, out of the 239 I started with). And I’m removing the personal observations I was starting to make about some blogs, because I don’t need the grief–particularly from the one case where I honestly don’t know what to make of the changes in a blogger’s personality.

  8. Dorothea Salo Says:

    That’s probably a good idea. There’s enough grief in the world; no need to add more.

    Should be an interesting article. I look forward to it. Shame these metrics have to be so hard to compile, but I suppose there’s no other way unless you want to depend on self-report.

    If you need a blogger to pick on, you know I won’t bite you. :)

  9. tangognat Says:

    WordPress offers a feed for every category your blog has. I think it has had this kind of capability for some time, but it can be tricky to figure out the addresses for the category feeds.

    I’m looking forward to the article; it sounds interesting. Although I tend to like personal observations :)

  10. walt Says:

    Those category feeds are one of the reasons I’ll have loads of caveats about the quality of the data in the report. Some metrics are very good quality, some are much more iffy.

    Oh, there will be loads of personal observations–it’s C&I, after all–but, well, there’s one high-profile case that I’m just not ready to tackle. Neither is anyone else, apparently. I even wrote a blind item in my LISNews journal. Thought about it. And didn’t press Save. So I’m going to be light on “judgmental” commentary. This time around.

    If I had all the time in the world, I’d love to extend the study to, say, the “top 100” in my Reach calculus–but as it is, this article far exceeds my usual “one hour per thousand words” guideline; it’ll run around 6,000-7,000 words, and I’ve put at least 20 hours of effort into it. Worth it, maybe. We’ll see.

  11. Daniel Cornwall Says:

    Good discussion you’ve got going here. Count me in as looking forward to the article. I’d say you could pick on my blog, but technically my LISNews Journal isn’t a blog!

    Thanks for your research and writing, always interesting.

    Completely off topic, I finished reading Being Analog a few months back. Still really relevant to discussions on the digital future. Plus your section on numbers literacy should be required reading in all Library School “Intro to Research Methods” classes.

  12. walt Says:

    I do get great discussions! Some of which, some day, will wind up as C&I fodder.

    Nah, I’m not counting the LISNews “blog lites” as blogs. Just as I’m not counting LISNews itself–partly because I don’t think it’s a blog, partly because it has considerably more than four authors.

    Thanks for the kind words about Being Analog–and particularly re the numeracy chapter. I still remember that Steve Coffman savaged me for that chapter in his review of the book, basically saying that no librarian could possibly not know all this stuff.

  13. Daniel Cornwall Says:

    Steve Coffman says many things.

    I actually started a book review of Being Analog, then I lost the book in a stack in my bedroom. It finally emerged last week. So, maybe I’ll get round to finishing and posting the review in the next few weeks. Not that I have the same name recognition as Steve Coffman. :-)

