One negative consequence of using PCs for so long is a tendency to do something you know will work–without first trying a long shot that might be a whole lot simpler.
Case in point, via an unfortunately long post:
- I’m working on an “investigative” piece for Cites & Insights on the “biblioblogosphere” (not wild about the term, but it’s convenient shorthand).
- The first step in that long process was to gather together a list of all the candidate blogs for the study–and the second step was to acquire the first measure in the study, a crude estimate of blog readership based on Bloglines subscription count.
- The “list of candidates” came from three obvious sources–well, four, actually:
- I figured the easiest way to do the first two steps was to click to each weblog (printing out the lists to avoid duplication), click on “Sub with Bloglines” on the Firefox Bloglines toolbar, and subscribe to up to three of the most general feeds (if there are multiple feeds).
- Then, I could reset Bloglines to show all listings, click on each feed, add up the numbers, and jot down the number on the printed list–then unsub all but one feed for each weblog.
So I did that, trimming the list as I went based on my baseline criteria for inclusion in the study:
Weblogs by one or a small group of self-identified library people (not “official library” weblogs and not large-group weblogs such as PLA Blog and LISNews), with at least one posting in 2005 (some of the lists don’t edit out dead blogs), and at least one feed (because it’s too hard to investigate otherwise).
I haven’t been too strict on any of the criteria, tending toward inclusion rather than exclusion.
So, after a few hours’ work, I wound up with a Bloglines library section containing 239 weblogs. I wanted an Excel spreadsheet with the names of the weblogs in column A, the feed counts in column B, and other information in other columns as I did the rest of the investigation. And I sure didn’t want to type in 239 names!
I knew you could export a Bloglines list in “OPML.” That turns out to be XML, which is just text. So I opened it in Word, used wildcard replaces to get rid of everything but the blog name in each line, and saved it as a .txt, figuring I could just import it into Excel and go from there.
OK, the knowledgeable readers out there are saying “You idiot…”, but bear with me.
I fired up Excel, went to import, and noticed that the string of “Excel file” extensions in the default option in the open-file box included .html; a little horizontal scrolling showed that it also included .xml. I hadn’t deleted the Bloglines XML output, so I figured, “What the heck?”
Clicked on the Bloglines XML file and, whadda you know? A neat multicolumn spreadsheet with the names in one column, the URLs in another column, and I think one or two other columns, using the XML tags to label each column. Very neat. So I wasted five minutes doing pointless Word edits… And realized at that point that having the URLs in the spreadsheet just might be convenient.
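For the curious, what Excel did with that OPML file can be sketched in a few lines. This is a hypothetical illustration, not the actual Bloglines export: it assumes the usual OPML shape, with the feed metadata carried as attributes on `<outline>` elements (`title`, `htmlUrl`, `xmlUrl`); the real export may have extra attributes or nesting.

```python
import xml.etree.ElementTree as ET

# Rough sketch of what Excel's XML import did: read an OPML export and
# lay the <outline> attributes out as columns (blog name, blog URL).
# The sample document below is made up for illustration.
def opml_to_rows(opml_text):
    root = ET.fromstring(opml_text)
    return [(node.get("title"), node.get("htmlUrl"))
            for node in root.iter("outline")
            if node.get("title")]

sample = """<opml version="1.0">
  <body>
    <outline title="Walt at Random" htmlUrl="http://example.com/"
             xmlUrl="http://example.com/feed"/>
    <outline title="Another Blog" htmlUrl="http://other.example.org/"
             xmlUrl="http://other.example.org/feed"/>
  </body>
</opml>"""

for name, url in opml_to_rows(sample):
    print(name, url)
```

Dropping those rows into a spreadsheet gives you exactly the name-in-column-A, URL-in-another-column layout described above, without any Word surgery.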
The project continues (and boy, were those URLs convenient–particularly in one column where I could use a block replace to change “http://” to “link:” as needed). The list is down to a mere 237 weblogs; I’ve completed the “reach” portion, to determine the somewhere-between-20-and-70 blogs that will get full treatment. (I’ll probably post the whole spreadsheet somewhere when I publish the Perspective.)
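That block replace is simple enough to sketch as well. The URLs here are placeholders, and the “link:” prefix is the search-operator form mentioned above; this is just the column trick expressed as code, not part of the actual workflow:

```python
# Sketch of the block replace described above: turn each blog URL into
# a "link:" query string, the way a column-wide find-and-replace would.
def to_link_query(url):
    return url.replace("http://", "link:", 1)

urls = ["http://example.com/blog", "http://other.example.org/"]
print([to_link_query(u) for u in urls])
```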
Lesson: First see whether your regular software just might be able to do what you’d like it to do, before assuming you need to massage the data first. Pain to learn the lesson: None, really–the five minutes is a tiny slice out of the time this project is taking.
PS: I’d love to just use the Bloglines Directory to add up the subscribers for all feeds for a given weblog. Just one little problem: That directory, which includes several million weblogs, is only accessible by the first letter of the weblog name and lots of paging down. I tried that for a couple of weblogs; it’s just not supportable.
And as for “reach” and using Technorati as a natural source–that assumes (a) that Technorati stays stable long enough to do more than half a dozen searches and (b) that the results make sense (e.g., that you don’t get “zero results” for a URL and, one minute later, “353 lists link to this…” for the same URL). Unfortunately, neither assumption seems to hold. So I’m using other measures. It will all be written up, in what’s already turning out to be an interesting project.