There’s a question I’ve seen asked–or, in some cases, given assumed answers–any number of times, and I’ve never seen or had anything like an answer. The question, in one form at least:
How many liblogs are active at any given time?
I think the more interesting form of that question is “…and how has that changed over time?”
Maybe even the next question: “When were the most liblogs active?”
So far, I’ve never been able to come up with anything like an answer–and I’ve always felt that the presumed answer I’ve sometimes seen, based on my hobby/research, has been misleading. (That answer: “Around 500 at any given time.”)
The status
As of Friday, I’ve finished the drafts for Chapters 2 and 3 of The Liblog Landscape 2007-2010, my semi-comprehensive look at English-language liblogs (“semi-” because some have completely disappeared and some were doubtless missed), covering 1,304 liblogs that were still visible on the web in early summer 2010.
Chapter 2 is mostly methodology and background data.
Chapter 3 is sort of the “miscellany” chapter, also called “How, Where and When.” It covers the blogging software used for liblogs, the countries they come from, when they began and how long they’ve lasted (through May 31, 2010). It also covers currency–that is, how many weeks before June 1, 2010 the most recent post (up to May 31) appeared.
I plan to start Chapter 4 this week: The Big Picture, based on metrics that I hadn’t gathered in previous surveys. (That’s if I don’t decide to work on C&I essays instead; most likely, I’ll interleave the two.) The metrics, for some but not quite all of the 1,304 blogs: How many posts the blog had from its inception through May 31, 2010–and a derivative figure, the average number of posts per month during the life of the blog.
The minor epiphany
This morning, as I was reading the Sunday paper, I had a thought:
I recorded the starting year and starting month in separate columns, both numbers…and the longevity of the blog in months in a third column. With the right formulas, I could use those three columns to determine whether a blog is active at any given point–where “active” means “had a post in or before this month and in or after this month.”
Actually, I first realized that it was possible to populate a huge array of 0s and 1s, then add up each column in that array to find out how many blogs were active at that point. It was a while later that I figured out the formula to populate that array automatically, making it not only possible but practical.
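For readers who would rather see the idea outside Excel, here is a minimal Python sketch of the same 0/1 array and its column sums. It is not the spreadsheet formula itself. The function names and sample blogs are hypothetical. It also assumes that the longevity figure counts the starting month, so a blog with a longevity of 1 is active only in the month it began.

```python
def month_index(year, month):
    """Convert a calendar year and month (1-12) into one running month count."""
    return year * 12 + (month - 1)


def is_active(start_year, start_month, longevity_months, year, month):
    """Return 1 if the blog had begun by (year, month) and had not yet ended, else 0."""
    first = month_index(start_year, start_month)
    # Assumption: longevity includes the starting month.
    last = first + longevity_months - 1
    return 1 if first <= month_index(year, month) <= last else 0


def active_counts(blogs, months):
    """Build the 0/1 array (one row per blog, one column per month) and sum each column."""
    matrix = [[is_active(*blog, y, m) for (y, m) in months] for blog in blogs]
    return [sum(column) for column in zip(*matrix)]


# Hypothetical sample data: (start year, start month, longevity in months).
blogs = [(2005, 3, 40), (2008, 1, 6), (2009, 11, 7)]
# Hypothetical points in time to check.
months = [(2006, 6), (2008, 3), (2010, 5)]

print(active_counts(blogs, months))  # [1, 2, 1]
```

Each row of the array is one blog and each column is one point in time, so the column sums are the number of blogs active at that point.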
I also realized that this is another one of those little Excel chores that I probably wouldn’t have attempted, say, 5 or 10 years ago–I would have assumed that Excel would have broken or that it would take hours to do the calculations or that I just wouldn’t have enough memory to handle the whole thing.
See, here’s the thing: The matrix involves 46 columns and 1309 rows, with formulas in 42 of those columns and 1304 of those rows. That’s close to 55,000 formulas–each one different–and an overall matrix involving 60,214 cells. Just to get three or four tables and graphs, or maybe only one table and one graph.
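Those counts are easy to check. A quick sketch of the arithmetic, using only the row and column figures given above (the variable names are mine):

```python
# Overall matrix: 1,309 rows by 46 columns.
total_cells = 1309 * 46

# Cells holding formulas: 1,304 rows by 42 columns.
formula_cells = 1304 * 42

print(total_cells)    # 60214
print(formula_cells)  # 54768
```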
The process
In reality, it wasn’t particularly difficult–I think it took half an hour, maybe a little longer (and that includes figuring out how to transpose a set of rows and columns to make later handling easier). While each of the 54,768 formulas (each involving a double-If) is different, they were mostly auto-generated. That is: I wrote the formula for the first row and column, copied it across the 42 columns, then modified the formula in each column (changing a relative cell to an absolute cell). Then I copied the 42-column row to the other 1,303 rows…and added (and copied) the summations to give me the 42 totals (actually four times that many, as there’s an interesting way to split the blogs).
How long did the calculation take? As usual, it seemed to be done as soon as I finished the copy-and-paste operation–certainly no more than a second. On a 1.6GHz Core 2 Duo notebook, not a wonderfully fast CPU. (The massive matrix with formulas seems to occupy about 400 kilobytes [corrected from 200 megabytes], but of course I can delete the whole thing now that I have the summary numbers.) NOTE ADDED Monday, October 18: I saved that page as a spreadsheet for possible use next year. The spreadsheet uses about 400 kilobytes–the increase in size for the overall spreadsheet was closer to 200 kilobytes, not megabytes!
The results
Ah, well, for that you’ll have to wait a while. Those results will be part of Chapter 4. If all goes as planned, all or part of that chapter may appear in a future C&I–most likely the second issue of 2011 that isn’t a single-essay issue. I’d guess that the book version will appear at the same time or a little earlier.
I wouldn’t have tried this five years ago and maybe not three years ago. I’m glad I did–the results are interesting (I would say “not quite what I expected” but I’m not sure I had an expectation).
Hmm. I wonder how long it’s been that it’s not only plausible but trivial to build and calculate a matrix involving more than 60,000 cells and nearly 55,000 different formulas? On a cheap notebook computer? Now there’s an interesting question for which I have no answer.