For as long as I’ve used LISHost for various purposes–this blog throughout its history (except for a few months last year), Cites & Insights since mid-June 2006, my personal site since its inception–I’ve used Urchin to track site usage (unless Blake added Urchin more recently). Currently, my sites use Urchin 5. (Apparently, some LISHost sites on another server use Urchin 6, and none of this necessarily applies to them.)
I like Urchin. It defaults to a weekly view with a nice range of options, and you can expand it to a much broader timeline (although it runs into trouble if the timeline is too long or the logs to be analyzed are too large: I’m not sure which). I’ve done reports on an entire year. For the reports I mostly care about–for C&I, file download figures (for PDF) and pageview figures (for HTML)–exporting reports works well. Robots (spiders) are broken out into a separate subsection. The numbers seem consistent–that is, there’s nothing in any of the numbers to suggest faulty logic, and at least some download/pageview numbers are consistent with what I’d expect from other sources.
Recently, I decided to try Google Analytics as an alternative (without disabling Urchin, to be sure). Urchin’s now owned by Google, and I believe Urchin 6 distinctly reflects that–and the ownership does mean that Urchin help is mostly not working very well. Unlike Urchin 5, Google Analytics doesn’t analyze server logs: You have to put tracking code on every page you want it to track, and it relies on calls to Google’s own servers. I only wanted to try it for Walt at Random, and since every page uses the “footer” code, it was easy enough to put the GA code segment into that portion of the site’s HTML–just before the </body> tag, as suggested by GA. (This clearly wouldn’t work well for Cites & Insights, where the numbers I’m most interested in are PDF downloads.)
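For reference, the snippet GA hands out these days looks roughly like this (a from-memory sketch of the ga.js-era code, not the exact code on this site; UA-XXXXXXX-X is a placeholder account ID). It goes in the footer template so it lands just before </body> on every page:

    <script type="text/javascript">
    // Load ga.js from Google's servers, over https if the page itself is https
    var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
    document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
    </script>
    <script type="text/javascript">
    // Record a pageview against the (placeholder) account; the try/catch
    // keeps the page from throwing errors if ga.js never loaded
    try {
      var pageTracker = _gat._getTracker("UA-XXXXXXX-X");
      pageTracker._trackPageview();
    } catch(err) {}
    </script>

If JavaScript is disabled or the script never executes, no call reaches Google at all–one obvious way a log analyzer and GA can disagree.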
I wanted to try GA partly because that’s currently the tracking method for use of the new Drupal-based Library Leadership Network. (The old one used MediaWiki, which has strong usage-reporting built right into the system.)
The code went active on February 15, in the morning, and has now been active for a little more than a week.
And I don’t believe the results.
Some Quick Comparisons
Here’s what I find, comparing GA’s report covering February 15 through February 22 with Urchin’s for the same period–but noting that Urchin’s daily run was apparently yesterday morning, covering a small fraction of yesterday’s use and presumably making GA’s numbers higher by default:
- Sessions: GA reports 491 “visits.” Urchin reports 11,287 “sessions.” (No, there are no typos there: GA is reporting 4.3% of the number of sessions reported by Urchin–just over 1/25th.)
- Pageviews: GA reports 633 pageviews. Urchin reports 29,306. The difference here is even larger: GA is reporting 2.2% as many pageviews as Urchin.
- Visitors: GA reports 406 visitors (which means almost nobody came back–82.69% new visits). Urchin reports 2,005 IP addresses, which I take to be the same thing as visitors. A much smaller difference here, since Urchin seems to find people returning. Still, GA’s reporting only 20% as many different IP addresses as Urchin.
- Popular pages: GA says that only two current posts were visited 20 times or more–the “Social Networks/Social Media Snapshot” with 31 visits and “Open Access and Libraries: Be My Guest” with 29. (Things drop rapidly after that, with, for example, “Catching Up (sort of, a little bit)” getting 11 views.) By comparison, Urchin shows 206 pageviews for the Open Access post, 162 for Social Networks and 110 for “Catching Up”–and an LLN repost with 151 views in the middle.
At Least One Of These Must Be Wrong
So which is it? Does this blog have a very small readership with very active commenting, which would have to be the case for the GA numbers to be right, or is GA massively undercounting for various reasons?
While it wouldn’t much bother me if the first were true, it does seem a little out of proportion to the 830+ feedreader subscriptions for this blog as of today–and, frankly, to the number of downloads for the Open Access and Libraries PDF (28 during that same period).
I’ve already been told (a) that Google Analytics won’t work if a user doesn’t have JavaScript enabled or doesn’t allow cookies, and (b) that GA is apparently intolerant of less-than-perfect HTML. It’s also quite possible that (c) I somehow mangled the code cut-and-paste–but in that case you’d expect no stats at all, or at least not the kind of stats I’m seeing. (161 pages visited during the 8 days–but visited very rarely.)
For the blog, I really don’t care. I’ll probably remove the GA tracking code after a while, and I’ll certainly rely on Urchin for numbers. For Cites & Insights, where there’s a reason to care, I can’t really use GA in any case–I’m not going to add tracking code to all the HTML articles, so all I’d be tracking is visits to the site, not readership for the publication.
For Library Leadership Network…well, there I care.
“that GA is apparently intolerant of less-than-perfect HTML”
I think this is your problem. I ran the front page through the W3C HTML validator (validator.w3.org), and it complained extensively.
97 Errors, 5 warning(s)
Most of the errors are trivial, e.g. uppercase where the spec indicates lowercase. But there look to be some significant problems with various tags not being properly nested and closed. I suspect the GA code is not being parsed in many cases.
Seth: That’s really interesting–particularly given that I made wholly trivial modifications to the template I’m using. As time permits, I’ll run the validator again and see whether it’s plausible to fix things.
Of course, a broader question is whether I should need to: I’ve heard few complaints of people being unable to read the blog, and lots of other people use the same template. A stats package that requires 100% pure adherence to a notoriously forgiving code scheme is biased toward failure…not inherently a good thing.
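For illustration only (made-up markup, not taken from the blog’s actual pages), here’s the kind of thing browsers render without complaint but a strict validator rejects:

    <!-- Browsers shrug this off, but an XHTML validator flags the uppercase
         tag names, the unclosed <p> and <br>, and the badly nested <B>/<I>. -->
    <P>Some text<BR>
    <B>bold and <I>italic</B> text</I>

    <!-- The strictly valid equivalent: -->
    <p>Some text<br />
    <b>bold and <i>italic</i> text</b></p>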
Very interesting. I’ve used GA for our library website for its entire existence, and while I’ve noticed huge discrepancies between its numbers and those from the webhost statistics programs, I’ve stuck with GA in order to keep the data source consistent. I just decided to check the site’s markup again, though, and it is rife with errors due to some changes I made awhile back — although our hit counts in GA continue to go up, slowly but steadily.
Update: I’ve fixed all HTML errors in the primary templates (about 1/4 of them from other people’s code, 3/4 of them “errors” in capitalization from my own code). That is to say, the W3C validator now validates walt.lishost.org itself, as of 10 minutes ago, as valid XHTML 1.0 Transitional.
There are probably still loads of HTML errors in individual posts going back beyond the first page, most of them more “errors” than errors. We’ll see what this does to the GA numbers. It would take bloody forever to go through all the archives, cleaning up each post’s “errors.” And, of course, log analysis doesn’t require such fussiness.