Why Anonymize?

The project plan for Gold Open Access Journals 2011-2015 calls for me to make an anonymized version of the master spreadsheet freely available—and as soon as the project was approved, I made an anonymized version of the 2014 spreadsheet available.

Two people raised the question “Why anonymized?”—why don’t I just post the spreadsheet including all data, instead of removing journal names, publishers and URLs and adding a simple numeric key to make rows unique?
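
For what it’s worth, the anonymization step itself is mechanical. Here is a minimal sketch of the idea in Python, assuming hypothetical file and column names (the actual master spreadsheet’s layout may differ):

    # Sketch of the anonymization step described above: drop identifying
    # columns and add a simple numeric key so rows stay unique.
    # File and column names here are hypothetical.
    import pandas as pd

    master = pd.read_excel("master2014.xlsx")
    anonymized = master.drop(columns=["Journal", "Publisher", "URL"])
    anonymized.insert(0, "Key", range(1, len(anonymized) + 1))
    anonymized.to_excel("anonymized2014.xlsx", index=False)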

The short answer is that doing so would shift the focus of the project from patterns and the overall state of gold OA to specifics, and lead to arguments as to whether the data was any good.

Maybe that’s all the answer that’s needed. I counted very little use of the 2014 spreadsheet in January and February 2016, but it has been used more than 900 times in the first half of March 2016, and I have received no further queries as to why it’s anonymized. For any analysis of patterns, of course, journal names don’t matter. But maybe a slightly longer answer is useful.

That longer answer begins with the likelihood that some folks would try to undermine the report’s findings by claiming that the data is full of errors—and the certainty that such folks could find “errors” in the data.

Am I being paranoid in suggesting that this would happen? Thanks to Kent Anderson, I can safely say that I’m not, since within a day or two of my posting the spreadsheet, he tweeted this:

Anderson didn’t say “Am I misunderstanding?” or “Clarification needed” or any alternative suggesting that more information was needed. No: he went directly on the attack with “Errors exist” (by completely misreading the dataset, as it happens: around 500 gold OA journals began publication, usually not as OA, between 1853 and 1994).

It’s not wrong, it’s just different

To paraphrase Ed and Patsy Bruce (they wrote the song, even though Willie Nelson and Waylon Jennings had the big hit with it)…

If somebody else—especially someone looking to “invalidate” this research—goes back to do new counts on some number of journals, they will probably get different numbers in a fair number of cases.

Why? Several reasons:

  • Inclusiveness: Which items in journals—and which journals—do you include? The 2014 count tended to be more exclusive when I had to count each article individually; the 2015 count tends to include all items subject to some form of review, including book reviews and case reports. Similarly, the 2015 report includes journals that consist of (reviewed) conference reports (although I’ll note the subset of such journals).
  • Shortcuts: I did not in fact look at each and every item in each and every issue of each and every journal, compare it to that journal’s own criteria for reviewed or peer-reviewed, and determine whether to include it. To do that, I’d estimate that a single year’s count would require at least 2,000 hours exclusive of determining APC existence and levels and all other overhead—and, of course, a five-year study would require four times that amount (fewer journals and articles in earlier years). That’s not plausible under any circumstances. Instead, I used every shortcut that I could: publication-date indexes or equivalent for SciELO, J-Stage, MDPI, Dove and several others; DOI numbers when it’s clear they’re assigned sequentially; numbered tables of contents; Find (Ctrl-F) counts for distinctive strings (e.g., “doi:” or “HTML”) after quick scans of the contents tables. For the latter, I did make rough adjustments for clear editorials and other overhead. A rough illustration of that string-count shortcut appears just after this list.
  • Estimates: In some cases—fewer in 2015 than in 2014, but still some—I had to estimate, as for instance when a journal with no other way of counting publishes hundreds of articles each year and maintains page numbering throughout a dozen issues. I might count the articles in one or two issues, determine an average article length, and estimate the year’s total count based on that length. (A sketch of that arithmetic also follows the list.) I also used counts from DOAJ in many cases, when those counts were plausible based on manual sampling.
  • Errors: I’m certain that my counts are off by one or two in some cases; that happens.
  • Late additions: Some journals, especially those that are issue-oriented and still include print versions, post online articles very late. Even though I’m retesting all cases where the “final issue” of 2015 seemed to be missing when checked in January-March 2016, it’s nearly certain that somebody looking at some journals in, say, August 2016 will find more 2015 articles than I did.
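
To make the string-count shortcut a little more concrete, here is a rough sketch of the idea in Python, assuming a saved table-of-contents page; the file name and the crude editorial adjustment are hypothetical, and this illustrates the approach rather than the exact procedure I followed:

    # Rough sketch of the Ctrl-F shortcut: count a distinctive string in a
    # journal's table-of-contents page, then back out obvious overhead items.
    # The file name and the editorial adjustment are hypothetical.
    toc = open("toc_2015.html", encoding="utf-8").read()

    raw_count = toc.count("doi:")                # one "doi:" per listed item, if assignment is consistent
    editorials = toc.lower().count("editorial")  # crude flag for overhead items to subtract
    print(raw_count - editorials)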
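
The page-numbering estimate is plain arithmetic; here is a sketch with made-up numbers:

    # Sketch of the page-based estimate: count articles in a couple of sampled
    # issues, derive an average article length, then divide the year's total
    # pages by that average. All numbers below are invented for illustration.
    sampled_articles = 24      # articles counted in two sampled issues
    sampled_pages = 312        # pages in those two issues
    total_pages_2015 = 1850    # continuous page numbering across a dozen issues

    avg_length = sampled_pages / sampled_articles
    print(round(total_pages_2015 / avg_length))  # about 142 with these numbers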

In practice, I doubt that any two counts of a thousand or more OA journals will yield precisely the same totals. I’d guess that I’m very slightly overcounting articles in some journals that provide convenient annual totals—and undercounting articles in some journals that don’t.

For the analysis I’m doing, and for any analysis others are likely to do, these “errors” shouldn’t matter. If somebody claimed that overall numbers were 5% lower or 5% higher, my response would be that this is quite possible. I doubt that the differences in counts would be greater than that, at least for any aggregated data.

Making the case

If you believe I’m wrong—that there are real, serious, worthwhile research cases where only the unanonymized version will do—let me know (waltcrawford@gmail.com).

Obviously, anonymized datasets aren’t unusual; I don’t know of any open science advocate who would seriously argue that medical data should be posted with patient names or that libraries should keep enough data to be able to do analysis such as “people who borrowed X also borrowed Y.” In practice, there may be special use cases for an open copy of the master spreadsheet. On the other hand, except for the list of journals flagged as having malware on their sites, I’ll be doing my analysis with the anonymized spreadsheet—it’s what’s needed for this work, and won’t distract me with individual journal titles and how I might feel about their publishers.
