Open Data, Crowdsourcing, Independent Research and Misgivings

or Why Some Spreadsheets Probably Won’t Become Public

If you think that title is a mouthful, here’s the real title:

Why I’m exceedingly unlikely to make the spreadsheet(s) for my OA journals investigations public, and why I believe it’s reasonable not to do so.

For those of you on Friendfeed, there was a discussion on specifically this issue beginning August 26, 2014. The discussion was inconclusive (not surprisingly, partly because I was being a stubborn old goat), and I continued to think about the issues…even as I continued to build the new spreadsheet(s) for the project I hope to publish in the November and December 2014 Cites & Insights, if all goes well, cross several fingers and toes.

Consider this a public rethinking. Comments are most definitely open for this post (if I didn’t check the box, let me know and I’ll fix it), or you’re welcome to send me email, start a new thread on one of the social media I frequent (for this topic, Friendfeed or the OA community within Google+ seem most plausible), whatever…

Starting point: open data is generally a good idea

There may be some legitimate arguments against open datasets in general, but I’m not planning to make them here. And as you know (I suspect), I’m generally a supporter of open access; otherwise, I wouldn’t be spending hundreds of unpaid hours doing these investigations and writing them up.

All else being equal, I think I’d probably make the spreadsheet(s) available. I’ve done that in the past (the liblog projects, at least some of them).

But all else is rarely equal.

For example:

  • If a medical researcher released the dataset for a clinical trial in a manner that made it possible to determine the identities of the patients, even indirectly, that would be at best a bad thing and more likely actionable malpractice. Such datasets must be thoroughly scrubbed of identifying data before being released.

But of course, the spreadsheets behind Journals, “Journals” and Wannabes: Investigating The List have nothing to do with clinical trials; the explicitly named rows are journals, not people.

That will also be true of the larger spreadsheets in The Current Project.

How much larger? The primary worksheets in the previous project have, respectively, 9,219 [Beall’s Lists] and 1,531 [OASPA] data rows. The new spreadsheets will have somewhere around 6,779 [the subset of Beall’s Lists that was worth rechecking, but not including MDPI journals], exactly 1,378 [the subset of OASPA journals I rechecked, including MDPI journals], and probably slightly fewer than 3,386 [the new “control group,” consisting of non-medicine/non-biology/non-biomed journals in DOAJ that have enough English in the interface for me to analyze them and that aren’t in one of the other sets] rows—a total of somewhere around 11,543. But I’m checking them more deeply; it feels like a much bigger project.

So what’s the problem?

The spreadsheets I’ve built or am building are designed to allow me to look at patterns and counts.

They are not designed for “naming and shaming,” calling out specific journals in any way.

Yes, I did point out a few specific publishers in the July article, but only by quoting portions of their home pages. It was mostly cheap humor. I don’t plan to do it in the new project—especially since most of the journals in the new control group are from institutions with only one or a handful of journals; I think there are some 2,200 publisher names for 3,386 journals.

This is an important point: The July study did not name individual journals and say “stay away from this one, but this one’s OK.” Neither will the November/December study. That’s not something I’m interested in doing on a journal-by-journal or publisher-by-publisher basis. I lack the omniscience and universal subject expertise to even begin to consider such a task. (I question whether anybody has such omniscience and expertise; I know that I don’t.) I offered possible approaches to drawing your own judgment, but that’s about it.

Nor do I much want to be the subject of “reanalysis” with regard to the grades I assigned. (I don’t want angry publishers emailing me saying “You gave us a C! We’re going to sue you!” either—such suits may be idiotic, but I don’t need the tsuris.)

Releasing the full spreadsheets would be doing something I explicitly do not want to do: spreading a new set of journal grades. There is no Crawford’s List, and there won’t be one.

For that matter, I’m not sure I much want to see my numbers revalidated: for both projects, I use approximation in some cases, on the basis that approximation will yield good results for the kind of analysis I’m doing. (I’ll explain most of the approximation and shortcuts when I write the articles; I try to be as transparent as possible about methodology.)

For those reasons and others, I would not be willing to release the raw spreadsheets.

Could you randomize or redact the spreadsheets to eliminate these problems?

Well, yes, I could—but (a) that’s more unpaid labor and, more important, (b) I’m not sure the results would be worth much.

Here, for example, are the data label rows and one (modified) data row from part of the current project:

Pub  Journal      2014  2013  2012  2011  Start  Peak  Sum  Gr  GrF  APC   Note
pos  POS Physics  15    34    14    1     2011   34    64   B        $600

The columns, respectively, show: the publisher code (in this case, Pacific Open Science, a nonexistent, I think, publisher I may use to offer hypothetical examples in the discussion; their slogan: If an article is in our journals, it’s a POS!); the journal name; the number of articles in January-June 2014, all of 2013, all of 2012 and all of 2011; the starting year; the peak annual articles; the sum of the four years; the letter grade; a new “GrF” (the letter grade that journals with fewer than 20 articles per year would get if they had more); the article processing charge for a 10-page article; and any note I feel is needed. In the sample row, GrF and Note are blank. (If this were the new DOAJ control group, there would be another column, because hyperlinks were stored separately in DOAJ’s spreadsheet; for the one I chose, “POS Physics” is itself a hyperlink. But, of course, there’s no such journal. Don’t try to guess: the actual journal’s not remotely related to physics.)

I’ll probably add a column or two during analysis—e.g., the maximum annual APCs a given journal could have collected, in this case 34×600 or $20,400, and for the new DOAJ group the subject entry to do some further breakdowns.
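For anyone who thinks more easily in code than in spreadsheet columns, here is a rough Python sketch of that sample row and the derived figures. The class name, field names and property names are all made up for illustration; the real thing is a plain spreadsheet. The sketch also assumes that Peak is simply the largest of the four article counts (with the half-year 2014 figure taken as-is) and that the maximum annual APC take is Peak times APC, as in the 34×600 example.

```python
from dataclasses import dataclass


@dataclass
class JournalRow:
    """One spreadsheet row; field names are hypothetical, not the real layout."""
    pub: str                # publisher code
    journal: str            # journal name
    articles: dict          # year -> article count; 2014 covers January-June only
    start: int              # starting year
    grade: str              # letter grade (Gr)
    apc: int                # article processing charge in dollars, 10-page article
    grade_if_larger: str = ""   # GrF; blank when it doesn't apply
    note: str = ""

    @property
    def peak(self) -> int:
        # Assumption: peak is the largest count across all four columns.
        return max(self.articles.values())

    @property
    def total(self) -> int:
        # The Sum column: all four counts added together.
        return sum(self.articles.values())

    @property
    def max_annual_apc(self) -> int:
        # Upper bound on one year's APC revenue: peak articles times APC.
        return self.peak * self.apc


# The (modified) sample row for the nonexistent "POS Physics" journal.
row = JournalRow(
    pub="pos",
    journal="POS Physics",
    articles={2014: 15, 2013: 34, 2012: 14, 2011: 1},
    start=2011,
    grade="B",
    apc=600,
)

print(row.peak, row.total, row.max_annual_apc)  # 34 64 20400
```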

I could certainly randomize/redact this spreadsheet in such a way that it could be fully re-analyzed—that is, sort the rows on some combination that yields a semi-random output, delete the Pub column, and change the Journal column to a serial number equal to the row. Recipients would have all the data—but not the journal or publisher names. That wouldn’t even take very long (I’d guess ten minutes on a bad day).
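As a sketch of what that redaction step amounts to, here is a short Python version. It is not what I'd actually do in Excel: a seeded shuffle stands in for "sort the rows on some combination that yields a semi-random output," the function name and dictionary keys are made up, and the second and third sample rows are invented purely so there is something to shuffle.

```python
import random


def redact(rows, seed=None):
    """Return shuffled copies of the rows with Pub dropped and Journal
    replaced by a serial number equal to the new row position."""
    shuffled = [dict(r) for r in rows]       # copy, so the originals stay intact
    random.Random(seed).shuffle(shuffled)    # stand-in for the semi-random sort
    redacted = []
    for serial, row in enumerate(shuffled, start=1):
        row.pop("Pub", None)                 # delete the publisher code
        row["Journal"] = serial              # journal name becomes a row number
        redacted.append(row)
    return redacted


# First row is the hypothetical POS Physics example; the others are invented.
sample = [
    {"Pub": "pos", "Journal": "POS Physics", "2014": 15, "2013": 34,
     "2012": 14, "2011": 1, "Gr": "B", "APC": 600},
    {"Pub": "xxa", "Journal": "Invented Journal A", "2014": 3, "2013": 8,
     "2012": 0, "2011": 0, "Gr": "C", "APC": 0},
    {"Pub": "xxb", "Journal": "Invented Journal B", "2014": 40, "2013": 95,
     "2012": 60, "2011": 22, "Gr": "A", "APC": 250},
]

redacted_rows = redact(sample, seed=2014)
for r in redacted_rows:
    print(r)
```

Every number survives, so the counts and patterns could be fully re-analyzed; only the names are gone.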

Would anybody actually want a spreadsheet like that? Really?

Alternatively, I could delete the Gr and GrF columns and leave the others—but the fact is, people will arrive at slightly different article counts in some significant percentage of cases, depending on how they define “article” and whether they take shortcuts. I don’t believe most journals would be off by more than a few percentage points (and it’s mostly an issue for journals with lots of articles), but that would still be troublesome.

Or, of course, I could delete all the columns except the first two. But in the case of DOAJ, anyone wanting to do that research can download the full spreadsheet directly. If I were adding any value at all, it would be in expanding Beall’s publisher entries.

What am I missing, and do you have great counter-arguments?

As you’ll see in the Friendfeed discussion, I got a little panicky about some potential Moral Imperative to release these spreadsheets—panicky enough that I pondered shutting down the new project, even though I was already about two-thirds of the way through. If I had had these requests when I began the project or was, say, less than 2,000 rows into it, I might have just shut it down to avoid the issues.

At this point, I believe I’m justified in not wanting to release the spreadsheets. I will not do so without some level of randomizing or redaction, and I don’t believe that redacted spreadsheets would be useful to anybody else.

But there are the questions above. Responses explicitly invited.

[Caveat: I wrote this in the Blog Post portion of Word, but it’s barely been edited at all. It’s probably very rough. A slightly revised version may—or may not—appear in the October 2014 Cites & Insights. If there is an October 2014 Cites & Insights.]

Now, back to the spreadsheets and looking at journals, ten at a time…


Added September 3, 2014:

Two people have asked–in different ways–whether I’d be willing to release a spreadsheet including only the journal names (and publishers) and, possibly, URLs.

Easy answer: Yes, if anybody thought it was worthwhile.

There are three possible sheets:

  • The Beall list, with publishers and the publisher codes I assigned on one page, the journals (with “xxind” as a publisher code for Beall’s separate journal list) and publisher codes on another page. All (I believe) publisher names and most but not all journal names have hyperlinks. (Some publishers didn’t have hyperlinked lists I could figure out how to download.) That one might be mildly useful as an expansion of Beall’s publisher list. (This would be the original Beall list, including MDPI, not the new one I’m using for the new study.)
  • The OASPA list, similarly structured and same comments, lacking MDPI (which is in the new one I’m using for the new study).
  • The new “partial DOAJ” list–DOAJ entries that aren’t in medicine, biology or biomed, that have English as a language code and that aren’t–if I got it right–in the other lists. I don’t honestly see how this could save anybody any time, since all it is is a portion of what’s downloadable directly from DOAJ, albeit in May 2014 rather than now.

If someone wants one of these, let me know–waltcrawford@gmail.com. I may not respond immediately, but I’ll either return the sheet you want as an email attachment or, if there’s more than one request, possibly load it at waltcrawford.name or in Dropbox and send you a link.

 

 


Comments will be closed on October 31, 2014.

