Archive for September, 2014

C&I and The Project: A quick update

Saturday, September 13th, 2014

Just a quick update, also marking the last blog post I’ll do before I turn another year older…

The October 2014 Cites & Insights…

…will not exist. At least not as a separate issue. Most probably, the next C&I will be an October/November 2014 issue and will appear, with luck, some time in October or early November.

The project…

…is going swimmingly, I think. As of Wednesday, I’d have said “I’m sure”–but the last 300-odd journals in the Beall spreadsheet (the “independent” journals, because I checked them in publisher order) are slow going, as I should have expected.

For a bunch of journals with the same publisher, I can expect similar layout, the same place for APCs (if they’re hidden–some publishers are up front with them), the same possible shortcuts for counting articles. And for some “publishers,” I can anticipate spending very few keystrokes confirming that the “journals” are still nothing more than names on a web page.

The most extreme case of this came very early in the week, when I hit a “publisher” with 426 “journals,” only 20 of them having any articles at all. I usually consider it a good day if I can process 150 journals in all (usually doing 10 in the new DOAJ list followed by 30 in the much longer Beall list: the OASPA list has been done for a while now), an OK day if I process 100, and a great day if I can do 200. With that “publisher”, I managed 460 journals in one day, including 60 from the DOAJ list.

Given that Wednesday’s basically a half day and the weekend counts as a half day in total, here’s where I think I am:

  • I should finish Pass One on the Beall list by the end of this coming week. (Pass Two, a little additional refinement, should only take a week or so for all three lists combined.)
  • I might finish Pass One on the DOAJ list by the end of the following week–let’s say “within September” as a hoped-for deadline.
  • I can actually start working on Part One of the article(s) before the DOAJ list is complete, since that list should only enter into Part Two.

Then come lots of data massaging, thinking about the results, and writing it all up. I have no idea how long that will all take or, for that matter, how long the results will be. I’m aiming for somewhere between two 20-page and two 30-page essays, each constituting a C&I issue. My aim is notoriously weak.

I believe the project will be interesting and revealing. I know I’ve found some journals I might want to go back to and do some reading from…

Swan song?

At the moment, this project feels a little bit like a swan song. I don’t really have any major projects or book projects in mind at the moment. Oh, there are a couple of thousand–check that, 1,500–Diigo-tagged items waiting to be turned into various essays, but that’s just seeing C&I wind down. Or not.

It’s quite possible that new ideas will arise. Or I’ll start reading more, maybe finally join the local Friends and volunteer at the store or whatever. Or…

Anyway: Back to the project. 239 journals on the Beall list and 908 on the DOAJ list left to go; I’m sure a few of the DOAJ ones will disappear in the process (and I just deleted one duplicate title on the Beall list yesterday–a journal entered with two slightly different names but the same URL).

Update as of September 30, 2014:

Pass One is complete. I chose not to start on the first part of the report until the DOAJ set was done.

So is Pass Two.

I’ve started in on Part One of the report, and have completed the background material (a lot of it!).

Barring various disasters, Part One should be ready (and published as the October/November 2014 Cites & Insights) before the end of October. Again with the usual caveats, Part Two should be ready in mid-November.

One thing I’ve already found, and should have anticipated (though, to be clear, I didn’t prejudge likely results): I’d planned to use graphs for a few things, specifically peak articles by journal within a set of journals, APCs for journals, and maximum potential one-year revenue per journal.

That won’t happen. I guessed that all three would be power-law graphs. What I didn’t guess was just how extreme those graphs would be: even with logarithmic vertical scales, the points were so crowded near the bottom as to be difficult to interpret. I prepared a table equivalent for the first graph attempted (peak articles by journal within the Beall set); after looking at both (and dealing with the complexities of full-page-width graphs within a two-column Word document, especially if you want captions for them), I ripped out the first two graphs and will use tables instead. They don’t give as much detail, but they’re much easier to understand and to format.
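To illustrate the problem, here’s a minimal sketch (in Python with numpy and matplotlib, using made-up power-law-ish numbers, not my actual counts) of why this kind of distribution crowds the bottom of a chart even on a logarithmic scale:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical power-law-ish article counts: a handful of journals with
# many articles and a very long tail with very few. Not real data.
rng = np.random.default_rng(42)
counts = np.sort(rng.pareto(a=1.5, size=3000) * 5 + 1)[::-1]

fig, ax = plt.subplots()
ax.plot(counts)
ax.set_yscale("log")  # even with a logarithmic vertical scale...
ax.set_xlabel("Journal rank")
ax.set_ylabel("Peak annual articles")
ax.set_title("Most of the points pile up near the bottom")
plt.show()
```

Something like a table of ranges (so many journals with 1–9 articles, so many with 10–19, and so on) can carry the same message without the visual pileup.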


Open Data, Crowdsourcing, Independent Research and Misgivings

Monday, September 1st, 2014

or Why Some Spreadsheets Probably Won’t Become Public

If you think that title is a mouthful, here’s the real title:

Why I’m exceedingly unlikely to make the spreadsheet(s) for my OA journals investigations public, and why I believe it’s reasonable not to do so.

For those of you on Friendfeed, there was a discussion of precisely this issue beginning August 26, 2014. The discussion was inconclusive (not surprisingly, partly because I was being a stubborn old goat), and I continued to think about the issues…even as I continued to build the new spreadsheet(s) for the project I hope to publish in the November and December 2014 Cites & Insights, if all goes well, cross several fingers and toes.

Consider this a public rethinking. Comments are most definitely open for this post (if I didn’t check the box, let me know and I’ll fix it), or you’re welcome to send me email, start a new thread on one of the social media I frequent (for this topic, Friendfeed or the OA community within Google+ seem most plausible), whatever…

Starting point: open data is generally a good idea

There may be some legitimate arguments against open datasets in general, but I’m not planning to make them here. And as you know (I suspect), I’m generally a supporter of open access; otherwise, I wouldn’t be spending hundreds of unpaid hours doing these investigations and writing them up.

All else being equal, I think I’d probably make the spreadsheet(s) available. I’ve done that in the past (the liblog projects, at least some of them).

But all else is rarely equal.

For example:

  • If a medical researcher released the dataset for a clinical trial in a manner that made it possible to determine the identities of the patients, even indirectly, that would be at best a bad thing and more likely actionable malpractice. Such datasets must be thoroughly scrubbed of identifying data before being released.

But of course, the spreadsheets behind Journals, “Journals” and Wannabes: Investigating The List have nothing to do with clinical trials; the explicitly named rows are journals, not people.

That will also be true of the larger spreadsheets in The Current Project.

How much larger? The primary worksheets in the previous project have, respectively, 9,219 [Beall’s Lists] and 1,531 [OASPA] data rows. The new spreadsheets will have somewhere around 6,779 [the subset of Beall’s Lists that was worth rechecking, but not including MDPI journals], exactly 1,378 [the subset of OASPA journals I rechecked, including MDPI journals], and probably slightly fewer than 3,386 [the new “control group,” consisting of non-medicine/non-biology/non-biomed journals in DOAJ that have enough English in the interface for me to analyze them and that aren’t in one of the other sets] rows—a total of somewhere around 11,543. But I’m checking them more deeply; it feels like a much bigger project.

So what’s the problem?

The spreadsheets I’ve built or am building are designed to allow me to look at patterns and counts.

They are not designed for “naming and shaming,” calling out specific journals in any way.

Yes, I did point out a few specific publishers in the July article, but only by quoting portions of their home pages. It was mostly cheap humor. I don’t plan to do it in the new project—especially since most of the journals in the new control group are from institutions with only one or a handful of journals; I think there are some 2,200 publisher names for 3,386 journals.

This is an important point: The July study did not name individual journals and say “stay away from this one, but this one’s OK.” Neither will the November/December study. That’s not something I’m interested in doing on a journal-by-journal or publisher-by-publisher basis. I lack the omniscience and universal subject expertise to even begin to consider such a task. (I question that anybody has such omniscience and expertise; I know that I don’t.) I offered possible approaches to drawing your own judgment, but that’s about it.

Nor do I much want to be the subject of “reanalysis” with regard to the grades I assigned. (I don’t want angry publishers emailing me saying “You gave us a C! We’re going to sue you!” either—such suits may be idiotic, but I don’t need the tsuris.)

Releasing the full spreadsheets would be doing something I explicitly do not want to do: spreading a new set of journal grades. There is no Crawford’s List, and there won’t be one.

For that matter, I’m not sure I much want to see my numbers revalidated: for both projects, I use approximation in some cases, on the basis that approximation will yield good results for the kind of analysis I’m doing. (I’ll explain most of the approximation and shortcuts when I write the articles; I try to be as transparent as possible about methodology.)

For those reasons and others, I would not be willing to release the raw spreadsheets.

Could you randomize or redact the spreadsheets to eliminate these problems?

Well, yes, I could—but (a) that’s more unpaid labor and, more important, (b) I’m not sure the results would be worth much.

Here, for example, are the column-label row and one (modified) data row from part of the current project:

Pub   Journal        2014   2013   2012   2011   Start   Peak   Sum   Gr   GrF   APC    Note
pos   POS Physics      15     34     14      1    2011     34    64    B         $600

The columns, respectively:

  • Pub: the publisher code. In this case that’s Pacific Open Science, a nonexistent (I think) publisher I may use to offer hypothetical examples in the discussion. Their slogan: If an article is in our journals, it’s a POS!
  • Journal: the journal name.
  • 2014, 2013, 2012, 2011: the number of articles in January–June 2014, all of 2013, all of 2012 and all of 2011.
  • Start: the starting year.
  • Peak: the peak annual article count.
  • Sum: the sum of the four years.
  • Gr: the letter grade.
  • GrF: a new column, the letter grade that journals with fewer than 20 articles per year would get if they had more.
  • APC: the article processing charge for a 10-page article.
  • Note: any note I feel is needed.

(If this were the new DOAJ control group, there would be another column, because hyperlinks were stored separately in DOAJ’s spreadsheet; for the one I chose, “POS Physics” is itself a hyperlink. But of course there’s no such journal. Don’t try to guess: the actual journal isn’t remotely related to physics.)

I’ll probably add a column or two during analysis—e.g., the maximum annual APCs a given journal could have collected, in this case 34×600 or $20,400, and for the new DOAJ group the subject entry to do some further breakdowns.
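The revenue column is simple arithmetic; here’s a quick sketch using the hypothetical POS Physics row above:

```python
# Maximum potential one-year revenue for the hypothetical POS Physics row:
# peak annual articles times the APC for a 10-page article.
peak_articles = 34
apc_dollars = 600
max_annual_revenue = peak_articles * apc_dollars
print(f"${max_annual_revenue:,}")  # $20,400
```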

I could certainly randomize/redact this spreadsheet in such a way that it could be fully re-analyzed—that is, sort the rows on some combination that yields a semi-random output, delete the Pub column, and change the Journal column to a serial number equal to the row. Recipients would have all the data—but not the journal or publisher names. That wouldn’t even take very long (I’d guess ten minutes on a bad day).
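For the curious, here’s roughly what that redaction would look like as a minimal Python sketch, assuming the sheet were exported as CSV with the column names shown above (the file names are hypothetical; this is illustrative, not my actual workflow):

```python
import csv
import random

# Read the raw sheet, shuffle the rows, drop the publisher code, and
# replace each journal name with a serial number equal to the row.
with open("journals_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

random.shuffle(rows)  # semi-random row order

fieldnames = ["Journal", "2014", "2013", "2012", "2011",
              "Start", "Peak", "Sum", "Gr", "GrF", "APC", "Note"]
with open("journals_redacted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for serial, row in enumerate(rows, start=1):
        row.pop("Pub", None)          # delete the Pub column
        row["Journal"] = str(serial)  # anonymize the journal name
        writer.writerow(row)
```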

Would anybody actually want a spreadsheet like that? Really?

Alternatively, I could delete the Gr and GrF columns and leave the others—but the fact is, people will arrive at slightly different article counts in some significant percentage of cases, depending on how they define “article” and whether they take shortcuts. I don’t believe most journals would be off by more than a few percentage points (and it’s mostly an issue for journals with lots of articles), but that would still be troublesome.

Or, of course, I could delete all the columns except the first two—but in the case of DOAJ, anyone wanting to do that research can download the full spreadsheet directly. If I were adding any value at all, it would be in expanding Beall’s publisher entries.

What am I missing, and do you have great counter-arguments?

As you’ll see in the Friendfeed discussion, I got a little panicky about some potential Moral Imperative to release these spreadsheets—panicky enough that I pondered shutting down the new project, even though I was already about two-thirds of the way through. If these requests had come when I began the project, or when I was, say, fewer than 2,000 rows into it, I might have just shut it down to avoid the issues.

At this point, I believe I’m justified in not wanting to release the spreadsheets. I will not do so without some level of randomizing or redaction, and I don’t believe that redacted spreadsheets would be useful to anybody else.

But there are the questions above. Responses explicitly invited.

[Caveat: I wrote this in the Blog Post portion of Word, but it’s barely been edited at all. It’s probably very rough. A slightly revised version may—or may not—appear in the October 2014 Cites & Insights. If there is an October 2014 Cites & Insights.]

Now, back to the spreadsheets and looking at journals, ten at a time…


Added September 3, 2014:

Two people have asked–in different ways–whether I’d be willing to release a spreadsheet including only the journal names (and publishers) and, possibly, URLs.

Easy answer: Yes, if anybody thought it was worthwhile.

There are three possible sheets:

  • The Beall list, with publishers and the publisher codes I assigned on one page, the journals (with “xxind” as a publisher code for Beall’s separate journal list) and publisher codes on another page. All (I believe) publisher names and most but not all journal names have hyperlinks. (Some publishers didn’t have hyperlinked lists I could figure out how to download.) That one might be mildly useful as an expansion of Beall’s publisher list. (This would be the original Beall list, including MDPI, not the new one I’m using for the new study.)
  • The OASPA list, similarly structured and same comments, lacking MDPI (which is in the new one I’m using for the new study).
  • The new “partial DOAJ” list–DOAJ entries that aren’t in medicine, biology or biomed, that have English as a language code and that aren’t–if I got it right–in the other lists. I don’t honestly see how this could save anybody any time, since it’s just a portion of what’s downloadable directly from DOAJ, albeit from May 2014 rather than now.

If someone wants one of these, let me know–waltcrawford@gmail.com. I may not respond immediately, but I’ll either return the sheet you want as an email attachment or, if there’s more than one request, possibly load it at waltcrawford.name or in Dropbox and send you a link.