Archive for March, 2016

Recovery: a short, slow post

Thursday, March 31st, 2016

Since I’ve left notes elsewhere saying I’m mostly offline for the next [1:n] days [where n is indeterminate], I thought a little more detail might be in order:

  • The surgery: removing a Schwannoma (a benign nerve sheath tumor) from my right forearm–a visible bump perhaps 1.2″ long and 1.3″(?) high, determined to be benign by a January needle biopsy, which also irritated the lump and caused it to grow.
  • When? Tuesday, March 29, around 3:30 pm, at Stanford Hospital; Dr. David G. Mohler (who did a great job).
  • Pain? Not bad: of the allowed 2 pills every 6 hours, I needed 1 pill Tuesday afternoon, 1 at bedtime, 1 Wednesday a.m. (10 hrs later) and, since then, 1/2 pill every eight hours. Good chance I’ll stop altogether tomorrow. (OTOH, my metabolism appears to be tough on drugs: the whole-arm nerve block, intended to last 8-12 hours, lasted about 3.5 hours. General anesthesia not wanted or needed.)
  • Problems? Maybe just reality: after trauma to the tendons and muscles and nerves in the arm, my fingers aren’t back to normal. (But gripping, etc. is pretty much OK.)

So I mostly need to let my right arm rest until the swelling goes down. I’ve seen how hard it is to work online without instinctively using both hands. So I’m mostly staying off. Two fingers are starting to come back to semi-normal; the rest could take a day, or three, or a week.

Otherwise? There’s leeway enough in The Big Project; I’m feeling good enough that I went for the daily walk around the 1.3-mile block with my wife today.

Thanks for the expressions of concern.

The Great Paskins Mystery

Saturday, March 26th, 2016

And now for something completely different (and this post won’t be publicized–it’s for people who subscribe or otherwise come here on their own, what others call “both of my readers”).

Who is Paskins?

More to the point, why do hundreds of spammers think that my name is Paskins–even though, to the best of my knowledge, “waltcrawford@gmail.com” doesn’t really sound much like Paskins and I’m the only one with this email address?

It’s a curiosity. Fortunately, Gmail traps almost all of it as spam. But it’s a, well, curious curiosity: why would all these people be trying to contact Paskins at my email address?

Life is full of curiosities, I guess.

Making the case (a follow-up post)

Saturday, March 26th, 2016

A while back, I wrote a post explaining why the dataset for Gold Open Access Journals 2011-2015 will not include journal names and publishers, and invited people to send me email explaining possible positive use cases if that decision was changed.

I’ve received one such email so far, resulting in an exchange of email; I’ve saved it for later consideration.

Meanwhile, a tweetstorm has erupted that seems to say that my work is useless if I don’t provide the full data. Apparently the other post is too long to read (or didn’t get read), so here’s a slightly different and shorter version–but you still need to read the other post before you respond.

  • If somebody attempted to replicate the research starting in, say, July 2016, the results would be different for some significant number of journals, for several reasons (some having to do with what gets counted, some with delays in posting, some because journals that yield 404s in March may not in July, or vice-versa).
  • Somebody out to snipe or discredit will also look at individual journals and disagree with my choice of which of 28 broad subjects to assign each one to; in quite a few cases, more than one choice is reasonable.
  • I’m very interested in use cases–cases where useful additional research would be possible based on a non-anonymized spreadsheet. (In some such cases, the dataset will be made available to the group or person–I’ve already done that for the previous dataset.) If there are convincing cases, I’d talk to SPARC about whether it makes sense to open up the data completely. And hope that I don’t spend the rest of the year dealing with a stream of “But THIS NUMBER’S WRONG, so your whole study’s worthless” or “But THIS JOURNAL’S REALLY ABOUT X, so your whole study’s worthless” or variants of that.
  • Email (to waltcrawford@gmail.com) calmly suggesting positive use cases will be dealt with politely and taken into account. Head-on attacks 140 characters at a time are, shall we say, less likely to persuade me. (Well, they might persuade me never to get involved in this kind of project again, so if that’s your motive…)
  • Oh, and by the way: This isn’t about hiding methodology. I’ve never done so, and don’t plan to start now.

I’ll be off the air entirely for several days beginning the evening of March 28, so email may not receive quick responses at that point. Meanwhile, I’d like to get back to getting something done.

In partial defense of Jeffrey Beall

Friday, March 25th, 2016

Not in defense of his lists, which I regard as a bad idea in theory and fatally flawed in practice, for reasons I’ve documented (most recently here but elsewhere over time).

But…I’ve seen some stuff on another blog lately that bothers me.

  • I do not for a minute believe that Jeffrey Beall wrote the supposed email I’ve seen that suggests a listed publisher would be re-evaluated for $5,000. That email was written using English-as-a-third-language grammar; it’s just not plausible as coming from Beall.
  • I truly dislike the notion that a doctorate is the minimum qualification for scholarship. But then, I would, wouldn’t I (since my pinnacle of academic achievement is a BA and a handful of credits toward an MA).
  • I also dislike the notion that state colleges are somehow disreputable. My own degree comes from a state institution, and I’ll match its credentials with anybody.

The same blog had an interesting fisking of one of Beall’s sillier anti-OA papers. I had tagged it toward a future Cites & Insights essay on access and ethics. But after seeing this other stuff…I won’t link to or source from this particular blog.  Heck, I’ve been the subject of Beall’s ad hominem attacks; doesn’t mean I have to support that sort of thing.

Cites & Insights 16:3 (April 2016) available

Wednesday, March 23rd, 2016

The April 2016 Cites & Insights (16:3) is now available for downloading at http://citesandinsights.info/civ16i3.pdf

That print-oriented version is 30 pages long. If you’re planning to read online or on an ereader, you may prefer the single-column 6″ x 9″ version, 59 pages long, available at http://citesandinsights.info/civ16i3on.pdf

While much of this issue has appeared as a series of posts in this blog, the final section of the lead essay is new, as is the fourth essay; the issue closes by reprinting 35 pages of The Gold OA Landscape 2011-2014 to serve as context for a portion of the first essay.

This issue includes:

The Front: Gold Open Access Journals 2011-2015: A SPARC Project pp. 1-8

Remember the “watch this space” note in the February-March “The Front”? This is what it was about. This essay includes the key announcement, a partial list of changes from the 2011-2014 project, a partial checkpoint prepared when I was halfway through the first pass, a section asking for possible “changes for the better” in the analysis and writeup (note that this year’s PDF ebook will be free and OA, since it’s a SPARC-sponsored project), another section discussing the planned anonymization of the (free) spreadsheet when analysis is done–and, new to this issue, a second checkpoint prepared at the end of the first journal pass.

The Front (also): Readership Notes  pp. 8-9

Notes on the most frequently downloaded issues in Volume 15 and the most frequently downloaded issues overall.

Intersections: “Trust Me”: The Other Problem with Beall’s Lists  pp. 9-11

As far as I can tell, Jeffrey Beall provides no evidence whatsoever–not even his classic “this publisher has a funny name”–for seven out of eight journals and publishers on his 2016 lists. This piece, which has a little additional material beyond the original post, goes into some detail.

The Back  pp. 11-12

Not precisely filler to get an even number of pages, but…OK, so these three mini-rants are mostly filler to get an even number of pages.

The Gold OA Landscape 2011-2014, pp. 39-73 (following page 12)

I’m including chapters 5 (starting dates), 6 (country of publication), 7 (segments and subjects), 8 (biology and medicine) and 9 (biology) to provide more context for my invitation to suggest better ways to analyze and present the 2011-2015 data. Please note that these pages appear precisely as they would in the PDF ebook if you’re looking at the online 6″ x 9″ version (since the book’s 6″x9″), but are reduced very slightly for the print-oriented version (to 5.5″x8.5″) so that two book pages will fit on one printed page.

Next issue?

I did not label this the April-May 2016 issue. Whether there’s a May issue in late April or early May, or a May-June issue later in May, depends on a number of factors having mostly to do with Gold Open Access Journals 2011-2015.

Why Anonymize?

Monday, March 14th, 2016

The project plan for Gold Open Access Journals 2011-2015 calls for me to make an anonymized version of the master spreadsheet freely available—and as soon as the project was approved, I made an anonymized version of the 2014 spreadsheet available.

Two people raised the question “Why anonymized?”—why don’t I just post the spreadsheet including all data, instead of removing journal names, publishers and URLs and adding a simple numeric key to make rows unique?

The short answer is that doing so would shift the focus of the project from patterns and the overall state of gold OA to specifics, and lead to arguments as to whether the data was any good.
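For the curious, the anonymization step itself is mechanical. Here’s a minimal sketch, assuming the master spreadsheet has been exported to CSV; the column names (“Journal,” “Publisher,” “URL”) are my illustrative guesses, not the actual field names in the spreadsheet:

```python
# Hypothetical sketch: drop identifying columns and add a numeric key
# so each row stays unique. Column names are assumed, not the real ones.
import csv

IDENTIFYING = {"Journal", "Publisher", "URL"}  # columns to remove

def anonymize(in_path, out_path):
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        kept = [f for f in reader.fieldnames if f not in IDENTIFYING]
        writer = csv.DictWriter(dst, fieldnames=["Key"] + kept)
        writer.writeheader()
        for key, row in enumerate(reader, start=1):
            out = {f: row[f] for f in kept}
            out["Key"] = key  # simple numeric key replacing the title
            writer.writerow(out)
```

Every analytic column (subject, country, APC, article counts) survives; only the columns that identify a specific journal are gone.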

Maybe that’s all the answer that’s needed. Although I counted very little use of the 2014 spreadsheet in January and February 2016, it’s been used more than 900 times in the first half of March 2016—but I have received no more queries as to why it’s anonymized. For any analysis of patterns, of course, journal names don’t matter. But maybe a slightly longer answer is useful.

That longer answer begins with the likelihood that some folks would try to undermine the report’s findings by claiming that the data is full of errors—and the certainty that such folks could find “errors” in the data.

Am I being paranoid in suggesting that this would happen? Thanks to Kent Anderson, I can safely say that I’m not, since within a day or two of my posting the spreadsheet, he tweeted this:

Anderson didn’t say “Am I misunderstanding?” or “Clarification needed” or any alternative suggesting that more information was needed. No: he went directly on the attack with “Errors exist” (by completely misreading the dataset, as it happens: around 500 gold OA journals began publication, usually not as OA, between 1853 and 1994).

It’s not wrong, it’s just different

To paraphrase Ed and Patsy Bruce (they wrote the song, even though Willie Nelson and Waylon Jennings had the big hit with it)…

If somebody else—especially someone looking to “invalidate” this research—goes back to do new counts on some number of journals, they will probably get different numbers in a fair number of cases.

Why? Several reasons:

  • Inclusiveness: Which items in journals—and which journals—do you include? The 2014 count tended to be more exclusive when I had to count each article individually; the 2015 count tends to include all items subject to some form of review, including book reviews and case reports. Similarly, the 2015 report includes journals that consist of (reviewed) conference reports (although I’ll note the subset of such journals).
  • Shortcuts: I did not in fact look at each and every item in each and every issue of each and every journal, compare it to that journal’s own criteria for reviewed or peer-reviewed, and determine whether to include it. To do that, I’d estimate that a single year’s count would require at least 2,000 hours exclusive of determining APC existence and levels and all other overhead—and, of course, a five-year study would require four times that amount (fewer journals and articles in earlier years). That’s not plausible under any circumstances. Instead, I used every shortcut that I could: publication-date indexes or equivalent for SciELO, J-Stage, MDPI, Dove and several others; DOI numbers when it’s clear they’re assigned sequentially; numbered tables of contents; Find (Ctrl-F) counts for distinctive strings (e.g., “doi:” or “HTML”) after quick scans of the contents tables. For the latter, I did make rough adjustments for clear editorials and other overhead.
  • Estimates: In some cases—fewer in 2015 than in 2014, but still some—I had to estimate, as for instance when a journal with no other way of counting publishes hundreds of articles each year and maintains page numbering throughout a dozen issues. I might count the articles in one or two issues, determine an average article length, and estimate the year’s total count based on that length. I also used counts from DOAJ in many cases, when those counts were plausible based on manual sampling.
  • Errors: I’m certain that my counts are off by one or two in some cases; that happens.
  • Late additions: Some journals, especially those that are issue-oriented and still include print versions, post online articles very late. Even though I’m retesting all cases where the “final issue” of 2015 seemed to be missing when checked in January-March 2016, it’s nearly certain that somebody looking at some journals in, say, August 2016 will find more 2015 articles than I did.
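The estimation shortcut in the list above is simple arithmetic: sample an issue, derive an average article length, and scale to the journal’s continuous page numbering. A sketch, with entirely made-up numbers:

```python
# Hypothetical sketch of the page-based estimate described above.
def estimate_article_count(sampled_articles, sampled_pages, total_pages_in_year):
    """Estimate a year's article count from one sampled issue."""
    avg_length = sampled_pages / sampled_articles      # pages per article
    return round(total_pages_in_year / avg_length)

# Example: a 60-page sampled issue holding 12 articles (5 pages each)
# in a journal whose numbering ran to 780 pages for the year.
estimate_article_count(12, 60, 780)  # -> 156
```

The same error sources listed above apply: a skewed sample issue, editorials counted as articles, and so on, which is why such estimates were checked against DOAJ counts where plausible.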

In practice, I doubt that any two counts of a thousand or more OA journals will yield precisely the same totals. I’d guess that I’m very slightly overcounting articles in some journals that provide convenient annual totals—and undercounting articles in some journals that don’t.

For the analysis I’m doing, and for any analysis others are likely to do, these “errors” shouldn’t matter. If somebody claimed that overall numbers were 5% lower or 5% higher, my response would be that this is quite possible. I doubt that the differences in counts would be greater than that, at least for any aggregated data.

Making the case

If you believe I’m wrong—that there are real, serious, worthwhile research cases where only the unanonymized version will do—let me know (waltcrawford@gmail.com).

Obviously, anonymized datasets aren’t unusual; I don’t know of any open science advocate who would seriously argue that medical data should be posted with patient names or that libraries should keep enough data to be able to do analysis such as “people who borrowed X also borrowed Y.” In practice, there may be special use cases for an open copy of the master spreadsheet. On the other hand, except for the list of journals flagged as having malware on their sites, I’ll be doing my analysis with the anonymized spreadsheet—it’s what’s needed for this work, and won’t distract me with individual journal titles and how I might feel about their publishers.

Changes for the Better?

Friday, March 11th, 2016

Do you have suggestions that will help make Gold Open Access Journals 2011-2015 even better than The Gold OA Landscape 2011-2014?

If so, now’s the time to suggest them—any time between now and May 1, 2016 (the earliest date I’m likely to start working on data analysis and the book manuscript). Suggestions should go to me at waltcrawford@gmail.com.

You say you haven’t purchased the book yet, either in paperback or PDF ebook form? You still can, and it will still be worthwhile when the new book comes out.

Alternatively, you can get a good idea of the general approach and tables used in the excerpt published as the October 2015 Cites & Insights, although that version lacks any graphs.

I’ve appended pages 39 through 73 of The Gold OA Landscape 2011-2014 to the end of the next Cites & Insights, probably out in late March 2016. That segment includes almost all varieties of tables and graphs used in the book. The online version is an exact replica of the print book; the print (two-column) version is just slightly smaller, so that four pages of the 6×9″ book fit on each 8.5×11″ sheet rather than having loads of waste space.

The Basics

Basically, the data used for analysis includes, for each journal:

  • the year reported to DOAJ (which is not always the start of publication);
  • the country of publication (again as reported to DOAJ);
  • one of 28 subjects and three broad areas that I’ve derived from the subjects, keywords and journal/article titles;
  • the data I went looking for: whether there’s an author-side fee (usually called an APC or Article Processing Charge, but they’re not all that straightforward) and how much it is, and the number of published articles (and similar items) for each year 2011 through 2015;
  • a two-letter code (or “grade and subgrade”) for special cases, though most journals don’t have special codes.

I also derive some measures: the peak article number during the five years and, if there are APCs, the maximum revenue for 2014 (2015 this time around).
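The derived measures can be sketched in a few lines. This is an illustrative reconstruction, not the actual worksheet formulas; the record layout and field names here are assumptions:

```python
# Hypothetical per-journal record: derive peak article count and maximum
# revenue (APC times latest-year articles, assuming every article paid full price).
def derive(journal):
    counts = journal["articles"]            # assumed {year: article count}
    peak = max(counts.values())             # peak annual article count
    apc = journal.get("apc", 0)             # author-side fee; 0 if none
    max_revenue = apc * counts.get(2015, 0) # upper bound, ignores waivers
    return peak, max_revenue

derive({"articles": {2011: 40, 2015: 90}, "apc": 1200})  # -> (90, 108000)
```

“Maximum revenue” really is a ceiling: waivers and discounts mean actual revenue is lower, which is why the book treats it as a bound rather than an income figure.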

Last year, after an overall discussion of maximum revenues, overall article counts, and special cases, I looked at journals by annual article volume for each of the three major areas (which have very different characteristics), fee and revenue levels, starting dates for free and APC-charging journals, and a number of measures by country of publication. I also provided one set of pie charts breaking down free and pay journals by major area.

For each of the three major areas (biomed, STEM, and humanities and social sciences) I looked at cost per article by year, journal and article volume by year (and free percentage of each), revenue brackets for journals, article volume brackets, and APC level brackets. A bar graph showed free and pay articles for each year.

For each subject within an area—using the revenue and article volume brackets appropriate for that area—I showed journals and articles for each year (and free percentage), the free/pay article bar graph, journals by article volume (and percent free), journals and articles by APC range, a line graph showing free and pay journals by starting date, and a table showing the countries with the most published 2014 articles for that subject.

At the end of the book, I provided a few subject summaries—percentage of free journals, percentage of articles in no-fee journals, change in article volume, change in free article volume, journals changing article volume by 10% or more from 2013 to 2014, average APC per paid article and for all articles, median APC per paid article and all articles, and the median, first quartile, and third quartile articles per journal for 2014.

Data Changes for 2015

There’s another year of data—more journals and more data for existing journals. I’m taking some pains to include more journals (and defining “articles” somewhat more inclusively and, I believe, consistently).

Beyond that, there may be one new category of derived data: a publisher category—breaking journals down into what seem to be five reasonable groups based on what’s in the DOAJ publisher field:

  • Academic, published by universities and colleges, including university presses.
  • Society, published by societies and associations.
  • Traditional*, published by publishers that also publish subscription journals.
  • OA publisher*, published by groups that don’t appear to publish subscription journals (and that publish at least a handful of journals—see notes on the “*” below)
  • Miscellany, everybody else.

About the asterisk on Traditional and OA publisher: there are 5,983 different “publisher names” (that is, distinct character strings in the DOAJ publisher field). That’s more than one “publisher” for every two journals. The vast majority of those, all but 919, publish a single DOAJ-listed journal.

I think it’s reasonable to limit the two “publisher” categories (Traditional and OA) to firms that publish at least a handful of journals, and lump the others in as Miscellany. (If nothing else, it makes this added data feasible.)

What’s a handful? If the cutoff is “five or more,” it involves only 221 publishers in all, accounting for 4,128 journals. If the cutoff is “four or more,” it involves 316 publishers—and, naturally, adds 380 journals for a total of 4,508. Dropping it to “three or more journals” brings us up to 486 publishers and 5,018 journals. I suspect the final cutoff will be either four or five.
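The cutoff comparison above is a straightforward tally over the DOAJ publisher field. A sketch of that calculation (toy data here; the real input is the 5,983 distinct publisher strings):

```python
# Sketch of the cutoff comparison: tally journals per publisher string,
# then report how many publishers and journals a given cutoff captures.
from collections import Counter

def cutoff_summary(publisher_column, cutoff):
    """(publishers with >= cutoff journals, journals they account for)."""
    per_publisher = Counter(publisher_column)
    qualifying = {p: n for p, n in per_publisher.items() if n >= cutoff}
    return len(qualifying), sum(qualifying.values())

# Toy data just to show the shape; with the real DOAJ field, a cutoff of
# five yielded 221 publishers accounting for 4,128 journals.
cutoff_summary(["A", "A", "A", "B", "B", "C"], 2)  # -> (2, 5)
```

Because distinct character strings stand in for publishers, variant spellings of one publisher would count separately; that’s one more reason to prefer a conservative cutoff.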

Incidentally, if I add that column, it will be in the anonymized spreadsheet made publicly available at the end of this project. Other than the list of journal titles apparently containing malware, it will be possible for anybody else to replicate any or all of the graphs and numbers in the book.

Probable Changes

I believe it will make sense to devote a chapter to publisher categories—whether there are major differences in article volume, APC charges (existence and amount) and, possibly, domination in some countries.

I’m fairly certain the pie charts will go away: I don’t believe they add enough information to justify the space. I could be convinced otherwise. (Note that the print paperback will, of necessity, be black and white to keep production costs down, so really attractive pie charts aren’t feasible.)

Possible Changes

What else should I consider? Which existing tables and graphs don’t seem especially valuable—and what would work better? (Assume that this year’s book can be larger than last, but not enormously larger.)

I’m open to suggestions, which I’ll discuss with my contacts at SPARC (and I anticipate suggestions from SPARC as well).

I would offer a free PDF version of this year’s book as a reward for good suggestions—but since this year’s PDF version will be free in any case, that’s