Archive for the ‘open access’ Category

“Trust Me”: The Other Problem with 87% of Beall’s Lists

Friday, January 29th, 2016

Here’s the real tl;dr: I could find any discussion at all in Beall’s blog for only 230 of the 1,834 journals and publishers in his 2016 lists—and those cases don’t include even 2% of the journals in DOAJ.

Now for the shorter version…

As long-time readers will know, I don’t much like blacklists. I admit to that prejudice: I don’t think blacklists are good ways to solve problems.

And yet, when I first took a hard look at Jeffrey Beall’s lists in 2014, I was mostly assessing whether the lists represented as massive a problem as Beall seemed to assert. As you may know, I concluded that they did not.

But there’s a deeper problem—one that I believe applies whether you dislike blacklists or mourn the passing of the Index Librorum Prohibitorum. To wit, Beall’s lists don’t meet what I would regard as minimal standards for a blacklist even if you agree with all of his judgments.

Why not? Because, in seven cases out of eight (on the 2016 lists), Beall provides no case whatsoever in his blog: the journal or publisher is in the lists Just Because. (Or, in some but not most cases, Beall provided a case on his earlier blog but failed to copy those posts.)

Seven cases out of eight: 87.5%. For 1,604 of the 1,834 journals and publishers (excluding duplicates) on the 2016 versions, the only available reason for avoiding them is an implicit “Trust me.”

I believe that’s inexcusable, and makes the strongest possible case that nobody should treat Beall’s lists as being significant. (It also, of course, means that research based on the assumption that the lists are meaningful is fatally flawed.)

The Short Version

Since key numbers will appear first as a blog post on Walt at Random and much later in Cites & Insights, I’ll lead with the short version.

I converted the two lists into an Excel spreadsheet (trivially easy to do), adding columns for “Type” (Pub or Jrn), Case (no, weak, maybe or strong), Beall (URL for Beall’s commentary on this journal or publisher—the most recent or strongest when there’s more than one), and—after completing the hard work—six additional columns. We’ll get to those.

Then I went through Beall’s blog, month by month, post by post. Whenever a post mentioned one or more publishers or independent journals, I pasted the post’s URL into the “Beall” column for the appropriate row, read the post carefully, and filled in the “Case” column based on the most generous reading I could make of Beall’s discussion. (More on this later in the full article, maybe.)

I did that for all four years, 2012 through 2015, and even January 2016.
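If you wanted to script the tallying step rather than use Excel’s own tools, a minimal sketch along these lines would do it (the filename and the treatment of blank Case cells are placeholders, not my actual workflow):

```python
# Minimal sketch: tally the hand-coded "Case" column from a CSV export of
# the working spreadsheet. The filename is a placeholder; a blank Case cell
# is treated here as "no discussion found in the blog."
import csv
from collections import Counter

counts = Counter()
with open("beall2016-working.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[(row.get("Case") or "").strip() or "no discussion found"] += 1

for case, n in counts.most_common():
    print(f"{case}: {n}")
```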

The results? In 1,604 cases, I was unable to find any discussion whatsoever. (No, I didn’t read all of the comments on the posts. Surely if you’re going to condemn a publisher or journal, you would at least mention your reasons in the body of a post, right?)

If you discard those on the basis that it’s grotesquely unfair to blacklist a journal or publisher without giving any reason why, you’re left with a list of 53 journals and 177 publishers. Giving Beall the benefit of the doubt, I judged that he made no case at all in five cases (the fact that you think a publisher has a “funny name” is no case at all, for example). I think he made a very weak case (e.g., one questionable article in one journal from a multijournal publisher) in 69 cases. I came down on the side of “maybe” 43 times and “strong” 113 times, although it’s important to note that “strong” means that at some point for some journal there were significant issues raised, not that a publisher is forever doomed to be garbage.

Call it 156 reasonable cases—now we’re down to less than 10% of the lists.

Then I looked at the spreadsheets I’m working on for the 2015 project (note here that SPARC has nothing at all to do with this little essay!)—”spreadsheets” because I did this when I was about 35% of the way through the first-pass data gathering. I could certainly identify which publishers had journals in DOAJ, but could only provide article counts for those in the first 35% or so. (In the end, I just looked up the 53 journals directly in DOAJ.)

Here’s what I found.

  • Ignoring the strength of case, Beall’s lists include 209 DOAJ journals—or 1.9% of the total. But of those 209, 85 are from Bentham Open (which, in my opinion, has cleaned up its act considerably) and 49 are from Frontiers Media (which Beall never actually made a case to include in his list, but somehow it’s there). If you eliminate those, you’re down to 75 journals, or 0.7%: less than one out of every hundred DOAJ journals. (A rough cross-matching sketch appears after this list.)
  • For that matter, if you limit the results to strong and maybe cases, the number drops to 37 journals: 0.33%, roughly one in every three hundred DOAJ journals.
  • For journals I’ve already analyzed (and since I’m working by publisher name, that includes most of these—at this writing, January 29, I just finished Hindawi), total articles were just over 16,000 (with more to come on a second pass) in 2015, just under 14,000 in 2014, just over 10,000 in 2013, around 8,500 in 2012, and around 4,500 in 2011.
  • But most of those articles are from Frontiers Media. Eliminating them and Bentham brings article counts down to the 1,700-2,500 range. That’s considerably less than one half of one percent of total serious OA articles.
  • The most realistic counts—those where Beall’s made more than a weak case—show around 150 articles for 2015, around 200-250 for 2013 and 2014, around 1,000 for 2012 and around 780 for 2011. (Those numbers will go up, but probably not by much; there was one active journal that’s mostly fallen by the wayside since 2012.)
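Here’s a rough sketch of the sort of DOAJ cross-matching behind those bullets, assuming a CSV export of the working spreadsheet and a DOAJ metadata export. The filenames and column names are hypothetical, and this version only checks standalone journals by title; the publisher rows take a separate pass by publisher name.

```python
# Rough sketch: count how many standalone journals on the Beall lists also
# appear in DOAJ, matching on normalized title. Filenames and column names
# are hypothetical; real matching needs ISSNs and publisher names too.
import csv

def norm(title):
    return " ".join(title.lower().split())

with open("doaj_metadata.csv", newline="", encoding="utf-8") as f:
    doaj_titles = {norm(row["Journal title"]) for row in csv.DictReader(f)}

with open("beall2016-working.csv", newline="", encoding="utf-8") as f:
    matches = sum(
        1 for row in csv.DictReader(f)
        if row["Type"] == "Jrn" and norm(row["Name"]) in doaj_titles
    )

print(f"{matches} matched ({matches / len(doaj_titles):.1%} of DOAJ journals)")
```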

The conclusion to this too-long short version: Beall’s lists are mostly the worst possible kind of blacklist: one where there’s no stated reason for things to be included. If you’re comfortable using “trust me” as the basis for a tool, that’s your business. My comment might echo those of Joseph Welch, but that would be mean.

Oh, by the way: you can download the trimmed version of Beall’s lists (with partial article counts for journals in DOAJ, admittedly lacking some of them). It’s available in .csv form for minimum size and maximum flexibility. Don’t use it as a blacklist, though: it’s still far too inclusive, as far as I’m concerned.

Modified 1/30: Apparently the original filename yields a 404 error; I’ve renamed the file, and it should now be available. (Thanks, Marika!)

Gold Open Access Journals 2011-2015: A SPARC Project

Friday, January 22nd, 2016

I’m delighted to announce that SPARC (the Scholarly Publishing and Academic Resources Coalition) is supporting the update of Gold Open Access Journals 2011-2015 to provide an empirical basis for evaluating Open Access sustainability models. I am carrying out this project with SPARC’s sponsorship, building from and expanding on The Gold OA Landscape 2011-2014.

The immediate effect of this project is that the dataset for the earlier project is publicly available for use on zenodo.org and on my personal website. The data is public domain, but attribution and feedback are both appreciated.

Here’s what the rest of the project means:

  • I am basing the study on the Directory of Open Access Journals as of December 31, 2015. With eleven duplicates (same URL, different journal names, typically in two languages) removed and reported back to DOAJ, that means a starting point of 10,948 journals. All journals will be accounted for, and as many as feasible will be fully analyzed. (A minimal sketch of the URL-based de-duplication appears after this list.)
  • The grades and subgrades have been simplified and clarified, and two categories of journal excluded from the 2014 study will now be included (but tagged so that they can be counted separately if desired): journals consisting primarily of conference reports peer-reviewed at the conference level, and journals that require free registration to read articles.
  • I’m visiting all journal sites (and using DOAJ as an additional source) to determine current article processing charges (if any), add 2015 article counts to data carried over from the 2014 project, clean up article counts as feasible, and add 2011-2014 article counts for journals not in the earlier report.
  • Since some journals (typically smaller ones) take some time to post articles, and since some journals will not be analyzed for various reasons (malware, inability to access, difficulty in translating site or counting articles), I’ll be doing a second pass for all those requiring such a pass, starting in April 2016 or after the first pass is complete. My intent is to include as many journals as possible (the presence of malware is an automatic stopping point), but that doesn’t extend to (for example) going through each issue of a weekly journal available only in PDF form.
  • The results will be written up in a form somewhat similar to The Gold OA Landscape 2011-2014, refined based on feedback and discussion.
  • Once the analysis and preparation are complete, the dataset (in anonymized form) will be made freely available at appropriate sites and publicized as available.
  • The PDF version of the final report will be freely available and carry an appropriate Creative Commons license.
  • A paperback version of the final report will be available; details will be announced closer to publication.
  • A shorter version of the final report will appear in Cites & Insights, and it’s likely that notes along the way will also appear there.
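As referenced in the first bullet, here is a minimal sketch of the URL-based de-duplication, assuming a CSV export of DOAJ metadata; the filename and column name are placeholders.

```python
# Minimal sketch of the URL-based de-duplication described in the first
# bullet: keep one row per normalized journal URL. The filename and column
# name are placeholders for a DOAJ metadata export. (Requires Python 3.9+.)
import csv

def norm_url(url):
    url = url.strip().lower().rstrip("/")
    for prefix in ("https://", "http://", "www."):
        url = url.removeprefix(prefix)
    return url

seen, kept, duplicates = set(), [], []
with open("doaj_20151231.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        key = norm_url(row["Journal URL"])
        (duplicates if key in seen else kept).append(row)
        seen.add(key)

print(f"{len(kept)} journals kept, {len(duplicates)} duplicate rows flagged")
```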

My thanks to SPARC for making this possible.

Dataset for The Gold OA Landscape 2011-2014 now available

Thursday, January 21st, 2016

I’m pleased to announce that the anonymized dataset used to prepare The Gold OA Landscape 2011-2014 is now available for downloading and use.

The dataset–an Excel .xlsx file with two worksheets–includes 9,824 rows of data, one for each journal graded A through C (and, thus, fully analyzed) in the project. Each row has a dozen columns. The columns are described on the second worksheet, “data_key.”
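If you’d rather explore the dataset programmatically than in Excel, here’s a minimal loading sketch, assuming pandas with openpyxl installed; the filename is a placeholder, and the assumption that the journal rows sit on the first worksheet follows the description above.

```python
# Minimal sketch: load the anonymized dataset and its column descriptions.
# The filename is a placeholder; "data_key" is the worksheet described
# above, and the journal rows are assumed to be on the first worksheet.
import pandas as pd

PATH = "gold_oa_landscape_2011_2014.xlsx"
journals = pd.read_excel(PATH, sheet_name=0)           # one row per graded journal
data_key = pd.read_excel(PATH, sheet_name="data_key")  # column descriptions

print(f"{len(journals)} journal rows, {len(journals.columns)} columns")
print(data_key)
```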

I would love to be able to say that this dataset was now on figshare–but after spending far too much time attempting to complete the required fields and publish the dataset, it appears that the figshare mechanisms are at least partly broken. When (if) I receive assurances that the scripts (which fail in current versions of Chrome, Firefox and Internet Explorer) have been fixed, I’ll add the dataset there–although I’d be happy to hear about other no-fee dataset sharing sites that actually work. (It’s possible that figshare just doesn’t much care for free personal accounts any more: I also note that the counts of dataset usage that were previously available have disappeared.)

Update January 22, 2016: This dataset is now available on zenodo.org. (Hat-tip to Thomas Munro.)

As always, the best way to understand the data in this spreadsheet is via either the paperback version or the PDF ebook site-licensed version of The Gold OA Landscape 2011-2014.


Note: This isn’t quite the “Watch This Space” announcement foreshadowed in Cites & Insights 16:2, and it doesn’t mean that sales of the book have suddenly mushroomed. That announcement–which is related to this one–should come in a few days.

By the way, while the dataset consists of facts and is therefore in the public domain, I’d appreciate being told about uses of the spreadsheet and certainly appreciate proper attribution. Send me a note at waltcrawford@gmail.com

I’d also love your suggestions as to ways the presentation in the book could be improved if or when there’s a newer version…leave a comment or, again, send email to waltcrawford@gmail.com

“Trust me”: The Apparent Case for 90% of Beall’s List Additions

Thursday, January 7th, 2016

I’ve tried to stay away from Beall and his Lists, but sometimes it’s not easy.

The final section of the Intersections essay in the January 2016 Cites & Insights recounts a quick “investigation” into the rationales Beall provided for placing 223 publishers on his 2014 list. Go to page 8: it’s the section titled “Lagniappe: The Rationales, Once Over Easy.” It turned out that I could find any rationale at all for condemning the publishers in only 35% of cases.

Perhaps too charitably, I assumed that it was because Beall’s blog changed platforms and he didn’t take the time to restore older posts to the new blog.

Then I noted his 2016 lists–which add 230 (or more) publishers and 375 (or more) independent journals to the 2015 lists. I say “or more” because at least one major publisher has been removed via the Star Chamber Appeal Process, even though Beall continues to attack the publisher as unworthy.

In any case: 605 new listings. My recollection is that there haven’t even been close to 605 posts on Beall’s blog in the past year… but I thought I’d check it out.

The results: As far as I can tell, posts during 2015 include around 60 new publishers and journals. (I may have missed a couple of “copycat” journals, so let’s call it 65).

Sixty or 65. Out of 605.

In other words: for roughly 90% of publishers (most of them really “publishers,” I suspect) and journals added to the list, there is no published rationale whatsoever for Beall’s condemnation.

None.

So if you’re wondering why I regard Beall as irrelevant to the reality of open access publishing (which isn’t all sweetness & light, any more than the reality of subscription publishing), there’s one answer.

The Gold OA Landscape 2011-2014: additional info

Thursday, December 17th, 2015

No sales pitch this time. If you think this is valuable information, maybe your library should buy it. Meanwhile:

Questionable Journals by Country

I’m working on a big OA roundup for Cites & Insights, and was noting a ScienceNews article that began with a frequently repeated false claim: that most OA journals charge APCs. But I noted that the author was Indian, and took a look:

For India (as represented by DOAJ-listed journals I regarded as serious), he’s right: a slight majority of the 438 journals do charge APCs (53.7%).

That’s readily available in the book, in Chapter 6.

But I also thought: what about DOAJ-listed journals that I graded “C” (and so did not include in the main analysis)?

Turns out that almost a third of those 312 journals are published in India—it’s second only to the United States (but the United States publishes 996 serious OA journals, compared to India’s 438).

Here’s a table showing the countries that publish more than two journals I graded “C” (for hidden APCs or flat-out lies or impossible peer review turnaround or…):

Country Count
United States 118
India 100
Pakistan 21
Iran, Islamic Republic of 12
United Kingdom 7
Italy 5
Canada 4
South Korea 4
Brazil 3
Spain 3

Here’s the comparable table for serious journals (grades A and B), including countries with 100 or more such journals (this is a portion of Table 6.1):

Country Journals
United States 996
Brazil 929
United Kingdom 649
Spain 517
Egypt 493
India 438
Germany 315
Romania 285
Italy 277
Iran, Islamic Republic of 269
Turkey 260
Poland 258
Canada 254
Colombia 242
Switzerland 216
France 166
Argentina 146
Mexico 146
Chile 138
Indonesia 136
New Zealand 115
Australia 111
Russian Federation 100

Other Excluded Journals

What about the 777 journals that I excluded for other reasons (see the book for details, Table 2.1 and accompanying text)?

Here’s a table showing the countries with five or more such journals:

Country Count
United States 130
Brazil 53
Spain 51
India 49
Germany 36
Turkey 36
Egypt 31
Romania 31
Russian Federation 27
Italy 26
United Kingdom 23
Colombia 22
Iran, Islamic Republic of 17
China 16
France 16
Argentina 13
Venezuela, Bolivarian Republic of 13
Mexico 11
Pakistan 10
Chile 9
Portugal 9
Ukraine 9
Canada 8
Poland 8
Switzerland 8
Indonesia 7
Serbia 7
Austria 6
Australia 5
Cuba 5

I wouldn’t attempt to conclude much from this list, since it’s such a hodgepodge of reasons for not fully analyzing the journals. Some aren’t OA as I define it (or are all conference proceedings), some were too difficult to count (mostly because they’re full-issue PDFs), a whole bunch were unreachable, I’ve already posted about the malware-laden ones, and so on…

Status Update

Not much to say here: the total so far is 10 paperback copies and three PDF ebook copies; three copies in the last month.

Why you should buy The Gold OA Landscape, for various values of “you.”

Tuesday, December 1st, 2015

The PDF ebook version of The Gold OA Landscape 2011-2014 appeared on September 10, 2015. To date (nine days short of three months), it has sold three copies.

The paperback version of The Gold OA Landscape 2011-2014 appeared on September 11, 2015. To date (eight days short of three months), it has apparently sold nine copies (but it’s possible there are November sales on Amazon, Ingram and Barnes & Noble that haven’t yet been reported).

My September 10, 2015 post offered seven good reasons why libraries, OA advocates and OA publishers might want to buy the book. Those reasons are still a good overall set, so I’ll repeat them here, followed by a little comment on “various values of ‘you’.”

Overall reasons “you” should buy this book

  1. It’s the first comprehensive study of actual publishing patterns in gold OA journals (as defined by inclusion in the Directory of Open Access Journals as of June 15, 2015).
  2. I attempted to analyze all 10,603 journals (that began in 2014 or earlier), and managed to fully analyze 9,824 of them (and I’d say a fully multilingual team would only add 20 more: that’s how many journals I just couldn’t cope with because Chrome/Google didn’t overcome language barriers).
  3. The book offers considerable detail on 9,512 journals (that appear not to be questionable or nonexistent) and what they’ve published from 2011 through 2014, including APC levels, country of publication, and other factors.
  4. It spells out the differences among 28 subject groups (in three major segments) in what’s clearly an extremely heterogeneous field. The 28 pictures of smaller groups of journals are probably more meaningful than the vast picture of the whole field.
  5. If enough people buy this (either edition), an anonymized version of the source spreadsheet will be made available on figshare.
  6. If enough people buy this (either edition), it will encourage continuation of the study for 2015.
  7. Mostly, it’s good to have real data about OA. Do most OA articles involve fees? It depends: in the humanities and social sciences, mostly not; in STEM and biomed, mostly yes. Do most OA journals charge fees? It depends–in biology, yes, but in almost all other fields, no.

Other stuff

Since those first posts, I’ve offered a number of specifics from some chapters (and published an excerpted version of the book–about one-third of it, with none of the graphs–as the October 2015 Cites & Insights). Through yesterday (November 30, 2015), that issue has been downloaded 2,686 times: 1,992 in the single-column format (decidedly preferable in this case), 694 in the traditional print-oriented two-column format.

If one of every ten downloads resulted in a purchased copy (through Lulu), the continuation of this project would be assured for the next two years (assuming I’m still around and healthy). That is:

  • An anonymized version of the current spreadsheet would be up on figshare, available for anybody to use.
  • I would carry out a full 2015 study (and update of the existing study) based on DOAJ as of early January 2016.
  • The PDF version of the results would be available for free and the anonymized spreadsheet would be on figshare.
  • The paperback version would be available at a modest price, probably under $25.
  • For 2016 data (DOAJ as of early 2017), the same thing would happen.

Heck, if one out of every fifty downloads resulted in a copy purchased through Lulu, an anonymized version of the current spreadsheet would be up on figshare. (If one out of every ten downloads resulted in an Ingram/B&N/Amazon sale, the spreadsheet would be up and I’d certainly carry out the 2015 study and make the spreadsheet available, but perhaps not the free PDF or minimally-priced paperback.)

Where we are, though, is at a dozen: twelve copies to date. Now, maybe all the advocates and publishers are at the seemingly endless series of open access conferences (or maybe it just seems that way from OATP and twitter coverage) and haven’t gotten around to ordering copies.

It’s interesting (or not) to note that Worldcat.org currently shows that 1,230 libraries own copies of Open Access: What You Need to Know Now. Which is still, to be sure, a relevant and worthwhile quick summary of OA.

“It’s early yet,” I continue saying, albeit more softly each time. I don’t want to believe that there’s simply no real support for this kind of real-world detailed measurement of serious Gold OA in action (where “support” has to be measured by willingness to contribute, not just willingness to download freebies), but it’s not looking real promising at the moment. I’ve already seen that a tiny sampling regarding an aspect of OA done by Respectable Scholars will get a lot more coverage and apparent interest than a complete survey, to the extent that disputing the results of that sampling begins to seem useless.

Various values of “you”

What do I believe the book has to offer “you”? A few possibilities:

You, the academic library

If your institution includes a library school (or an i-school), it almost seems like a no-brainer: $55 buys you campuswide electronic access to an in-depth study of an important part of scholarly publishing’s present and future–showing how big a part it already is, its extent in various fields, how much is or isn’t being spent on it, what countries are most involved in each subject, and on and on…

For the rest of you, it seems like you’d also want to have some detailed knowledge of the state of serious gold OA, since that has the best chance of increasing access to scholarly publications and maybe, perhaps, either slowing down the rate of increase in serials costs or even saving some money.

For that matter, if your library is either starting to publish open access journals or administering an APC support fund, shouldn’t you know more about the state of the field? If, for example, you plan a journal in language and linguistics, it should be useful to know that there are more than 500 of them out there; that almost none of them charge APCs; that of those that do, only six charge more than $353; that the vast majority (350) published no more than 18 articles in 2014; and that Brazil is the hotbed of gold OA publishing in these areas. (Those are just examples.)

You, the open access advocate

You really should have this book at hand when you’re reading various commentaries with dubious “facts” about the extent of OA publishing and charges for that publishing.

Too bad there’s no open access activity in the humanities and social sciences? Nonsense! While most serious gold OA journals in these fields are relatively small, there are a lot of them–more than 4,000 in all–and they’ve accounted for more than 95,000 articles in each of the years 2012-2014, just under 100,000 in 2014. More than three-quarters of those articles didn’t involve APCs, and total potential revenues for the segment didn’t reach $10 million in 2014, but there’s a load of activity–with the biggest chunks in Brazil, the United States, Spain, Romania and Canada, but with 22 nations publishing at least 1,000 articles each in 2014 (Singapore is the 22nd).

Those are just a few data points. This book offers a coherent, detailed overview, and I believe it would make you a more effective advocate. And if you deeply believe that readers should never have to pay for anything involved with open access, well, I invite you to help find me grant or institutional funding, so that can happen.

You, the open access publisher

Surely you should know where your journal(s) stand in comparison to the overall shape of OA and of specific fields? Just as surely, you should want this research to continue–and buying the book (or contributing directly) is the way that will happen. (On the other hand, if you publish one of the 65 journals that appear to have malware, you really, truly need to take care of that–and I’ve already published that list for free.)

You, none of the above

If you’re a library person who cares about OA or about the health of your libraries, but you’re not really an advocate, chances are you stopped reading long ago. If not, well, you should also find the book worthwhile.

Otherwise? I suspect that at this point I’m speaking to an empty room, so I’ll stop.

The next update will probably appear when Amazon/B&N/Ingram figures for November appear in my Lulu stream, some time in the next week or two.

Oh: one side note: I mentioned elsewhere that the back cover of the book is just “OA gold” with the ISBN. What I mean by “OA gold” is the precise shade of gold used in the OA open-lock logo as it appears in Wikimedia. I downloaded the logo and used Paint.net’s color chooser to make that the background color for the entire cover. (I never was able to get a suitable shade of gold/orange using other techniques.)

Here’s the book cover, in case you weren’t aware of it:

[Book cover image: oa14c300]

One-third of the way there!

Sunday, November 22nd, 2015

With today’s French purchase of a PDF copy of The Gold OA Landscape 2011-2014, and including Cites & Insights Annual purchases, we’re now one-third of the way to the first milestone, at which I’ll upload an anonymized version of the master spreadsheet to figshare. (As with a previous German purchase, I can only assume the country based on Lulu country codes…)

Now an even dozen copies sold.

Lagniappe: The Rationales, Once Over Easy

Friday, November 13th, 2015

[This is the unexpected fourth part of PPPPredatory Article Counts: An Investigation. Before you read this, you should read the earlier posts—Part 1, Part 2 and Part 3—and, of course, the December 2014 Cites & Insights.]

Yes, I know, it’s hard to call it lagniappe when it’s free in any case. Still, I did spend some time doing a first-cut version of the third bullet just above: that is, did I find clear, cogent, convincing explanations as to why publishers were questionable?

I only looked at the 223 multijournal publishers responsible for 6,429 journals and “journals” (3,529 of them actual gold OA journals that published articles at some point in 2011-2014) from my trimmed dataset (excluding DOAJ journals and some others). I did not look at the singleton journals; that would have more than doubled the time spent on this.

Basically, I searched Scholarly Open Access for each publisher’s name and read the commentary carefully—if there was a commentary. If there was one, I gauged whether it constituted a reasonable case for considering all of that publisher’s journals sketchy at the time the commentary was written, or if it fell short of being conclusive but made a semi-plausible case. (Note the second italicized clause above: journals and publishers do change, but they’re only removed from the list after a mysterious appeals process.)

But I also looked at my own annotations for publishers—did I flag them as definitely sketchy or somewhat questionable, independently of Beall’s comments? I’m fairly tough: if a publisher doesn’t state its APCs or its policy or makes clearly-false statements or promises absurdly short peer review turnaround, those are all red flags.

Beall Results

For an astonishing 65% of the publishers checked there was no commentary. The only occurrences of the publishers’ names were in the lists themselves.

The reason for this is fairly clear. Beall’s blog changed platforms in January 2012, and Beall did not choose to migrate earlier posts. These publishers—which account for 41% of the journals and “journals” in my analysis and 38% of the active Gold OA journals—were presumably earlier additions to the list.

This puts the lie to the claims of some Beall fans that he clearly explains why each publisher or journal is on the list, including comments from those who might disagree. That claim is simply not true for most of the publishers I looked at, representing 38% of the active journals, 23% of the 2014 articles, and 20% of the projected 2014 revenues.

My guess is that it’s worse than this. I didn’t attempt to find rationales for the individual journals; although those journals represent only 5% of the active journals I studied, they’re extremely prolific, accounting for 38% of 2014 articles (and 13% of 2014 potential revenue).

If Beall was serious about his list being a legitimate tool rather than a personal hobbyhorse, of course, there would be two ongoing lists (one for publishers, one for independent journals) rather than an annual compilation—and each entry would have two portions: the publisher or journal name (with hyperlink), and a “Rationale” tab linking to Beall’s explanation of why the publisher or journal is there. (Those lists should be pages on the blog, not posts, and I think the latest ones are.) Adding such links to the rationale posts would be relatively trivial compared to the overall effort of evaluating publishers, and it would add considerable accountability.

In another 7% of cases, I couldn’t locate the rationale but can’t be sure there isn’t one: some publishers have names composed of such generic words that I could never be quite sure whether I’d missed a post. (The search box doesn’t appear to support phrase searches.) That 7% represents 4% of active journals in the Beall survey, 4% of 2014 articles, but only 1.7% of potential 2014 revenue.

Then there are the others—cases where Beall’s rationale is available. As I read the rationales, I conclude that Beall made a sufficiently strong case for 9% of the publishers, a questionable but plausible case for 11%–and, in my opinion, no real case for 9% of the publishers.

Those figures break out to active journals, articles and revenues as follows:

  • Case made—definitely questionable publishers: 22% of active journals, 11% of 2014 articles, 41% of 2014 potential revenues. (That final figure is particularly interesting.)
  • Questionable—possibly questionable publishers: 16% of active journals, 16% of 2014 articles, 18% of 2014 potential revenues.
  • No case: 14% of active journals, 7% of 2014 articles, 6% of 2014 potential revenues.

If I wanted to suggest an extreme version, I could say that I was able to establish a strong case for definitely questionable publishing for fewer than 12,000 published articles in 2014—in other words, less than 3% of the activity in DOAJ-listed journals.

But that’s an extreme version and, in my opinion, dead wrong, even without noting that it doesn’t allow for any of the independent journals (which accounted for nearly 40,000 articles in 2014) being demonstrably sketchy.

Combined Results

Here’s what I find when I combine Beall’s rationales with my own findings when looking at publishers, ignoring independent journals:

  • Definitely questionable publishers: Roughly 19% of 2014 articles, or about 19,000 within the subset studied, and 44% of potential 2014 revenue, or about $11.4 million. (Note that the article count is still only about 4% of serious OA activity—but if you add in all independent journals, that could go as high as 59,000, or 12%.) Putting it another way, about 31% of articles from multijournal publishers in Beall’s list were in questionable journals.
  • Possibly questionable publishers: Roughly 21% of 2014 articles (34% excluding independent journals) and 21% of 2014 potential revenues.
  • Case not made: Roughly 22% of 2014 articles (36% excluding independent journals) and 22% of 2014 potential revenues.

It’s possible that some portion of that 22% is sketchy but in ways that I didn’t catch—but note that the combined score is the worst of Beall’s rationale or my independent observations.

So What?

I’ve said before that the worst thing about the Shen/Björk study is that it’s based on a fatally flawed foundation, a junk list of one man’s opinions—a man who, it’s increasingly clear, dislikes all open access.

My attempts to determine Beall’s cases confirmed that opinion. In far too many cases, the only available case is “trust me: I’m Jeffrey Beall and I say this is ppppredatory.” Now, of course, I’ve agreed that every journal is ppppredatory, so it’s hard to argue with that—but easy to argue with his advice to avoid all such journals, except as a call to abandon journal publishing entirely.

Which, if you look at it that way, makes Jeffrey Beall a compatriot to Björn Brembs. Well, why not? In his opposition to all Gold OA, he’s already a compatriot to Stevan Harnad: the politics of access makes strange alliances.

Otherwise, I think I’d conclude that perhaps a quarter of articles in non-DOAJ journals are from publishers that are just…not in DOAJ. The journals may be serious OA, but the publishers haven’t taken the necessary steps to validate that seriousness. They’re in a gray area.

Monitoring the Field

Maybe this also says something about the desirability of ongoing independent monitoring of the state of gold OA publishing. When it comes to DOAJ-listed journals, my approach has been “trust but verify”: I checked to make sure the journals actually did make APC policies and levels clear, for example, and that they really were gold OA journals. When it comes to Beall’s lists, my approach was “doubt but verify”: I didn’t automatically assume the worst, but I’ll admit that I started out with a somewhat jaundiced eye when looking at these publishers and journals.

I also think this exercise says something about the need for full monitoring, rather than sampling. The difference between even well-done sampling (and I believe Shen/Björk did a proper job) and full monitoring, in a field as wildly heterogeneous as scholarly journals, is just too large: about three to one, as far as I can tell.

As I’ve made clear, I’d be delighted to continue such monitoring of serious gold OA (as represented by DOAJ), but only if there’s at least a modest level of fiscal support. The door’s still open, either for hired consultation, part-time employment, direct grants or indirect support through buying my books (at this writing, sales are still in single digits) or contributing to Cites & Insights. But I won’t begin another cycle on spec: that single-digit figure [barely two-digit figure, namely 10 copies] after two full months, with no apparent likelihood of any other support, makes it foolhardy to do so. (waltcrawford@gmail.com)

As for the rest of gold OA, the gray area and the questionable publishers, this might be worth monitoring, but I’ve said above that I’m not willing to sign up for another round based on Beall’s lists, and I don’t know of any other good way to do this.

PPPPredatory Article Counts: An Investigation Part 3

Wednesday, November 11th, 2015

If you haven’t read Part 1 and Part 2—and, to be sure, Cites & Insights December 2015—none of this will make much sense.

What would happen if I replicated the sampling techniques actually used in the study (to the extent that I understand the article)?

I couldn’t precisely replicate the sampling. My working dataset had already been stripped of several thousand “journals” and quite a few “publishers,” and I took Beall’s lists a few months before Shen/Björk did. (In the end, the number of journals and “journals” in their study was less than 20% larger than in my earlier analysis, although there’s no way of knowing how many of those journals and “journals” actually published anything. In any case, if the Shen/Björk numbers had been 20% or 25% larger than mine, I would have said “sounds reasonable” and let it go at that.)

For each tier in the Shen/Björk article, I took two samples, both using random techniques, and for all but Tier 4, I used two projection techniques—one based on the number of active true gold OA journals in the tier, one based on all journals in the tier. (For Tier 4, singleton journals, there’s not enough difference between the two to matter much.) In each tier, I used a sample size and technique that followed the description in the Shen/Björk article.
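For the mechanically inclined, here’s a compressed sketch of the per-tier projection just described, under one plausible reading; the row structure and field names are my placeholders, and “active” stands for a true gold OA journal that actually published articles.

```python
# Sketch of the per-tier sampling and projection just described: draw a
# random sample within one tier, then project tier-wide article totals two
# ways, scaled to all journals in the tier and scaled to the tier's active
# true gold OA journals only. Row structure and field names are placeholders.
import random

def project_tier(rows, sample_size, year="2014"):
    sample = random.sample(rows, sample_size)
    # Projection 1: mean over every sampled row, scaled to all journals.
    mean_all = sum(r["articles"][year] for r in sample) / len(sample)
    # Projection 2: mean over active gold OA rows only, scaled to the
    # tier-wide count of active gold OA journals.
    active_sample = [r for r in sample if r["active_gold_oa"]]
    active_total = sum(1 for r in rows if r["active_gold_oa"])
    mean_active = (
        sum(r["articles"][year] for r in active_sample) / len(active_sample)
        if active_sample else 0.0
    )
    return {
        "all_journals": mean_all * len(rows),
        "active_gold_oa_journals": mean_active * active_total,
    }

# Example call (placeholder data): project_tier(tier2_rows, sample_size=100)
```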

The results were interesting. Extreme differences between the lowest sample and the highest sample include 2014 article counts for Tier 2 (publishers with 10 to 99 journals), the largest group of journals and articles, where the high sample was 97,856 and the low—actually, in this case, the actual counted figure—was 46,770: that’s a 2.09 to 1 range. There’s also maximum revenue, where the high sample for Tier 2 was $30,327,882 while the low sample (once again the counted figure) was $9,574,648: a 3.17 to 1 range—in other words, a range wide enough to explain the difference between my figures and the Shen/Björk figures purely on the basis of sample deviation. (It could be worse: the 2013 projected revenue figures for Tier 2 range from a high of $41,630,771 to a low of $8,644,820, a range of 4.82 to 1! In this case, the actual sum was just a bit higher than the low sample, at $8,797,861.)

Once you add the tiers together, the extremes narrow somewhat. Table 7 shows the low, actual, and high total article projections, noting that the 2013, 2012, and 2011 low and high might not be the actual extremes (I took the lowest and highest 2014 figures for each tier, using the other figures from that sample.) It’s still a broad range for each year, but not quite as broad. (The actual numbers are higher than in earlier tables largely because journals in DOAJ had not been excluded at the time this dataset was captured.)

2014 2013 2012 2011
Low 134,980 130,931 92,020 45,605
Actual 135,294 115,698 85,601 54,545
High 208,325 172,371 136,256 84,282

Table 7. Article projections by year, stratified sample

The range for 2014 is 1.54 to 1: broad, but narrower than in the first two attempts. On the other hand, the range for maximum revenues is larger than in the first two attempts: 2.18 to 1 for 2014 and a very broad 2.46 to 1 for 2013, as in Table 8.

2014 2013
Low $30,651,963 $29,145,954
Actual $37,375,352 $34,460,968
High $66,945,855 $71,589,249

Table 8. Maximum revenue projections, stratified sample

Note that the high figures here are pretty close to those offered by Shen/Björk, whereas the high mark for projected article count is still less than half that suggested by Shen/Björk. (Note also that in Table 7, the actual counts for 2013 and 2012 are actually lower than the lowest combined samples!)

For the graphically inclined, Figure 4 shows the low, actual and high projections for the third sample. This graph is not comparable to the earlier ones, since the horizontal axis is years rather than samples.

Figure 4. Estimated article counts by year, stratified

It’s probably worth noting that, even after removing thousands of “journals” and quite a few publishers in earlier steps, it’s still the case that only 57% of the apparent journals were actual, active gold OA journals—a percentage ranging from 55% for Tier 1 publishers to 61% for Tier 3.

Conclusion

It does appear that, for projected articles, the stratified sampling methodology used by Shen/Björk may work better than using a pure random sample across all journals—but for projected revenues, it’s considerably worse.

This attempt could answer the revenue discrepancy, which in any case is a much smaller discrepancy (as noted, my average APC per article is considerably higher than Shen/Björk’s)—but it doesn’t fully explain the huge difference in article counts.

Overall Conclusions

I do not doubt that Shen/Björk followed sound statistical methodologies, which is quite different than agreeing that the Beall lists make a proper subject for study. The article didn’t identify the number of worthless articles or the amount spent on them; it attempted to identify the number of articles published by publishers Beall disapproved of in late summer 2014, which is an entirely different matter.

That set aside, how did the Shen/Björk sampling and my nearly-complete survey wind up so far apart? I see four likely reasons:

  • While Shen/Björk accounted for empty journals (but didn’t encounter as many as I did), they did not control for journals that have articles but are not gold OA journals. That makes a significant difference.
  • Sampling is not the same as counting, and the more heterogeneous the universe, the more that’s true. That explains most of the differences, I believe (on the revenue side, it can explain all of them).
  • The first two reasons, enhanced by two or three months’ of additional listings, combined to yield a much higher estimate of active journals than my survey: more than twice as many.
  • The second reason resulted in a much higher average number of articles per journal than in my survey (53 as compared to 36), which, combined with the doubled number of journals, neatly explains the huge difference in article counts.

The net result is that, while Shen/Björk carried out a plausible sampling project, the final numbers raise needless alarm about the extent of “bad” articles. Even if we accept that all articles in these projections are somehow defective, which I do not, the total of such articles in 2014 appears to be considerably less than one-third of the number of articles published in serious gold OA journals (that is, those in DOAJ)—not the “nearly as many” the study might lead one to assume.

No, I do not plan to do a followup survey of publishers and journals in the Beall lists. It’s tempting in some ways, but it’s not a good use of my time (or anybody else’s time, I suggest). A much better investigation of the lists would focus on three more fundamental issues:

  • Is each publisher on the primary list so fundamentally flawed that every journal in its list should be regarded as ppppredatory?
  • Is each journal on the standalone-journal list actually ppppredatory?
  • In both cases, has Beall made a clear and cogent case for such labeling?

The first two issues are far beyond my ken; as to the first, there’s a huge difference between a publisher having some bad journals and it making sense to dismiss all of that publisher’s journals. (See my longer PPPPredatory piece for a discussion of that.)

Then there’s that final bullet…

[In closing: for this and the last three posts—yes, including the Gunslingers one—may I once again say how nice Word’s post-to-blog feature is? It’s a template in Word 2013, but it works the same way, and works very well.]

PPPPredatory Article Counts: An Investigation Part 2

Monday, November 9th, 2015

If you haven’t already done so, please read Part 1—otherwise, this second part of an eventual C&I article may not make much sense.

Second Attempt: Untrimmed List

The first five samples in Part 1 showed that even a 20% sample could yield extreme results over a heterogeneous universe, especially if the randomization was less than ideal.

Given that the most obvious explanation for the data discrepancies is sampling, I thought it might be worth doing a second set of samples, this time each one being a considerably smaller portion of the universe. I decided to use the same sample size as in the Shen/Björk study, 613 journals—and this time the universe was the full figshare dataset: Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. I assigned RAND() to each row, froze the results, then sorted by that column. Each sample was 613 journals; I took 11 samples (leaving 205 journals unsampled but included in the total figures) and adjusted the multipliers accordingly.
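For anyone who’d rather script it than spreadsheet it, here’s a minimal Python equivalent of that procedure; the file and column names are placeholders, and the multiplier is simply the universe size divided by the sample size.

```python
# Minimal equivalent of the spreadsheet procedure: shuffle the rows once
# (the frozen RAND() sort), slice them into eleven 613-row samples, and
# project each sample's 2014 article total with the multiplier (universe
# size / sample size). Leftover rows stay unsampled, as described above.
# File and column names are placeholders.
import csv
import random

with open("beall_not_in_doaj_2014.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

random.shuffle(rows)                  # one-time equivalent of sorting by RAND()
SAMPLE_SIZE = 613
multiplier = len(rows) / SAMPLE_SIZE  # scales a sample total up to the full universe

for i in range(11):
    sample = rows[i * SAMPLE_SIZE:(i + 1) * SAMPLE_SIZE]
    total_2014 = sum(int(r["articles_2014"] or 0) for r in sample)
    print(f"Sample {i + 1}: projected 2014 articles ~ {total_2014 * multiplier:,.0f}")
```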

More than half of the rows in the full dataset have no articles (and no revenue). You could reasonably expect extremely varied results—e.g., it wouldn’t be improbable for a sample to consist entirely of no-article journals or of all journals with articles (thus yielding numbers more than twice as high as one might expect).

In this case, the results have a “dog that did not bark in the night” feel to them. Table 3 shows the 11 sample projections and the total article counts.

Sample 2014 2013 2012 2011
6 88,165 72,034 40,801 20,473
10 91,186 75,025 50,820 31,523
5 95,338 93,886 56,047 27,893
4 97,313 80,978 51,343 36,039
1 99,956 97,153 83,606 52,983
2 105,967 87,468 50,617 20,880
7 106,693 72,658 40,119 29,055
Total 121,311 99,994 64,325 34,543
9 127,747 100,653 73,326 32,075
3 140,292 122,128 77,958 36,634
8 154,754 114,360 79,323 35,632
11 160,591 143,312 91,011 53,579

Table 3. Article projections by year, 9% samples

Although these are much smaller samples (percentagewise) over a much more heterogeneous dataset, the range of results is, while certainly wider than for samples 6-10 in the first attempt, not dramatically so. Figure 3 shows the same data in graphic form (using the same formatting as Figure 1 for easy comparison).

Figure 3. Estimated article counts by year, 9% sample

The maximum revenue samples show a slightly wider range than the article count projections: 2.01 to 1, as compared to 1.82 to 1. That’s still a fairly narrow range. Table 4 shows the figures, with samples in the same order as for article projections (Table 3).

Sample 2014 2013
6 $27,904,972 $24,277,062
10 $32,666,922 $27,451,802
5 $19,479,393 $20,980,689
4 $24,975,329 $25,507,720
1 $30,434,762 $30,221,463
2 $30,793,406 $25,461,851
7 $30,725,482 $21,497,760
Total $31,863,087 $28,537,554
9 $29,642,696 $24,386,137
3 $39,104,335 $41,415,454
8 $36,654,201 $29,382,149
11 $35,420,001 $34,710,583

Table 4. Estimated Maximum Revenue, 9% samples

As with maximum revenue, so with cost per article: a broader range than for the last five samples (and total) in the first attempt, but a fairly narrow range, at 1.75 to 1, as shown in Table 5.

Sample 2014 2013
6 $316.51 $337.02
10 $358.25 $365.90
5 $204.32 $223.47
4 $256.65 $315.00
1 $304.48 $311.07
2 $290.59 $291.10
7 $287.98 $295.88
Total $262.66 $285.39
9 $232.04 $242.28
3 $278.73 $339.12
8 $236.85 $256.93
11 $220.56 $242.20

Table 5. APC per article, 9% samples and total

Rather than providing redundant graphs, I’ll provide one more table: the average (mean) articles per journal (ignoring empty journals), in Table 6.

Sample 2014 2013 2012 2011
6 27.85 20.59 20.66 16.79
10 29.35 20.75 22.73 23.10
1 30.06 25.54 38.13 38.41
5 30.26 27.63 27.18 20.88
4 31.46 22.86 23.42 29.90
2 33.94 24.79 25.08 15.14
7 34.66 20.68 20.17 22.48
Total 36.80 27.47 30.08 25.51
3 42.01 34.90 38.63 27.13
9 42.10 29.75 35.82 26.30
8 43.86 31.25 38.20 26.39
11 47.88 40.12 47.13 38.04

Table 6. Average articles per journal, 9% samples

Note that Table 6 is arranged from lowest average in 2014 to highest average; the rows are not (quite) in the same order as in Tables 3-5. The range here is 1.72 to 1, an even narrower range. On the other hand, sample 11 does show an average articles per journal figure that’s not much below the Shen/Björk estimate.

One More Try

What would happen if I assigned a new random number (again using RAND()) in each row and reran the eleven samples?

The results do begin to suggest that the difference between my nearly-full survey and the Shen/Björk study could be due to sample variation. To wit, this time the article totals range from 64,933 to 169,739, a range of 2.61 to 1. The lowest figure is less than half the actual figure, so it’s not entirely implausible that a sample could yield a number three times as high.

The total revenue range is also wider, from $22.7 million to $41.3 million, a range of 1.82 to 1. It’s still a stretch to get to $74 million, but not as much of a stretch. And in this set of samples, the cost per article ranges from $169.22 to $402.89, a range of 2.38 to 1. I should also note that at least one sample shows a mean articles-per-journal figure of 51.5, essentially identical to the Shen/Björk figure, and that $169.22 is similar to the Shen/Björk figure.

Conclusion

Sampling variation with 9% samples could yield numbers as far from the full-survey numbers as those in the Shen/Björk article, although for total article count it’s still a pretty big stretch.

But that article was using closer to 5% samples—and they weren’t actually random samples. Could that explain the differences?

[More to come? Maybe, maybe not.]