The Problems with Shen/Björk’s “420,000”

[This is Chapter 4 of Gray OA 2012-016, a comprehensive study of journals on “the lists.”]


Cenyu Shen and Bo-Christer Björk published “‘Predatory’ open access: a longitudinal study of article volumes and market characteristics” in BMC Medicine 13, October 2015. (I’m bemused at the idea that this is a medical paper, but that’s a separate discussion.) I started questioning the paper’s conclusions as soon as it appeared, and continued to do so in my blog and in Cites & Insights.

Quite apart from the apparent assumption that Beall’s word is gospel when it comes to journals being “predatory”—an assumption I found, and find, appalling—I thought the numbers were implausible. The authors used a sample of 613 journals to assert that there were around 8,000 active “predatory” journals in 2014 and that those journals published around 420,000 articles in 2014 (up from around 310,000 in 2013 and 212,000 in 2012).

When presented with a case that the numbers were implausible, the authors responded that the article was peer-reviewed and used proper statistical methods. As I was writing this, I took the time to read the open reviewer comments on the article and the authors’ responses. Notably, all of the reviewers said they weren’t qualified to review the statistics—and questions were certainly raised about the assumption that to be on Beall’s list was to be predatory.

The authors are right about one thing: looking at all the journals is a ridiculously large task. But doing that task showed that gray journals are just as heterogeneous as I thought they were, which makes it easy for a 6% sample to be wildly off base.

The First Cut

Now that I’ve done the work, the first thing to note is that the article’s 2014 figure has the first two digits reversed: it’s closer to 240,000 than to 420,000. Of course, the authors did not accidentally transpose digits; they came up with too-large results. Instead of 420,000 for 2014, 310,000 for 2013 and 212,000 for 2012, the figures should be 255,000 for 2014, 189,000 for 2013 and 125,000 for 2012 (rounding to the nearest thousand)—consistently between 59% and 61% of the article’s figures.
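The consistency of that scaling can be verified with a quick calculation, using the rounded totals quoted above:

```python
# Check that the corrected totals run consistently about 59-61% of the
# article's figures (all numbers as quoted in the text, rounded to the
# nearest thousand).
article   = {2012: 212_000, 2013: 310_000, 2014: 420_000}
corrected = {2012: 125_000, 2013: 189_000, 2014: 255_000}

for year in sorted(article):
    ratio = corrected[year] / article[year]
    print(year, f"{ratio:.0%}")   # 59%, 61%, 61%
```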

“255,000 questionable as compared to 560,000 DOAJ” isn’t as astonishing as “nearly as many predatory as not.” That 420,000 figure has been cited a lot, mostly by critics of open access in general.

But there’s more to say…

The Second Cut

The authors were working from an earlier and much smaller pair of Beall lists than those that I worked from. I used the Wayback Machine to download versions of the lists as close as possible to the versions they used (in both cases slightly later, and presumably a little larger, than the versions they actually used). Flagging publisher and journal listings from those earlier versions yields the figures in Table 4.1, including “UA” journals but excluding X-coded ones.

Table 4.1. Journals and articles based on Beall lists at time of Shen/Björk article

Now we’re down from 8,000 active journals to 2,692—and from 420,000 articles to just under 114,000. The percentages are still clustered: now the real numbers are 26% to 28% of those reported in the article. Even if you added 50% to my figures to account for a few dozen not-fully-counted journals (rather than the 5% to 10% I consider plausible), you’d be nowhere near 200,000, let alone 420,000. And, of course, 114,000 is a pretty small fraction of 560,000—just over one-fifth.
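A quick sanity check on those percentages, using the rounded totals quoted above:

```python
# Second-cut totals restricted to the Beall lists as they stood when
# Shen/Bjork downloaded them, compared against the article's 2014
# estimate and DOAJ article volume.
second_cut_2014 = 114_000   # articles, from Table 4.1 ("just under 114,000")
article_2014 = 420_000      # the article's 2014 estimate
doaj_2014 = 560_000         # DOAJ article volume for comparison

share_of_article = second_cut_2014 / article_2014
share_of_doaj = second_cut_2014 / doaj_2014
print(f"{share_of_article:.0%} of the article's estimate")   # ~27%
print(f"{share_of_doaj:.0%} of DOAJ volume")                 # just over one-fifth
```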

Even those numbers involve the odd assumption that Beall’s tagging is definitive. What happens if we reduce the universe to those journals and publishers for which Beall actually made a case?

The Final Cut

Table 4.2. Journals and articles where Beall made a case (columns: 2014, 2013, 2012)

Table 4.2 shows the results: fewer than 30,000 articles in 2014—about 7% of the article’s estimate. (The 2012 and 2013 figures are 6% to 7% of the article’s estimates.) These are cases where Beall not only listed a publisher or journal at the time the authors downloaded the lists, but actually made a case for the journals or publishers being questionable or “predatory.”

Those numbers are too low—but they’re arguably what should have emerged from the study. As noted in Chapter 3, I believe realistic numbers are on the order of 120,000 for 2014; 90,000 for 2013; and 56,000 for 2012—still a lot of articles appearing in questionable journals, but not quite so alarmingly high.

What Went Wrong?

How could these two scholars be so far off? First there’s the assertion that all journals on Beall’s lists are actually predatory. Second, the “stratified” random sampling method involves some tricky assumptions, based on a “suspicion” that was “verified” by sampling all of ten journals—the suspicion “that journals from small publishers often publish a much higher number of articles than those of large publishers.”

The sampling used in this study yielded a much lower percentage of empty journals than my 100% survey. The article estimates that 67% of listings represent active journals; my 100% survey (admittedly of a larger list) shows 40% active journals. That’s an enormous difference: instead of 8,000 active journals from the smaller list, you wind up with around 4,800. That’s probably about right (I show 5,988—but that’s from a much larger list).
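That rescaling is a simple back-of-envelope calculation, using the two activity rates quoted above:

```python
# Back-of-envelope: the article's 67% active rate implied ~8,000 active
# journals; re-scaling the implied total listings with the 40% active
# rate from the full survey gives a much smaller active count.
implied_listings = 8_000 / 0.67      # total listings implied by the article
active_at_40pct = implied_listings * 0.40
print(round(implied_listings))       # ~11,940 listings
print(round(active_at_40pct))        # ~4,776 -> "around 4,800"
```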

Beyond that, it appears that the sheer heterogeneity of journals makes projection from a small sample so dicey as to be useless. Unfortunately, I believe that to be the case.
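A small simulation makes the point. The numbers below are hypothetical, not the paper’s data: a made-up universe of 8,000 journals in which most publish little or nothing and a few publish heavily, sampled at roughly the 6% rate the article used (613 journals):

```python
# Hypothetical illustration: when journal output is heavy-tailed,
# projecting total output from a small sample swings widely from run to run.
import random

random.seed(42)

# 4,000 empty journals, 3,000 small, 900 mid-sized, 100 very large.
universe = [0] * 4000 + [5] * 3000 + [40] * 900 + [400] * 100
true_total = sum(universe)   # 91,000 articles

estimates = []
for _ in range(1000):
    sample = random.sample(universe, 480)               # ~6% sample
    estimates.append(sum(sample) * len(universe) / len(sample))

estimates.sort()
low, high = estimates[25], estimates[974]               # central 95% of runs
print(f"true total: {true_total}")
print(f"95% of projections fall between {low:.0f} and {high:.0f}")
```

The projections are unbiased on average, but any single 6% sample can easily miss the true total by tens of thousands of articles; with a still heavier tail, the spread gets worse.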
