PPPPredatory Article Counts: An Investigation, Part 1

If you read all the way through the December 2015 essay Ethics and Access 2015 (and if you didn’t, you really should!), you may remember a trio of items in The Lists! section relating to “‘Predatory’ open access: a longitudinal study of article volumes and market characteristics” (by Cenyu Shen and Bo-Christer Björk in BMC Medicine). Briefly, the two scholars took Beall’s lists, looked at 613 journals out of nearly 12,000, and concluded that “predatory” journals published 420,000 articles in 2014, a “stunning” increase from 50,000 articles in 2010—and that there were around 8,000 “active” journals that seemed to meet Jeffrey Beall’s criteria for being PPPPredatory (I’m using the short form).

I was indeed stunned by the article—because I had done an essentially complete survey of the Beall lists and found far fewer articles: less than half as many. Indeed, I didn’t think there were anywhere near 8,000 active journals either—if “active” means “actually publishing Gold OA articles,” I’d put the number at roughly half that.

The authors admitted that the article estimate was just that—that it could be off by as much as 90,000. Of course, news reports didn’t focus on that: they focused on the Big Number.

Lars Bjørnshauge at DOAJ questioned the numbers and, in commenting on one report, quoted some of my own work. I looked at that work more carefully and concluded that a good estimate for 2014 was around 135,000 articles, or less than one-third of the Shen/Björk number—and my estimate was based on a nearly 100% actual count, not an estimate from around 6% of the journals.

As you may also remember, Björk dismissed these full-survey numbers with this statement:

“Our research has been carefully done using standard scientific techniques and has been peer reviewed by three substance editors and a statistical editor. We have no wish to engage in a possibly heated discussion within the OA community, particularly around the controversial subject of Beall’s list. Others are free to comment on our article and publish alternative results, we have explained our methods and reasoning quite carefully in the article itself and leave it there.”

I found that response unsatisfying (and find that I’ll approach Björk’s work with a much more jaundiced eye in the future). As I expected, the small-sample report continued (continues?) to get wider publicity, while my near-complete survey got very little.

The situation continued to bother me, because I don’t doubt that the authors followed appropriate methodology, and yet I wondered how the results could be so wrong. How could they come up with more than twice as many active OA PPPPredatory journals and more than three times as many articles?

So I thought I’d look at my own work a little more, to see whether sampling could account for the wild deviation.

First Attempt: The Trimmed List

I began by taking my own copy of Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. The keys on each row of that 6,948-row spreadsheet are designed to be random. The spreadsheet includes not only the active Gold OA journals but also 3,673 others, to wit:

  • 2,045 that had not published any articles between 2011 and 2014, including eight that had explicitly ceased.
  • 183 that were hybrid journals, not gold OA.
  • 413 that weren’t really OA by my standards.
  • 279 that were difficult to count (more on those later).
  • 753 that were either unreachable or wholly unworkable.

There were two additional exclusions: I deleted around 1,100 journals (at least 300 of them empty) from publishers that wouldn’t provide hyperlinked lists of their journal titles—and I deleted journals that are in DOAJ, because there were even more reasons than usual to doubt the PPPPredatory label. (Note that the biggest group of that double-listed category, MDPI, has more recently been removed from Beall’s list.)

I wound up with 3,275 active gold OA journals (the 6,948 rows minus the 3,673 exclusions), which I’ll call “secondary OA journals,” since I think of the DOAJ members as “serious OA journals” and don’t have a good alternative term.

As I started reworking the numbers, I thought there should be some accounting for the opaque publishers and journals. In practice, I knew from some extended sampling that most journals from opaque publishers were either empty or very small—and my sense is that most opaque journals (usually opaque because there are no online tables of contents, only downloadable PDF issues, but sometimes because there really aren’t streams of articles as such) are also fairly small. But still, they should be included. Since these two groups (excluding the 300-odd journals from opaque publishers that I knew were empty) added up to 32% of the count of active journals, I multiplied article and revenue counts by 1.32. (I think this is too high, but feel it’s better to err on the side that will get closer to the Shen/Björk numbers.)
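As a minimal sketch of that adjustment (using the approximate counts given above, so treat the result as illustrative rather than exact), the multiplier is just one plus the uncounted journals’ share of the active count:

```python
# Illustrative only: round figures from the text, not the actual spreadsheet.
active_counted = 3275             # active gold OA journals actually counted
opaque_publisher_journals = 800   # roughly 1,100 from non-linking publishers, minus 300-odd known-empty
difficult_to_count = 279          # journals that were too difficult to count

share_uncounted = (opaque_publisher_journals + difficult_to_count) / active_counted
multiplier = 1 + share_uncounted  # about 1.33 with these round figures; the text settles on 1.32

# Any raw article or revenue total then gets scaled up to allow for the uncounted journals:
adjusted_total = 100_000 * multiplier   # 100,000 here is a purely hypothetical raw count
```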

I did not factor in the DOAJ-included numbers, but the total of those and other already-counted additional articles (doubling 2014 since I only counted January-June) is around 43,000 for 2014; around 39,000 for 2013; around 37,000 for 2012; and around 28,000 for 2011. You can add them to the counts below if you wish—although I don’t believe these represent questionable articles.

Methodology

Since 613 was the sample size in the Shen/Björk article, I took a similarly sized sample as a starting point, then adjusted it so that five samples would, among them, include everything: that is, a sample size of 655 journals (five samples of 655 cover all 3,275).

For each sample (sorting by the pseudorandom key, then starting from the beginning and working my way down), I took the article count for each year, applying the appropriate multipliers, and the revenue counts for 2013 and 2014 (determined by multiplying the 2014 APC by the annual article counts and then applying the appropriate multipliers; I didn’t go back before 2013 because APCs were too likely to have changed). I calculated the average APC per article for 2014 and 2013 by straight division. I also calculated the average article count (not including zero-count journals, because those cells were blank rather than zero) and the median article count (also excluding zero-count journals), and calculated the standard deviation just for amusement.
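Here’s a rough sketch of those per-sample calculations in Python/pandas. The column names (key, apc2014, a2011 through a2014) are hypothetical stand-ins for the real spreadsheet layout, and I’m assuming the projection to the full set simply multiplies each one-fifth sample by five, along with the 1.32 coverage adjustment and the doubling of the half-year 2014 counts; blank article cells load as NaN, so zero-count journals drop out of the averages automatically.

```python
import pandas as pd

# Rough sketch only; file name and column names (key, apc2014, a2011..a2014) are hypothetical.
df = pd.read_csv("beall_not_doaj_subset.csv")  # the 3,275 active gold OA journals
df = df.sort_values("key")                     # order by the pseudorandom key

COVERAGE = 1.32     # allowance for opaque and difficult-to-count journals (see above)
SAMPLE_SIZE = 655   # five consecutive samples cover all 3,275 rows

def sample_stats(sample, projection=5):
    """Per-sample figures: projected articles and revenue, APC per article,
    and mean/median/SD article counts (blank cells are NaN, so they're skipped)."""
    stats = {}
    for year in (2011, 2012, 2013, 2014):
        counts = sample[f"a{year}"]
        factor = 2 if year == 2014 else 1      # 2014 counts cover January-June only
        stats[f"articles_{year}"] = counts.sum() * factor * COVERAGE * projection
        stats[f"mean_{year}"] = counts.mean() * factor
        stats[f"median_{year}"] = counts.median() * factor
        stats[f"sd_{year}"] = counts.std() * factor
    for year in (2013, 2014):
        factor = 2 if year == 2014 else 1
        revenue = (sample["apc2014"] * sample[f"a{year}"]).sum() * factor * COVERAGE * projection
        stats[f"revenue_{year}"] = revenue
        stats[f"apc_per_article_{year}"] = revenue / stats[f"articles_{year}"]
    return stats

samples = [df.iloc[i * SAMPLE_SIZE:(i + 1) * SAMPLE_SIZE] for i in range(5)]
results = [sample_stats(s) for s in samples]
full_set = sample_stats(df, projection=1)      # the same figures for the whole set
```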

“Zero-count journals? Didn’t you eliminate zero-count journals?” I eliminated journals that had no articles in any year 2011-2014, but quite a few journals have articles in some years and not in others—including, of course, newish journals. For example, there were only 2,393 journals with articles in the first half of 2014; 2,714 in 2013; 1,557 in 2012 and 996 in 2011.

I also calculated the same figures for the full set.

Looking at the results, I was a little startled by the wide range, given that these samples were 20% of the whole: the 2014 projected article totals (doubling actual article counts, of course) ranged from 5,755 to 180,299! Now, of course, even that highest count is still much less than half of the Shen/Björk count—and just a bit over half if you add in the DOAJ-listed count.

So I added another column and assigned a random number to each row, using Excel’s RAND function, then froze the results and took a new set of five samples. The results were much narrower in range: 99,713 to 136,660. The actual total: 121,311 (including the 1.32 multiplier but not DOAJ numbers).
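In spreadsheet terms, that means filling a column with =RAND(), then copying it and pasting values so the numbers stop recalculating. Continuing the pandas sketch above (df, SAMPLE_SIZE and sample_stats as defined there), the resampling might look roughly like this:

```python
import numpy as np

rng = np.random.default_rng(2015)   # any fixed seed stands in for "freezing" the RAND results
df["rand"] = rng.random(len(df))    # analogue of a pasted-as-values =RAND() column
resorted = df.sort_values("rand")   # re-sort on the frozen random numbers

new_samples = [resorted.iloc[i * SAMPLE_SIZE:(i + 1) * SAMPLE_SIZE] for i in range(5)]
new_results = [sample_stats(s) for s in new_samples]   # samples 6-10 in Table 1
```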

Table 1 shows the projected (or actual) article totals year-by-year and sample-by-sample, sorted so the lowest 2014 projection appears first. Note that samples 1-5 use the assigned pseudorandom keys, while samples 6-10 use Excel’s RAND function for randomization. Clearly, the latter yields more plausible results.

Sample      2014      2013     2012     2011
4          5,755    21,734   15,959   10,223
5         91,067    85,734   66,594   51,473
8         99,713    84,797   55,209   33,733
7        115,368    91,964   57,664   27,595
Total    121,311    99,994   64,325   34,543
6        123,050   104,808   57,295   22,605
9        131,762   106,181   82,790   53,869
10       136,660   112,220   68,666   34,914
3        159,284   121,097   75,933   27,628
1        170,148   138,890   87,371   56,027
2        180,299   132,515   75,768   27,364

Table 1. Estimated article counts by year

Adding the 43,000-odd articles from DOAJ-listed journals would bring these totals (ignoring samples 1-5) to around 143,000 to around 180,000 articles, with the most likely value around 165,000 articles: more than one-third of the Shen/Björk estimate but a lot less than half.

Note that “120,000 plus or minus 25,000” as an estimate actually covers all five samples that used the RAND-function randomization. Figure 1 shows the same data as Table 1, but in graphic form.

Figure 1. Estimated article counts by year

How much revenue might those articles have brought in, and what’s the APC per article? Keeping the order of samples the same as for Table 1 and Figure 1, Table 2 and Figure 2 show the maximum revenue (not allowing for waivers and discounts).

Sample          2014          2013
4         $2,952,893   $10,473,269
5         $1,677,496    $3,322,988
8        $30,184,480   $23,906,771
7        $35,939,416   $35,825,909
Total    $31,863,087   $28,537,554
6        $31,010,206   $27,926,897
9        $31,165,754   $29,071,218
10       $31,015,578   $25,956,975
3        $82,610,167   $65,930,614
1        $34,247,360   $32,892,328
2        $37,827,517   $30,068,570

Table 2. Estimated maximum revenue, 2014 and 2013

This time there are two extremely low figures and one extremely high figure—with samples 6 through 10 all within $4.1 million of the actual maximum figure for 2014 (for 2013, the largest deviation is $7.3 million). Compare the $31.86 million maximum revenue calculated here with the $74 million estimated by Shen/Björk: the full-survey number is less than half as much.

Figure 2 shows the same information in graphical form.

Figure 2. Estimated maximum revenue, 2014 and 2013

Looking at APC per article, we run into an anomaly: where the Shen/Björk estimate is $178 for 2014, the calculated average for the full survey is considerably higher, $262.66 ($31,863,087 in maximum revenue divided by 121,311 articles). The range of the ten samples runs from a low of $18.42 to a high of $513.08, but the five “good” samples range from $226.95 to $302.71, a reasonably narrow range.

Finally, consider the mean (average) number of articles per journal in 2014, in journals that had articles. The Shen/Björk figure is around 50; my survey yields 36.8. In fact, I show only 327 journals with at least 25 articles in the first half of 2014 (and only 267 with at least 50 articles in all of 2013).

The median is even lower—12 articles, or six in the first half—and that’s not too surprising. The standard deviation in most years was at least twice the average: as usual, these journals are very heterogeneous. How heterogeneous? In the first half of 2014, three journals had more than 1,000 articles each (but fewer than 1,300); six more had at least 500 articles; 16 had 250 to 499 articles—but at the same time, only 819 of the total had at least 11 articles in the first half of 2014, and only 1,544 had at least five articles in those six months.

Conclusion

I could find no way to get from these samples to the Shen/Björk figures. Not even close. They show too many active journals by roughly a factor of two, too many articles by a factor of close to three, and too much revenue by a factor of two—and too many articles per journal as well.

[Part 1 of 2 or 3…]

Note: This and following posts will also appear, probably in somewhat revised form, in the January 2016 issue of Cites & Insights.
