PPPPredatory Article Counts: An Investigation Part 2

If you haven’t already done so, please read Part 1—otherwise, this second part of an eventual C&I article may not make much sense.

Second Attempt: Untrimmed List

The first five samples in Part 1 showed that even a 20% sample could yield extreme results over a heterogeneous universe, especially if the randomization was less than ideal.

Given that the most obvious explanation for the data discrepancies is sampling, I thought it might be worth doing a second set of samples, each a considerably smaller portion of the universe. I used the same sample size as the Shen/Björk study, 613 journals, and this time the universe was the full figshare dataset: Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. I assigned RAND() to each row, froze the results, then sorted by that column. I took 11 samples of 613 journals each, leaving 205 journals unsampled but included in the total figures, and adjusted the projection multipliers to match the new sample size.
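For anyone who wants to try the same procedure outside a spreadsheet, here is a minimal Python sketch. It is a rough stand-in for the steps described above, not the actual workbook: the article counts are synthetic (roughly half the rows empty, the rest drawn from an arbitrary distribution), and the multiplier is assumed to be universe size divided by sample size. The function names are hypothetical.

```python
import random

SAMPLE_SIZE = 613      # same sample size as the Shen/Björk study
NUM_SAMPLES = 11       # 11 x 613 = 6,743 rows sampled
UNIVERSE_SIZE = 6948   # 6,743 sampled + 205 unsampled rows


def make_synthetic_universe(n, rng):
    """Hypothetical stand-in for the figshare rows: about half are empty."""
    counts = []
    for _ in range(n):
        if rng.random() < 0.5:
            counts.append(0)  # journal with no articles
        else:
            # Arbitrary skewed distribution; NOT the real data.
            counts.append(int(rng.expovariate(1 / 35)) + 1)
    return counts


def project_samples(article_counts, sample_size, num_samples, rng):
    """Give each row a random key, sort by it, slice off consecutive
    samples, and scale each sample total up to the full universe."""
    keyed = [(rng.random(), count) for count in article_counts]
    keyed.sort(key=lambda pair: pair[0])
    # Assumed multiplier: universe size / sample size.
    multiplier = len(article_counts) / sample_size
    projections = []
    for i in range(num_samples):
        chunk = keyed[i * sample_size:(i + 1) * sample_size]
        sample_total = sum(count for _, count in chunk)
        projections.append(round(sample_total * multiplier))
    return projections


rng = random.Random(2014)  # fixed seed so the run is repeatable
universe = make_synthetic_universe(UNIVERSE_SIZE, rng)
projections = project_samples(universe, SAMPLE_SIZE, NUM_SAMPLES, rng)

print("actual total: ", sum(universe))
print("projections:  ", sorted(projections))
print("max/min ratio: %.2f" % (max(projections) / min(projections)))
```

Changing the seed corresponds to assigning a fresh set of random numbers and rerunning all eleven samples.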

More than half of the rows in the full dataset have no articles (and no revenue). You could reasonably expect extremely varied results. In principle, a sample could consist entirely of no-article journals, or entirely of journals with articles (thus yielding numbers more than twice as high as one might expect).

In this case, the results have a “dog that did not bark in the night” feel to them. Table 3 shows the 11 sample projections and the total article counts.

Sample 2014 2013 2012 2011
6 88,165 72,034 40,801 20,473
10 91,186 75,025 50,820 31,523
5 95,338 93,886 56,047 27,893
4 97,313 80,978 51,343 36,039
1 99,956 97,153 83,606 52,983
2 105,967 87,468 50,617 20,880
7 106,693 72,658 40,119 29,055
Total 121,311 99,994 64,325 34,543
9 127,747 100,653 73,326 32,075
3 140,292 122,128 77,958 36,634
8 154,754 114,360 79,323 35,632
11 160,591 143,312 91,011 53,579

Table 3. Article projections by year, 9% samples

These are much smaller samples (percentagewise) drawn from a much more heterogeneous dataset, yet the range of results, while certainly wider than for samples 6-10 in the first attempt, is not dramatically wider. Figure 3 shows the same data in graphic form (using the same formatting as Figure 1 for easy comparison).

Figure 3. Estimated article counts by year, 9% sample

The maximum revenue samples show a slightly wider range than the article count projections for 2014: 2.01 to 1, as compared to 1.82 to 1. That’s still a fairly narrow range. Table 4 shows the figures, with samples in the same order as for article projections (Table 3).

Sample 2014 2013
6 $27,904,972 $24,277,062
10 $32,666,922 $27,451,802
5 $19,479,393 $20,980,689
4 $24,975,329 $25,507,720
1 $30,434,762 $30,221,463
2 $30,793,406 $25,461,851
7 $30,725,482 $21,497,760
Total $31,863,087 $28,537,554
9 $29,642,696 $24,386,137
3 $39,104,335 $41,415,454
8 $36,654,201 $29,382,149
11 $35,420,001 $34,710,583

Table 4. Estimated Maximum Revenue, 9% samples
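The range ratios quoted above can be rechecked from the 2014 columns of Tables 3 and 4 (sample rows only, leaving out the Total row). A small Python sketch follows; the `spread` helper name is mine and purely illustrative.

```python
# 2014 columns of Tables 3 and 4, keyed by sample number.
articles_2014 = {
    6: 88165, 10: 91186, 5: 95338, 4: 97313, 1: 99956, 2: 105967,
    7: 106693, 9: 127747, 3: 140292, 8: 154754, 11: 160591,
}
revenue_2014 = {
    6: 27904972, 10: 32666922, 5: 19479393, 4: 24975329, 1: 30434762,
    2: 30793406, 7: 30725482, 9: 29642696, 3: 39104335, 8: 36654201,
    11: 35420001,
}


def spread(values):
    """Ratio of the largest projection to the smallest."""
    values = list(values)
    return max(values) / min(values)


print(round(spread(articles_2014.values()), 2))  # 1.82
print(round(spread(revenue_2014.values()), 2))   # 2.01
```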

As with maximum revenue, so with cost per article: a broader range than for the last five samples (and total) in the first attempt, but still a fairly narrow one, at 1.75 to 1 for 2014, as shown in Table 5.

Sample 2014 2013
6 $316.51 $337.02
10 $358.25 $365.90
5 $204.32 $223.47
4 $256.65 $315.00
1 $304.48 $311.07
2 $290.59 $291.10
7 $287.98 $295.88
Total $262.66 $285.39
9 $232.04 $242.28
3 $278.73 $339.12
8 $236.85 $256.93
11 $220.56 $242.20

Table 5. APC per article, 9% samples and total
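The Table 5 figures appear to be simply maximum revenue divided by projected article count. The Python sketch below recomputes the 2014 column from Tables 3 and 4 on that assumption and compares it with the published values. Small differences of a cent or so are to be expected, presumably because the projections in the tables are rounded.

```python
# 2014 figures: (projected articles, maximum revenue, published cost)
# taken from Tables 3, 4, and 5 respectively.
rows = {
    "6": (88165, 27904972, 316.51),
    "10": (91186, 32666922, 358.25),
    "5": (95338, 19479393, 204.32),
    "4": (97313, 24975329, 256.65),
    "1": (99956, 30434762, 304.48),
    "2": (105967, 30793406, 290.59),
    "7": (106693, 30725482, 287.98),
    "Total": (121311, 31863087, 262.66),
    "9": (127747, 29642696, 232.04),
    "3": (140292, 39104335, 278.73),
    "8": (154754, 36654201, 236.85),
    "11": (160591, 35420001, 220.56),
}

# Assumed formula: cost per article = maximum revenue / articles.
for label, (articles, revenue, published) in rows.items():
    computed = revenue / articles
    print(f"{label:>5}  computed {computed:8.2f}  published {published:8.2f}")

# Range of the published per-sample costs, excluding the Total row.
sample_costs = [pub for label, (_, _, pub) in rows.items() if label != "Total"]
cost_spread = max(sample_costs) / min(sample_costs)
print(round(cost_spread, 2))  # 1.75
```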

Rather than providing redundant graphs, I’ll provide one more table: the average (mean) articles per journal (ignoring empty journals), in Table 6.

Sample 2014 2013 2012 2011
6 27.85 20.59 20.66 16.79
10 29.35 20.75 22.73 23.10
1 30.06 25.54 38.13 38.41
5 30.26 27.63 27.18 20.88
4 31.46 22.86 23.42 29.90
2 33.94 24.79 25.08 15.14
7 34.66 20.68 20.17 22.48
Total 36.80 27.47 30.08 25.51
3 42.01 34.90 38.63 27.13
9 42.10 29.75 35.82 26.30
8 43.86 31.25 38.20 26.39
11 47.88 40.12 47.13 38.04

Table 6. Average articles per journal, 9% samples

Note that Table 6 is arranged from lowest average in 2014 to highest average; the rows are not (quite) in the same order as in Tables 3-5. The range here is 1.72 to 1, narrower still. On the other hand, sample 11 does show an average articles-per-journal figure that’s not much below the Shen/Björk estimate.

One More Try

What would happen if I assigned a new random number (again using RAND()) in each row and reran the eleven samples?

The results do begin to suggest that the difference between my nearly-full survey and the Shen/Björk study could be due to sample variation. To wit, this time the article totals range from 64,933 to 169,739, a range of 2.61 to 1. The lowest figure is only a little more than half the actual figure (121,311), so it’s not entirely implausible that a sample could yield a number three times as high.

The total revenue range runs from $22.7 million to $41.3 million, or 1.82 to 1. That ratio is no wider than in the first set of 9% samples (2.01 to 1), but the top figure is higher. It’s still a stretch to get to $74 million, but not as much of a stretch. And in this set of samples, the cost per article ranges from $169.22 to $402.89, a range of 2.38 to 1. I should also note that at least one sample shows a mean articles-per-journal figure of 51.5, essentially identical to the Shen/Björk figure, and that $169.22 is similar to the Shen/Björk cost-per-article figure.

Conclusion

Sampling variation with 9% samples could yield numbers as far from the full-survey numbers as those in the Shen/Björk article, although for total article count it’s still a pretty big stretch.

But that article was using closer to 5% samples—and they weren’t actually random samples. Could that explain the differences?

[More to come? Maybe, maybe not.]
