Archive for the ‘open access’ Category

One-third of the way there!

Sunday, November 22nd, 2015

With today’s French purchase of a PDF copy of The Gold OA Landscape 2011-2014, and including Cites & Insights Annual purchases, we’re now one-third of the way to the first milestone, at which I’ll upload an anonymized version of the master spreadsheet to figshare. (As with a previous German purchase, I can only assume the country based on Lulu country codes…)

Now an even dozen copies sold.

Lagniappe: The Rationales, Once Over Easy

Friday, November 13th, 2015

[This is the unexpected fourth part of PPPPredatory Article Counts: An Investigation. Before you read this, you should read the earlier posts—Part 1, Part 2 and Part 3—and, of course, the December 2014 Cites & Insights.]

Yes, I know, it’s hard to call it lagniappe when it’s free in any case. Still, I did spend some time doing a first-cut version of the third bullet just above: that is, did I find clear, cogent, convincing explanations as to why publishers were questionable?

I only looked at 223 multijournal publishers responsible for 6,429 journals and “journals” (3,529 of them actual gold OA journals actually publishing articles at some point 2011-2014) from my trimmed dataset (excluding DOAJ journals and some others). I did not look at the singleton journals; that would have more than doubled the time spent on this.

Basically, I searched Scholarly Open Access for each publisher’s name and read the commentary carefully—if there was a commentary. If there was one, I gauged whether it constituted a reasonable case for considering all of that publisher’s journals sketchy at the time the commentary was written, or whether it fell short of being conclusive but made a semi-plausible case. (Note the second italicized clause above: journals and publishers do change, but they’re only removed from the list after a mysterious appeals process.)

But I also looked at my own annotations for publishers—did I flag them as definitely sketchy or somewhat questionable, independently of Beall’s comments? I’m fairly tough: if a publisher doesn’t state its APCs or its policy or makes clearly-false statements or promises absurdly short peer review turnaround, those are all red flags.

Beall Results

For an astonishing 65% of the publishers checked, there was no commentary. The only occurrences of the publishers’ names were in the lists themselves.

The reason for this is fairly clear. Beall’s blog changed platforms in January 2012, and Beall did not choose to migrate earlier posts. These publishers—which account for 41% of the journals and “journals” in my analysis and 38% of the active Gold OA journals—were presumably earlier additions to the list.

This puts the lie to the claims of some Beall fans that he clearly explains why each publisher or journal is on the list, including comments from those who might disagree. That claim is simply not true for most of the publishers I looked at, representing 38% of the active journals, 23% of the 2014 articles, and 20% of the projected 2014 revenues.

My guess is that it’s worse than this. I didn’t attempt to find rationales for individual journals, but although those journals represent only 5% of the active journals I studied, they’re extremely prolific journals, accounting for 38% of 2014 articles (and 13% of 2014 potential revenue).

If Beall were serious about his list being a legitimate tool rather than a personal hobbyhorse, of course, there would be two ongoing lists (one for publishers, one for standalone journals) rather than an annual compilation—and each entry would have two portions: the publisher or journal name (with hyperlink), and a “Rationale” tab linking to Beall’s explanation of why the publisher or journal is there. (Those lists should be pages on the blog, not posts, and I think the latest ones are.) Adding such links to posts would be relatively trivial compared to the overall effort of evaluating publishers, and it would add considerable accountability.

In another 7% of cases, I couldn’t locate the rationale but can’t be sure there isn’t one: some publishers have names composed of such generic words that I could never be quite sure whether I’d missed a post. (The search box doesn’t appear to support phrase searches.) That 7% represents 4% of active journals in the Beall survey, 4% of 2014 articles, but only 1.7% of potential 2014 revenue.

Then there are the others—cases where Beall’s rationale is available. As I read the rationales, I conclude that Beall made a sufficiently strong case for 9% of the publishers, a questionable but plausible case for 11%, and, in my opinion, no real case for 9% of the publishers.

Those figures break out to active journals, articles and revenues as follows:

  • Case made—definitely questionable publishers: 22% of active journals, 11% of 2014 articles, 41% of 2014 potential revenues. (That final figure is particularly interesting.)
  • Questionable—possibly questionable publishers: 16% of active journals, 16% of 2014 articles, 18% of 2014 potential revenues.
  • No case: 14% of active journals, 7% of 2014 articles, 6% of 2014 potential revenues.

If I wanted to suggest an extreme version, I could say that I was able to establish a strong case for definitely questionable publishing for fewer than 12,000 published articles in 2014—in other words, less than 3% of the activity in DOAJ-listed journals.

But that’s an extreme version and, in my opinion, dead wrong, even without noting that it doesn’t allow for any of the independent journals (which accounted for nearly 40,000 articles in 2014) being demonstrably sketchy.

Combined Results

Here’s what I find when I combine Beall’s rationales with my own findings when looking at publishers, ignoring independent journals:

  • Definitely questionable publishers: Roughly 19% of 2014 articles, or about 19,000 within the subset studied, and 44% of potential 2014 revenue, or about $11.4 million. (Note that the article count is still only about 4% of serious OA activity—but if you add in all independent journals, that could go as high as 59,000, or 12%.) Putting it another way, about 31% of articles from multijournal publishers in Beall’s list were in questionable journals.
  • Possibly questionable publishers: Roughly 21% of 2014 articles (34% excluding independent journals) and 21% of 2014 potential revenues.
  • Case not made: Roughly 22% of 2014 articles (36% excluding independent journals) and 22% of 2014 potential revenues.

It’s possible that some portion of that 22% is sketchy but in ways that I didn’t catch—but note that the combined score is the worst of Beall’s rationale or my independent observations.

So What?

I’ve said before that the worst thing about the Shen/Björk study is that it’s based on a fatally flawed foundation, a junk list of one man’s opinions—a man who, it’s increasingly clear, dislikes all open access.

My attempts to determine Beall’s cases confirmed that opinion. In far too many cases, the only available case is “trust me: I’m Jeffrey Beall and I say this is ppppredatory.” Now, of course, I’ve agreed that every journal is ppppredatory, so it’s hard to argue with that—but easy to argue with his advice to avoid all such journals, except as a call to abandon journal publishing entirely.

Which, if you look at it that way, makes Jeffrey Beall a compatriot of Björn Brembs. Well, why not? In his opposition to all Gold OA, he’s already a compatriot of Stevan Harnad: the politics of access makes strange alliances.

Otherwise, I think I’d conclude that perhaps a quarter of articles in non-DOAJ journals are from publishers that are just…not in DOAJ. The journals may be serious OA, but the publishers haven’t taken the necessary steps to validate that seriousness. They’re in a gray area.

Monitoring the Field

Maybe this also says something about the desirability of ongoing independent monitoring of the state of gold OA publishing. When it comes to DOAJ-listed journals, my approach has been “trust but verify”: I checked to make sure the journals actually did make APC policies and levels clear, for example, and that they really were gold OA journals. When it comes to Beall’s lists, my approach was “doubt but verify”: I didn’t automatically assume the worst, but I’ll admit that I started out with a somewhat jaundiced eye when looking at these publishers and journals.

I also think this exercise says something about the need for full monitoring rather than sampling. The difference between even well-done sampling (and I believe Shen/Björk did a proper job) and full monitoring, in a field as wildly heterogeneous as scholarly journals, is just too large: about three to one, as far as I can tell.

As I’ve made clear, I’d be delighted to continue such monitoring of serious gold OA (as represented by DOAJ), but only if there’s at least a modest level of fiscal support. The door’s still open, whether for hired consultation, part-time employment, direct grants or indirect support through buying my books (at this writing, sales are still in single digits) or contributing to Cites & Insights. But I won’t begin another cycle on spec: that single-digit figure [barely two-digit figure, namely 10 copies] after two full months, with no apparent likelihood of any other support, makes it foolhardy to do so.

As for the rest of gold OA, the gray area and the questionable publishers, this might be worth monitoring, but I’ve said above that I’m not willing to sign up for another round based on Beall’s lists, and I don’t know of any other good way to do this.

PPPPredatory Article Counts: An Investigation Part 3

Wednesday, November 11th, 2015

If you haven’t read Part 1 and Part 2—and, to be sure, Cites & Insights December 2015—none of this will make much sense.

What would happen if I replicated the sampling techniques actually used in the study (to the extent that I understand the article)?

I couldn’t precisely replicate the sampling. My working dataset had already been stripped of several thousand “journals” and quite a few “publishers,” and I took Beall’s lists a few months before Shen/Björk did. (In the end, the number of journals and “journals” in their study was less than 20% larger than in my earlier analysis, although there’s no way of knowing how many of those journals and “journals” actually published anything. In any case, if the Shen/Björk numbers had been 20% or 25% larger than mine, I would have said “sounds reasonable” and let it go at that.)

For each tier in the Shen/Björk article, I took two samples, both using random techniques, and for all but Tier 4, I used two projection techniques—one based on the number of active true gold OA journals in the tier, one based on all journals in the tier. (For Tier 4, singleton journals, there’s not enough difference between the two to matter much.) In each tier, I used a sample size and technique that followed the description in the Shen/Björk article.
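The two per-tier projection techniques just described can be sketched in a few lines. This is a hypothetical illustration with invented numbers and an assumed field name (`articles`), not the survey data or tier definitions:

```python
# Sketch of the two per-tier projection techniques described above:
# scale the sample mean up by all journals in the tier, or scale the
# mean of active (article-publishing) journals up by the tier's count
# of active true gold OA journals. All data here is invented.
def project_tier(sample, tier_size, active_in_tier):
    total = sum(j["articles"] for j in sample)
    by_all = (total / len(sample)) * tier_size
    active = [j["articles"] for j in sample if j["articles"] > 0]
    by_active = (sum(active) / len(active)) * active_in_tier if active else 0.0
    return by_all, by_active

# toy sample: two empty "journals", two active ones
sample = [{"articles": 10}, {"articles": 0},
          {"articles": 30}, {"articles": 0}]
by_all, by_active = project_tier(sample, tier_size=100, active_in_tier=40)
# by_all = (40/4) * 100 = 1000.0; by_active = (40/2) * 40 = 800.0
```

The two estimates diverge exactly when the sample’s mix of empty and active journals differs from the tier’s true mix, which is why the distinction matters less for Tier 4 singletons.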

The results were interesting. Extreme differences between the lowest sample and the highest sample include 2014 article counts for Tier 2 (publishers with 10 to 99 journals), the largest group of journals and articles, where the high sample was 97,856 and the low—actually, in this case, the actual counted figure—was 46,770: that’s a 2.09 to 1 range. There’s also maximum revenue, where the high sample for Tier 2 was $30,327,882 while the low sample (once again the counted figure) was $9,574,648: a 3.17 to 1 range—in other words, a range wide enough to explain the difference between my figures and the Shen/Björk figures purely on the basis of sample deviation. (It could be worse: the 2013 projected revenue figures for Tier 2 range from a high of $41,630,771 to a low of $8,644,820, a range of 4.82 to 1! In this case, the actual sum was just a bit higher than the low sample, at $8,797,861.)

Once you add the tiers together, the extremes narrow somewhat. Table 7 shows the low, actual, and high total article projections, noting that the 2013, 2012, and 2011 low and high might not be the actual extremes (I took the lowest and highest 2014 figures for each tier, using the other figures from that sample.) It’s still a broad range for each year, but not quite as broad. (The actual numbers are higher than in earlier tables largely because journals in DOAJ had not been excluded at the time this dataset was captured.)

2014 2013 2012 2011
Low 134,980 130,931 92,020 45,605
Actual 135,294 115,698 85,601 54,545
High 208,325 172,371 136,256 84,282

Table 7. Article projections by year, stratified sample

The range for 2014 is 1.54 to 1: broad, but narrower than in the first two attempts. On the other hand, the range for maximum revenues is larger than in the first two attempts: 2.18 to 1 for 2014 and a very broad 2.46 to 1 for 2013, as in Table 8.

2014 2013
Low $30,651,963 $29,145,954
Actual $37,375,352 $34,460,968
High $66,945,855 $71,589,249

Table 8. Maximum revenue projections, stratified sample

Note that the high figures here are pretty close to those offered by Shen/Björk, whereas the high mark for projected article count is still less than half that suggested by Shen/Björk. (Note also that in Table 7, the actual counts for 2013 and 2012 are actually lower than the lowest combined samples!)

For the graphically inclined, Figure 4 shows the low, actual and high projections for the third sample. This graph is not comparable to the earlier ones, since the horizontal axis is years rather than samples.

Figure 4. Estimated article counts by year, stratified

It’s probably worth noting that, even after removing thousands of “journals” and quite a few publishers in earlier steps, it’s still the case that only 57% of the apparent journals were actual, active gold OA journals—a percentage ranging from 55% for Tier 1 publishers to 61% for Tier 3.


It does appear that, for projected articles, the stratified sampling methodology used by Shen/Björk may work better than using a pure random sample across all journals—but for projected revenues, it’s considerably worse.

This attempt could explain the revenue discrepancy, which in any case is a much smaller discrepancy (as noted, my average APC per article is considerably higher than Shen/Björk’s)—but it doesn’t fully explain the huge difference in article counts.

Overall Conclusions

I do not doubt that Shen/Björk followed sound statistical methodologies, which is quite different from agreeing that the Beall lists make a proper subject for study. The article didn’t identify the number of worthless articles or the amount spent on them; it attempted to identify the number of articles published by publishers Beall disapproved of in late summer 2014, which is an entirely different matter.

That set aside, how did the Shen/Björk sampling and my nearly-complete survey wind up so far apart? I see four likely reasons:

  • While Shen/Björk accounted for empty journals (but didn’t encounter as many as I did), they did not control for journals that have articles but are not gold OA journals. That makes a significant difference.
  • Sampling is not the same as counting, and the more heterogeneous the universe, the more that’s true. That explains most of the differences, I believe (on the revenue side, it can explain all of them).
  • The first two reasons, enhanced by two or three months of additional listings, combined to yield a much higher estimate of active journals than my survey: more than twice as many.
  • The second reason resulted in a much higher average number of articles per journal than in my survey (53 as compared to 36), which, combined with the doubled number of journals, neatly explains the huge difference in article counts.

The net result is that, while Shen/Björk carried out a plausible sampling project, the final numbers raise needless alarm about the extent of “bad” articles. Even if we accept that all articles in these projections are somehow defective, which I do not, the total of such articles in 2014 appears to be considerably less than one-third of the number of articles published in serious gold OA journals (that is, those in DOAJ)—not the “nearly as many” the study might lead one to assume.

No, I do not plan to do a followup survey of publishers and journals in the Beall lists. It’s tempting in some ways, but it’s not a good use of my time (or anybody else’s time, I suggest). A much better investigation of the lists would focus on three more fundamental issues:

  • Is each publisher on the primary list so fundamentally flawed that every journal in its list should be regarded as ppppredatory?
  • Is each journal on the standalone-journal list actually ppppredatory?
  • In both cases, has Beall made a clear and cogent case for such labeling?

The first two issues are far beyond my ken; as to the first, there’s a huge difference between a publisher having some bad journals and it making sense to dismiss all of that publisher’s journals. (See my longer PPPPredatory piece for a discussion of that.)

Then there’s that final bullet…

[In closing: for this and the last three posts—yes, including the Gunslingers one—may I once again say how nice Word’s post-to-blog feature is? It’s a template in Word 2013, but it works the same way, and works very well.]

PPPPredatory Article Counts: An Investigation Part 2

Monday, November 9th, 2015

If you haven’t already done so, please read Part 1—otherwise, this second part of an eventual C&I article may not make much sense.

Second Attempt: Untrimmed List

The first five samples in Part 1 showed that even a 20% sample could yield extreme results over a heterogeneous universe, especially if the randomization was less than ideal.

Given that the most obvious explanation for the data discrepancies is sampling, I thought it might be worth doing a second set of samples, this time each one being a considerably smaller portion of the universe. I decided to use the same sample size as in the Shen/Björk study, 613 journals—and this time the universe was the full figshare dataset Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. I assigned RAND() on each row, froze the results, then sorted by that column. Each sample was 613 journals; I took 11 samples (leaving 205 journals unsampled but included in the total figures). I adjusted the multipliers.
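The resampling procedure above can be sketched roughly as follows. This is a minimal illustration with synthetic rows, not the figshare dataset; the field name `articles` and a seeded `random.Random` (standing in for Excel’s frozen RAND() column) are assumptions:

```python
import random

# Sketch of the procedure described above: shuffle the rows once (the
# "assign RAND(), freeze, sort" step), slice fixed-size samples, and
# project each sample's article total with a single multiplier of
# universe_size / sample_size. Rows and field names are invented.
def draw_projections(rows, sample_size, n_samples, seed=0):
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # frozen random order
    multiplier = len(rows) / sample_size
    return [
        sum(r["articles"] for r in
            shuffled[i * sample_size:(i + 1) * sample_size]) * multiplier
        for i in range(n_samples)
    ]
```

With 6,948 rows, a sample size of 613 and 11 samples, this leaves 205 rows unsampled, which (as in the text) are counted only in the full totals.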

More than half of the rows in the full dataset have no articles (and no revenue). You could reasonably expect extremely varied results—e.g., it wouldn’t be improbable for a sample to consist entirely of no-article journals or of all journals with articles (thus yielding numbers more than twice as high as one might expect).

In this case, the results have a “dog that did not bark in the night” feel to them. Table 3 shows the 11 sample projections and the total article counts.

Sample 2014 2013 2012 2011
6 88,165 72,034 40,801 20,473
10 91,186 75,025 50,820 31,523
5 95,338 93,886 56,047 27,893
4 97,313 80,978 51,343 36,039
1 99,956 97,153 83,606 52,983
2 105,967 87,468 50,617 20,880
7 106,693 72,658 40,119 29,055
Total 121,311 99,994 64,325 34,543
9 127,747 100,653 73,326 32,075
3 140,292 122,128 77,958 36,634
8 154,754 114,360 79,323 35,632
11 160,591 143,312 91,011 53,579

Table 3. Article projections by year, 9% samples

Although these are much smaller samples (percentagewise) over a much more heterogeneous dataset, the range of results, while certainly wider than for samples 6-10 in the first attempt, is not dramatically so. Figure 3 shows the same data in graphic form (using the same formatting as Figure 1 for easy comparison).

Figure 3. Estimated article counts by year, 9% sample

The maximum revenue samples show a slightly wider range than the article count projections: 2.01 to 1, as compared to 1.82 to 1. That’s still a fairly narrow range. Table 4 shows the figures, with samples in the same order as for article projections (Table 3).

Sample 2014 2013
6 $27,904,972 $24,277,062
10 $32,666,922 $27,451,802
5 $19,479,393 $20,980,689
4 $24,975,329 $25,507,720
1 $30,434,762 $30,221,463
2 $30,793,406 $25,461,851
7 $30,725,482 $21,497,760
Total $31,863,087 $28,537,554
9 $29,642,696 $24,386,137
3 $39,104,335 $41,415,454
8 $36,654,201 $29,382,149
11 $35,420,001 $34,710,583

Table 4. Estimated Maximum Revenue, 9% samples

As with maximum revenue, so with cost per article: a broader range than for the last five samples (and total) in the first attempt, but a fairly narrow range, at 1.75 to 1, as shown in Table 5.

Sample 2014 2013
6 $316.51 $337.02
10 $358.25 $365.90
5 $204.32 $223.47
4 $256.65 $315.00
1 $304.48 $311.07
2 $290.59 $291.10
7 $287.98 $295.88
Total $262.66 $285.39
9 $232.04 $242.28
3 $278.73 $339.12
8 $236.85 $256.93
11 $220.56 $242.20

Table 5. APC per article, 9% samples and total

Rather than providing redundant graphs, I’ll provide one more table: the average (mean) articles per journal (ignoring empty journals), in Table 6.

Sample 2014 2013 2012 2011
6 27.85 20.59 20.66 16.79
10 29.35 20.75 22.73 23.10
1 30.06 25.54 38.13 38.41
5 30.26 27.63 27.18 20.88
4 31.46 22.86 23.42 29.90
2 33.94 24.79 25.08 15.14
7 34.66 20.68 20.17 22.48
Total 36.80 27.47 30.08 25.51
3 42.01 34.90 38.63 27.13
9 42.10 29.75 35.82 26.30
8 43.86 31.25 38.20 26.39
11 47.88 40.12 47.13 38.04

Table 6. Average articles per journal, 9% samples

Note that Table 6 is arranged from lowest average in 2014 to highest average; the rows are not (quite) in the same order as in Tables 3-5. The range here is 1.72 to 1, an even narrower range. On the other hand, sample 11 does show an average articles per journal figure that’s not much below the Shen/Björk estimate.

One More Try

What would happen if I assigned a new random number (again using RAND()) in each row and reran the eleven samples?

The results do begin to suggest that the difference between my nearly-full survey and the Shen/Björk study could be due to sample variation. To wit, this time the article totals range from 64,933 to 169,739, a range of 2.61 to 1. The lowest figure is less than half the actual figure, so it’s not entirely implausible that a sample could yield a number three times as high.

The total revenue range is also wider, from $22.7 million to $41.3 million, a range of 1.82 to 1. It’s still a stretch to get to $74 million, but not as much of a stretch. And in this set of samples, the cost per article ranges from $169.22 to $402.89, a range of 2.38 to 1. I should also note that at least one sample shows a mean articles-per-journal figure of 51.5, essentially identical to the Shen/Björk figure, and that $169.22 is similar to the Shen/Björk figure.
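That kind of swing is easy to reproduce in miniature. The toy simulation below uses an invented, heavily skewed distribution of article counts (nothing from the actual dataset) to show how far roughly-9% samples of a heterogeneous universe can land from the true total:

```python
import random

# Toy Monte Carlo (invented numbers): repeatedly draw a small random
# sample from a skewed universe, project a total from each, and report
# the lowest and highest estimates seen across the trials.
def sample_range(universe, sample_size, n_trials, seed=1):
    rng = random.Random(seed)
    mult = len(universe) / sample_size
    estimates = [sum(rng.sample(universe, sample_size)) * mult
                 for _ in range(n_trials)]
    return min(estimates), max(estimates)

# many empty journals, many small ones, a handful of huge ones
universe = [0] * 400 + [10] * 250 + [40] * 40 + [1200] * 3
lo, hi = sample_range(universe, sample_size=62, n_trials=50)
# The true total is sum(universe) = 7,700. Estimates typically swing
# widely around it, because everything hinges on whether a given
# sample happens to catch one of the three 1,200-article journals.
```

The same mechanism plausibly drives the wide ranges in the tables above: a few very prolific journals dominate the totals, so small samples over- or under-shoot depending on whether they capture them.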


Sampling variation with 9% samples could yield numbers as far from the full-survey numbers as those in the Shen/Björk article, although for total article count it’s still a pretty big stretch.

But that article was using closer to 5% samples—and they weren’t actually random samples. Could that explain the differences?

[More to come? Maybe, maybe not.]

PPPPredatory Article Counts: An Investigation, Part 1

Monday, November 9th, 2015

If you read all the way through the December 2015 essay Ethics and Access 2015 (and if you didn’t, you really should!), you may remember a trio of items in The Lists! section relating to “‘Predatory’ open access: a longitudinal study of article volumes and market characteristics” (by Cenyu Shen and Bo-Christer Björk in BMC Medicine). Briefly, the two scholars took Beall’s lists, looked at 613 journals out of nearly 12,000, and concluded that “predatory” journals published 420,000 articles in 2014, a “stunning” increase from 50,000 articles in 2010—and that there were around 8,000 “active” journals that seemed to meet Jeffrey Beall’s criteria for being PPPPredatory (I’m using the short form).

I was indeed stunned by the article—because I had completed a full survey of the Beall lists and found far fewer articles: less than half as many. Indeed, I didn’t think there were anywhere near 8,000 active journals either—if “active” means “actually publishing Gold OA articles,” I’d put the number at roughly half that.

The authors admitted that the article estimate was just that—that it could be off by as much as 90,000. Of course, news reports didn’t focus on that: they focused on the Big Number.

Lars Bjørnshauge at DOAJ questioned the numbers and, in commenting on one report, quoted some of my own work. I looked at that work more carefully and concluded that a good estimate for 2014 was around 135,000 articles, or less than one-third of the Shen/Björk number—and my estimate was based on a nearly 100% actual count, not an estimate from around 6% of the journals.

As you may also remember, Björk dismissed these full-survey numbers with this statement:

“Our research has been carefully done using standard scientific techniques and has been peer reviewed by three substance editors and a statistical editor. We have no wish to engage in a possibly heated discussion within the OA community, particularly around the controversial subject of Beall’s list. Others are free to comment on our article and publish alternative results, we have explained our methods and reasoning quite carefully in the article itself and leave it there.”

I found that response unsatisfying (and find that I’ll approach Björk’s work with a much more jaundiced eye in the future). As I expected, the small-sample report continued (continues?) to get wider publicity, while my near-complete survey got very little.

The situation continued to bother me: I don’t doubt that the authors followed appropriate methodology, yet I wondered how the results could be so wrong. How could they come up with more than twice as many active OA PPPPredatory journals and more than three times as many articles?

So I thought I’d look at my own work a little more, to see whether sampling could account for the wild deviation.

First Attempt: The Trimmed List

I began by taking my own copy of Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. The keys on each row of that 6,948-row spreadsheet are designed to be random. The spreadsheet includes not only the active Gold OA journals but also 3,673 others, to wit:

  • 2,045 that had not published any articles between 2011 and 2014, including eight that had explicitly ceased.
  • 183 that were hybrid journals, not gold OA.
  • 413 that weren’t really OA by my standards.
  • 279 that were difficult to count (more on those later).
  • 753 that were either unreachable or wholly unworkable.

There were two additional exclusions: I deleted around 1,100 journals (at least 300 of them empty) from publishers that wouldn’t provide hyperlinked lists of their journal titles—and I deleted journals that are in DOAJ because there were even more reasons than usual to doubt the PPPPredatory label. (Note that the biggest group of that double-listed category, MDPI, has more recently been removed from Beall’s list.)

I wound up with 3,275 active gold OA journals, what I’ll call “secondary OA journals,” since I think of the DOAJ members as “serious OA journals” and don’t have a good alternative term.

As I started reworking the numbers, I thought there should be some accounting for the opaque publishers and journals. In practice, I knew from some extended sampling that most journals from opaque publishers were either empty or very small—and my sense is that most opaque journals (usually opaque because there are no online tables of contents, only downloadable PDF issues, but sometimes because there really aren’t streams of articles as such) are also fairly small. But still, they should be included. Since these two groups (excluding the 300-odd journals from opaque publishers that I knew were empty) added up to 32% of the count of active journals, I multiplied article and revenue counts by 1.32. (I think this is too high, but feel it’s better to err on the side that will get closer to the Shen/Björk numbers.)

I did not factor in the DOAJ-included numbers, but the total of those and other already-counted additional articles (doubling 2014 since I only counted January-June) is around 43,000 for 2014; around 39,000 for 2013; around 37,000 for 2012; and around 28,000 for 2011. You can add them to the counts below if you wish—although I don’t believe these represent questionable articles.


Since 613 was the sample size in the Shen/Björk article, I took a similar size sample as a starting point, then adjusted it so I could take five samples that would, among them, include everything: that is, a sample size of 655 journals.

For each sample (sorting by the pseudorandom key, then starting from the beginning and working my way down), I took the article count for each year, multiplying by appropriate factors, and the revenue counts for 2013 and 2014 (determined by multiplying the 2014 APC by the annual article counts, then applying the appropriate multipliers—I didn’t go back before 2013 because APCs were too likely to have changed). I calculated average APC per article for 2014 and 2013 by straight division—and also calculated the average article count (not including zero-count journals because the cells were blank rather than zero) and median article count (also excluding zero-count journals). I also calculated standard deviation just for amusement.

“Zero-count journals? Didn’t you eliminate zero-count journals?” I eliminated journals that had no articles in any year 2011-2014, but quite a few journals have articles in some years and not in others—including, of course, newish journals. For example, there were only 2,393 journals with articles in the first half of 2014; 2,714 in 2013; 1,557 in 2012 and 996 in 2011.

I also calculated the same figures for the full set.

Looking at the results, I was a little startled by the wide range, given that these samples were 20% of the whole: the 2014 projected article totals (doubling actual article counts, of course) ranged from 5,755 to 180,299! Now, of course, even that highest count is still much less than half of the Shen/Björk count—and just a bit over half if you add in the DOAJ-listed count.

So I added another column and assigned a random number to each row, using Excel’s RAND function, then froze the results and took a new set of five samples. The results were much narrower in range: 99,713 to 136,660. The actual total: 121,311 (including the 1.32 multiplier but not DOAJ numbers).

Table 1 shows the projected (or actual) article totals year-by-year and sample-by-sample, sorted so the lowest 2014 projection appears first. Note that samples 1-5 use the assigned pseudorandom keys, while samples 6-10 use Excel RAND function for randomization. Clearly, the latter yields more plausible results.

Sample 2014 2013 2012 2011
4 5,755 21,734 15,959 10,223
5 91,067 85,734 66,594 51,473
8 99,713 84,797 55,209 33,733
7 115,368 91,964 57,664 27,595
Total 121,311 99,994 64,325 34,543
6 123,050 104,808 57,295 22,605
9 131,762 106,181 82,790 53,869
10 136,660 112,220 68,666 34,914
3 159,284 121,097 75,933 27,628
1 170,148 138,890 87,371 56,027
2 180,299 132,515 75,768 27,364

Table 1. Estimated article counts by year

Adding the 43,000-odd articles from DOAJ-listed journals would bring these totals (ignoring samples 1-5) to around 143,000 to around 180,000 articles, with the most likely value around 165,000 articles: more than one-third of the Shen/Björk estimate but a lot less than half.

Note that “120,000 plus or minus 25,000” as an estimate actually covers all five samples that used the RAND-function randomization. Figure 1 shows the same data as Table 1, but in graphic form.

Figure 1. Estimated article counts by year

How much revenue might those articles have brought in, and what’s the APC per article? Keeping the order of samples the same as for Table 1 and Figure 1, Table 2 and Figure 2 show the maximum revenue (not allowing for waivers and discounts).

Sample          2014          2013
4         $2,952,893   $10,473,269
5         $1,677,496    $3,322,988
8        $30,184,480   $23,906,771
7        $35,939,416   $35,825,909
Total    $31,863,087   $28,537,554
6        $31,010,206   $27,926,897
9        $31,165,754   $29,071,218
10       $31,015,578   $25,956,975
3        $82,610,167   $65,930,614
1        $34,247,360   $32,892,328
2        $37,827,517   $30,068,570

Table 2. Estimated maximum revenue, 2014 and 2013

This time there are two extremely low figures and one extremely high figure—with samples 6 through 10 all within $4.1 million of the actual maximum figure for 2014 (for 2013, the largest deviation is $7.3 million). Compare the $31.86 million calculated costs here with the $74 million estimated by Shen/Björk: the full-survey number is less than half as much.

Figure 2 shows the same information in graphical form.

Figure 2. Estimated maximum revenue, 2014 and 2013

Looking at APC per article, we run into an anomaly: where the Shen/Björk estimate is $178 for 2014, the calculated average for the full survey is considerably higher, $262.66. The range of the ten samples is from a low of $18.42 to a high of $513.08, but the five “good” samples range from $226.95 to $302.71, a reasonably narrow range.
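The per-article figure is simply projected revenue divided by projected articles. A quick sketch using the totals from Tables 1 and 2, for the full survey, the lowest sample, and the highest of the "good" RAND-keyed samples:

```python
# (2014 articles, 2014 maximum revenue) pairs from Tables 1 and 2.
full_survey = (121_311, 31_863_087)   # the actual full-survey totals
sample_5    = (91_067, 1_677_496)     # a low outlier among samples 1-5
sample_8    = (99_713, 30_184_480)    # high end of the RAND-keyed samples

def apc_per_article(articles, revenue):
    """Average maximum APC per article, rounded to cents."""
    return round(revenue / articles, 2)

print(apc_per_article(*full_survey))  # 262.66
print(apc_per_article(*sample_5))     # 18.42
print(apc_per_article(*sample_8))     # 302.71
```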

Finally, consider the mean (average) number of articles per journal in 2014, in journals that had articles. The Shen/Björk figure is around 50; my survey yields 36.8. In fact, I show only 327 journals with at least 25 articles in the first half of 2014 (and only 267 with at least 50 articles in all of 2013).

The median is even lower—12 articles, or six in the first half—and that’s not too surprising. The standard deviation in most years was at least twice the average: as usual, these journals are very heterogeneous. How heterogeneous? In the first half of 2014, three journals had more than 1,000 articles each (but fewer than 1,300); six more had at least 500 articles; 16 had 250 to 499 articles—but at the same time, only 819 of the total had at least 11 articles in the first half of 2014, and only 1,544 had at least five articles in those six months.


I could find no way to get from these samples to the Shen/Björk figures. Not even close. They show too many active journals by roughly a factor of two, too many articles by a factor of close to three, and too much revenue by a factor of two—and too many articles per journal as well.

[Part 1 of 2 or 3…]

Note: This and following posts will also appear, probably in somewhat revised form, in the January 2016 issue of Cites & Insights.

Linguistics, OA, $430 and $1,400–and a bit about The Gold OA Landscape 2011-2014

Thursday, November 5th, 2015

I thought it might be interesting to glance at some existing gold OA journals at least partly devoted to linguistics in light of editorial goings-on at a notable subscription “hybrid” journal in the field.

This is a very incomplete group: it’s only journals I’d grouped into Language & Literature and that showed “linguis” somewhere within the DOAJ record (usually in the subject or keyword fields). That omits journals partly devoted to linguistics that fell into any number of other primary subject areas such as anthropology. But it’s a start…

The Basic Numbers

This group consists of 275 journals (including only those graded “A” and “B” in The Gold OA Landscape 2011-2014). The journals published 5,954 articles in 2011; 6,725 in 2012; 6,973 in 2013; and a slight drop to 6,415 in 2014.

Article Processing Charges

Twelve of the 275 journals have article processing charges; the remaining 263 are funded through other means.

Those twelve journals did publish more articles per journal than the others: in total, 1,007 in 2011; 1,298 in 2012; 1,418 in 2013; and 1,493 in 2014.

APCs range from $37 to $600, but only one journal charged more than $400 and only three charged more than $300. (The only fee-charging journal with more than 200 articles in 2014 charged $40.)

The maximum paid for APCs in the twelve fee-charging journals in 2014 was $364,146; that comes out to a weighted average of $244 per fee-charged article. (Spread across all 6,415 articles published by the full group of 275 journals in 2014, it's $56.76 per article.)
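Both averages come straight from dividing the $364,146 maximum by the relevant article count; for the record:

```python
max_apc_revenue = 364_146  # maximum 2014 APC revenue, twelve fee-charging journals
fee_articles    = 1_493    # 2014 articles in those twelve journals
all_articles    = 6_415    # 2014 articles across all 275 journals in the group

per_fee_article = max_apc_revenue / fee_articles
per_any_article = max_apc_revenue / all_articles

print(round(per_fee_article))     # 244
print(round(per_any_article, 2))  # 56.76
```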

Grades and Fees

Of the 263 no-fee journals, 250 don’t have any obvious problems. Of the thirteen graded B, two have problematic English; three have garish sites or other site problems; one features a questionable impact factor; six have minimal information; one had other issues.

Of the dozen fee-charging journals, seven don’t have obvious problems. Of the five graded B (obviously a much higher percentage than for no-fee journals), one has a questionable impact factor and four make questionable claims–actually, the same questionable claim in all four cases: they claim to be Canadian but show no indication of significant Canadian editorial involvement.

Anyway…that’s a little information about a few existing gold OA journals that are at least partially devoted to linguistics.

The Gold OA Landscape 2011-2014: Language and Literature

Just a few notes in addition to what’s in the excerpted version–hoping this might encourage a few people and libraries to buy the paperback or site-licensed PDF, or find ways to help me continue this research.

  • Most journals in this field are small, even by the standards of humanities and social sciences: 350 published 18 articles or fewer in 2014, as compared to 91 with 19 to 30 articles, 51 with 31 to 50 articles, 24 with 51 to 120 articles…and eight journals with more than 120 articles in 2014. (Seven of those eight journals charge APCs–but the one that doesn't charge published one-quarter of all the articles in the big eight journals.)
  • Journals in 55 countries published articles in 2014. Only one country–Brazil–accounted for 1,000 or more articles. The United States and Canada followed (with more than 900 articles each–although that includes the Canadian journals that aren't very Canadian). Spain was the only other country with more than 660 articles.

As always, there’s more in the book.

Quick status report: as of this morning (November 5, 2015):

  • At least 2,306 downloads of the Cites & Insights issue have happened
  • Seven copies of the book have been purchased, in addition to my own copy: Six paperback, one PDF ebook. That’s one copy for every three hundred downloads. [Note added November 6, 2015: PDF ebook sales have now doubled–another copy was purchased. Total sales are still single-digit, but it’s progress.]




Cites & Insights 15:11 (December 2015) available

Monday, November 2nd, 2015

The December 2015 issue of Cites & Insights (15:11) is now available for downloading at

This issue is 58 pages long. If you plan to read it online or on an ereader (ebook, tablet, whatever), you may prefer the single-column 6″ x 9″ edition, 111 pages long, at

This issue contains one essay:

Intersections: Ethics and Access 2015  pp. 1-58

No weird old tricks for reducing belly fat, but 102 items worth reading in a baker’s dozen of subtopics related to ethics and access (open and otherwise)–and #25 may astonish you! Or not.

No, it’s really not a listicle–otherwise I’d have to find 102 ads and free (or plagiarized) illustrations. It’s a bigger-than-usual roundup, with just a little humor (and a few exclamation points–and one interrobang).


Gold OA: the basis for going on (2 of 2)

Tuesday, October 27th, 2015

I’ll keep this one relatively short, as it’s about more direct appreciation of the gold OA research: namely, money. I’ve already responded to two people who might, conceivably, have money available for this research (neither one even suggested that it could happen), giving the amount I’d want–so I might as well be up-front and provide the options here.

1. The Donations + Purchases Route: Milestones

  • $1,500 total: the 2011-2014 spreadsheet, anonymized slightly, goes up on figshare.
  • $2,500 total: I give serious thought to renewing the project for 2015 data, using DOAJ’s journal list as of the first week of 2016.
  • $5,000 total: I’d definitely do the 2011-2015 version and make the spreadsheet available on figshare.

That total includes donations to Cites & Insights since the 2011-2014 project was announced and net proceeds from sales of all of my self-published books since September 1, 2015 (and, for that matter, the honorarium portion of expenses-paid speaking engagements related to this work, but I’m not holding my breath for any of those).

As previously noted, through right now, we’re more than one-third of the way but less than halfway to the first milestone.

(If the second milestone isn’t reached by April 2016, I don’t think this would happen–I’d have moved on to other things by then.)

2. Direct Grant Funding or Consulting Contract: Annual Costs

This is the set of numbers I sent back to two interested parties. It would cover another round of research, including rechecking APC status and amount for all listed journals, tweaking the grading criteria slightly, writing up the research, and making the anonymized spreadsheet available on figshare and the PDF version of the results available for free. (The paperback version would be priced at very close to production costs, quite probably less than $10.)

My price would be, at minimum, $0.50 per journal in DOAJ in the first week of 2016, plus $1,000 for the analysis/writeup phase. Right now, that would come to about $6,332.
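For scale: back-calculating from the quoted total gives an implied DOAJ journal count (an inference from the arithmetic, not a figure stated in the post):

```python
quoted_total = 6_332   # approximate minimum price quoted
analysis_fee = 1_000   # flat analysis/writeup fee
per_journal  = 0.50    # per-journal rate

# Implied number of DOAJ journals at the time of the quote.
implied_journals = (quoted_total - analysis_fee) / per_journal
print(implied_journals)  # 10664.0
```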

I’d be delighted to discuss this with any possible agency or agencies (actually, there’s one exception–not the one in Ohio–but I don’t think that’s likely to be an issue). If the money was secure before 2016, I could do some of the APC/site rechecking before 2016. If more discussion and tweaks are desired, the price might be higher.

Obviously, the sponsor(s) would or could have their names on the results or could even handle distribution.

3. Part-time Consulting Research

I believe this project will require at least 500 to 600 hours to do properly, so if somebody wanted to hire me as a quarter-time consulting researcher to carry on this project (for one or more years), I'd certainly consider it. (I'm assuming that nobody hiring a consultant or researcher in California pays less than $26,000/year for quarter-time work, esp. since full-time California minimum wage is likely to be the equivalent of $30,000/year before too long.)

Obviously, I’d expect to discuss possible expansions and tweaks, and the agency could release the report under its name, with me credited somewhere.

Oh, one more thing:

4. Redoing the Beall’s Lists Investigation

That would cost a lot of money because it’s neither interesting nor fun nor, I believe, especially useful. If someone was determined, I’d consider it for $1 per journal within Beall’s lists plus $2,000 for analysis and writeup–that is, a minimum of $13,000 (and going up all the time!). But I’d probably turn it down even then: life really is too short.

[Oh, by the way: if you’re interested in funding this research, contact me at]

Gold OA: The basis for going on (1 of 2)

Tuesday, October 27th, 2015

At this point–seven weeks after The Gold OA Landscape 2011-2014 was published–it seems like a good time to discuss the issues surrounding possible continuation of this full-survey research for another year (that is, covering 2015, done in 2016).

Part 2 will deal with finances: what it would take to make it happen.

This part deals with a related question: Since I’m not depending on this revenue to keep meals on the table or a roof over our heads, why do I need any revenue for it at all?

[No, nobody’s said that quite so flatly. Still: every time somebody says “there’s something wrong with charging for a writeup about open access or the research it took to do that writeup, because OA’s supposed to be free,” or something of the sort–which has happened every time I or ALA (or MIT) has published something on OA that carries a price–once I calm down, I turn it into the question above.]

Turns out, this is a philosophical question of sorts: Namely, what motivates me to do anything (other than lie around the house, do some housework, read books, watch TV, go for walks and like that)?

That question’s been clarified in my own mind over the years since it’s become clear that Cites & Insights itself is unlikely to attract significant contributions (the total has never reached the high three figures in a year, much less four figures). Here’s how I’ve worked it out in my own head, although I’m sure it’s an incomplete model.

I see four factors: Fun, Interest, Worth/Usefulness/Effectiveness, and Appreciation. Two are internal, two external.


Fun

I do some essays in Cites & Insights because they're fun or amusing to me. Certainly true of The Back, The Middle, and most Media essays (esp. old movies). That's part of why I started looking at liblogging, library blogging and library slogans (and, for that matter, library use of social media): it was fun.

“Fun” and “interesting” can overlap in slightly unpredictable ways. It was, initially, fun to unveil the realities behind Beall’s lists, and in some ways it’s been fun to see how well Chrome/Google does or does not translate non-English journal websites (and to appreciate some of the blank verse generated by some translations).


Interest

I have lots of interests, and I'll pursue an interest to what might possibly be considered extremes–I'm a completist in some areas. It has certainly been interesting to examine the Gold OA landscape in detail, and once I got well into it I realized that I wanted to see it through.

Interest certainly explains some ongoing features in Cites & Insights. I don’t find copyright discussions particularly amusing, but they’re interesting, just as one example.

But I have lots of interests, and could readily cultivate more. And time eventually does become a limiting factor. At this point, I don't expect to live for more than 30 years or so–possibly quite a bit less, probably not much more. (For a long time, I'd pegged 93 as my desirable stopping point; I've moved that to 98–which gives me 28 more years–as long as I'm in good mental and reasonable physical health. I have no desire to live to 103 or 108 or some extreme old age–but ask me again 20 years from now, I suppose.) There are a lot of books I'd like to read and quite a few I wouldn't mind rereading; there are a lot of movies I want to watch; I read and enjoy quite a few magazines (and one daily "paper"); there's a fair amount of TV I enjoy watching (although probably very little by most people's standards); lots of music to pay attention to; and… and… and…

So at a certain point I have to balance competing interests, especially since time is finite and some significant portion of it is taken up with household maintenance, family life, sleep (yes, I get 7.5 to 8 hours a day; no, I’m not willing to reduce that much), vacations, exercise and long walks/hikes, etc…

Balance isn’t much of an issue when I’m choosing a book that may take 4-5 hours to read or an essay that may take 5-10 hours to write. It’s a lot more of an issue when I’m contemplating a project that would probably take 500 to 600 hours over the course of six or seven months.

Which is to say: I find the ongoing story of gold OA interesting. Do I find it interesting enough to give up 500-600 hours per year of other stuff? Which brings us to:


When something’s fun and not too time-consuming, this and the final factor don’t come into play.

When it’s a question of balance and which projects are worth starting or continuing, this and the final factor definitely do come into play.

To wit: what is this worth (and how useful is it) to me and other people?

(Yes, this and the final factor overlap a lot. That’s how life is.)

I look at readership, citations, and things like that as indications of worth and usefulness. If an issue of C&I is only read 200 times over the course of three months, it apparently wasn’t found to be worthwhile or useful; if it’s read 2,000 times over three months, it apparently was worthwhile or useful.

Of course, worth can also have a financial aspect, which gets more into appreciation: do people find something sufficiently useful or worthwhile to pay for it?

I recognized that my series of books on liblogging had ceased to be worthwhile/useful about a year too late, when sales declined to pretty much nothing and readership for related C&I issues declined substantially. But I did eventually recognize it and stopped doing the series. (A ten-year recap might or might not happen; if it does, it will be at a “this might be fun/interesting” level, not a “people might be willing to buy this” level–there wouldn’t be a book.)

There have been other themes in Cites & Insights that have disappeared because it appeared that people didn’t find them useful or worthwhile. Indeed, I stopped doing individual HTML essays because there didn’t seem to be much demand for them (and it was clear nobody found them worthwhile enough to pay for) and they were never interesting or fun to do–while the single-column version of C&I has proven to be useful enough to keep doing.

As to effectiveness: that’s so hard to measure that I generally ignore it–but I do have to mention it within this discussion.

So how does the OA research fall on the interesting/worthwhile axis?

Journal Readership

Looking at OA-related issues of Cites & Insights over the past two years, including research-based ones and others, I find the following download numbers through this morning at 5:30 a.m. (but missing most of the last day of each month):

  • April 2014 (“The Sad Case of Jeffrey Beall”): 10,576, one of the highest total downloads figures ever — but in terms of effectiveness, I look at how often the lists continue to be used as the basis for policy or, sigh, “research,” and have to wonder whether there’s been any real effect at all.
  • May 2014 (“The So-Called Sting”): 4,126 downloads, a high figure.
  • July 2014 (“Journals, ‘Journals’ and Wannabes”): 5,121–a high figure, and since this was a full-issue essay, I can reasonably assume that the readership was entirely related to this essay.
  • August 2014 (“Access and Ethics 3”): 1,643, a decent-but-not-great figure.
  • October/November 2014 (“Journals and ‘Journals’: Taking a Deeper Look”): 1,704, another decent-but-not-great figure.
  • December 2014 (…Part 2): 1,669, another decent-but-not-great figure.
  • January 2015 (“The Third Half”): 2,783, a good solid figure, especially since it represents less than a year.
  • March 2015 (“One More Chunk of DOAJ”): 2,281, a good solid figure, but in this case the essay taking up most of the issue–“Books, E and P, 2014”–probably accounts for much of that, since that’s always been a hot topic.
  • April 2015 (“The Economics of Open Access”): 2,476, a good solid figure–and this one’s a single-essay issue.
  • June 2015 (“Who Needs Open Access Anyway?”): 1,595, a decent figure for five months.
  • July 2015 (“Thinking About Libraries and Access, Take 2”): 839 downloads–and this one’s a little disappointing because that essay was my own take on/beliefs about OA. This suggests that people are a lot less interested in what I think than in what I’ve found out through research. That’s OK, of course…but…
  • October 2015 (“The Gold OA Landscape 2011-2014”): 2,169 downloads in the first seven weeks or so, which I regard as very good numbers, especially for the first couple of months.



Appreciation

This shows up in citations elsewhere, tweets and the like, but also in donations and sales (and, heck, speaking invitations–one of the coins of the realm, but there haven't been any in a couple of years–certainly none related to this research).

When it comes to citations, I don’t have any real complaints; ditto tweets.

As for donations: still in the low three digits, and that was mostly when I was offering a free ebook and production-priced paperback. None since the project was completed (other than two very small recurring donations that are for C&I, not OA research).

As for sales…

Book Purchases

For the same period–the books appeared a couple of days before the October 2015 issue did–here’s what I show, not including my own copy: Seven paperback copies, one site-licensed PDF ebook. Total: Eight copies.

In other words, not even one-half of one percent of those who’ve downloaded the October 2015 issue have, so far, found the research sufficiently worthwhile to buy the full story.

Of course, there could be dozens, nay, hundreds of orders just waiting to go to Amazon or Ingram.

So where does this leave me? Wondering whether the effectiveness and demonstrated worth is enough to justify doing it again.

(If you’re wondering, I’d say total revenue counted toward this project–including all donations and all self-published book sales of any sort since September 1, 2015–is more than one-third of the way, but considerably less than halfway, toward being enough to make the anonymized spreadsheet available on figshare. It’s a bit more than one-fifth of the way toward making me think seriously about doing it again.)

Which brings us to Part 2, later today or maybe another day.



Gold OA: How many no-fee articles?

Monday, October 26th, 2015

Earlier this year, in a comment stream on a blog post about open access and fees, one commenter (from the commercial journal field) asked whether there were any actual numbers on how many articles were published in gold OA journals that don’t charge APCs or other author-side fees.

At the time, another commenter responded with my figures from the partial study of gold OA journals, the one that didn’t include journals without English-language interfaces. The total from 2012 through 2014 was around 470,000 articles.

The Gold OA Landscape 2011-2014 includes graphs showing free and paid article counts overall and for each segment and subject, and shows overall article counts and the percentage of free articles, making it easy to calculate approximate counts. But I didn't actually include the figures behind the graphs; that would have been redundant, and I was trying to keep the book as short as possible. And the excerpted version in Cites & Insights didn't include graphs at all.

So, for what it’s worth, here are some key figures for articles published in serious gold OA journals (those listed in DOAJ and graded A or B in my study) that do not charge APCs or other author-side fees.


Overall

Among the DOAJ journals included in The Gold OA Landscape 2011-2014 (grades A & B), 7,048 did not charge APCs. Those journals published 177,855 articles in 2011; 198,552 in 2012; 206,561 in 2013; and 206,588 in 2014. That's a total of 789,556 articles during the four-year period in serious gold OA journals without author-side fees.

Biology and Medicine

For this segment, there were 57,627 articles in 2011; 63,411 in 2012; 64,735 in 2013; and 66,057 in 2014, for a total of 251,830 articles during the four-year period in serious gold OA journals without author-side fees.

Science, Technology, Engineering and Math

Serious gold OA journals without APCs or other author-side fees in this segment published 52,892 articles in 2011; 59,593 in 2012; 64,637 in 2013; and 65,088 in 2014, for a total of 242,210 articles during the four-year period.

Humanities and Social Sciences

Serious gold OA journals without APCs or other author-side fees in this segment published 67,350 articles in 2011; 75,556 in 2012; 77,189 in 2013; and 75,443 in 2014, for a total of 295,538 articles during the four-year period.
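As a cross-check, each segment's four-year total is the sum of its yearly counts:

```python
# Yearly no-fee article counts (2011-2014) for each segment, as given above.
segments = {
    "Biology and Medicine":           [57_627, 63_411, 64_735, 66_057],
    "Science, Tech, Eng and Math":    [52_892, 59_593, 64_637, 65_088],
    "Humanities and Social Sciences": [67_350, 75_556, 77_189, 75_443],
}

totals = {name: sum(years) for name, years in segments.items()}
print(totals)
```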

Surprised that there were more no-fee articles in the humanities and social sciences than in either biomed or STEM? You shouldn’t be.

By the way, today (Monday, October 26, 2015) is the last day to get 30% off this or any other Lulu books using coupon code OCTFLASH30.