Archive for November, 2015

One-third of the way there!

Sunday, November 22nd, 2015

With today’s French purchase of a PDF copy of The Gold OA Landscape 2011-2014, and including Cites & Insights Annual purchases, we’re now one-third of the way to the first milestone, at which I’ll upload an anonymized version of the master spreadsheet to figshare. (As with a previous German purchase, I can only assume the country based on Lulu country codes…)

Now an even dozen copies sold.

One sale gone, another started: 25%

Friday, November 20th, 2015

When you go to buy my books, always check the Lulu home page for discounts. Just a reminder…

I’m guessing there will be a series of brief sales for a while, but can’t be sure. In the meantime:

SHOP25 as a coupon code gets you 25% off print books (and calendars, if you’re so inclined) from now through November 23, 2015.

Coupon codes are case sensitive.

Another reminder: you’re not decreasing my net revenue (counted toward future research) by using these sale codes–I get the same net revenue.

For various reasons, I took a look yesterday at all-time Lulu sales (it takes me one minute to generate that spreadsheet and not much longer to go through it). I noticed something that, because it’s at such a low level, had slipped my attention.

To wit: yes, occasionally somebody does buy a Cites & Insights Annual edition. Excluding my own copies, there have been sixteen such sales over the years, with the most being 2007 (4 copies) and 2008 (3 copies); the only one with no outside sales to date is the latest, 2015. Since I produce these so I’ll have my own copy (if I include cost of paper and inkjet ink, it’s actually cheaper for me to buy one at my author’s price than it is to print out a new copy of each issue and have Fedex Kinko’s bind it in an ugly Velobind binding–and the result is both more handsome and more usable), this is a nice extra. Of course, it’s also a great way to have past issues on hand…

Five thousand pages!

Thursday, November 19th, 2015

I maintain a little spreadsheet to track word and page counts for Cites & Insights [with the slightly-out-of-date name “first10 length”]. I print it out every month or two, but I don’t look at it very often.

And I missed a milestone of sorts: through the December 2015 issue (not including phantom issues that are only in the annual paperbacks), C&I has passed the 5,000-page mark: in all, 5,002 pages. (If you’re wondering, the longest volume was volume 9, 2009, with 418 pages; the shortest were volume 1 [252 pages including the preview issue], volume 2 [262 pages], and volume 11 [274 pages: the year C&I almost shut down for good].)

Word count’s not at a milestone; it should hit four million words in two to four months.

No deeper meaning; just marking a wordy milestone. It’s a handsome set of paperbacks on one of my bookshelves–although the first five volumes are sort of ugly, being Velobound things produced at Kinko’s. In case you weren’t aware, volumes 6 through 15 are all available, $45 each [with occasional Lulu discounts: check the front page], with roughly half the proceeds going to continue C&I and my OA research. Oh, and on most of them you get a huge photo from our travels–all of them have such photos, but in all but two the photo’s a wraparound, 11″ high and close to 18″ wide. More information here.

Cites & Insights Annual, Two-Day Sale and a Non-Update

Wednesday, November 18th, 2015

I have it down to do another teaser post to help convince folks that there’s loads of great stuff in The Gold OA Landscape 2011-2014, either paperback or site-licensed PDF ebook–but given that there’s only been one copy sold in November to date, and indeed only one since October 22, maybe that’s a waste of my energy.

That’s the non-update: the total continues to be nine paperback copies and two PDF ebooks, with five copies showing up in Worldcat.org. Special arrangements (grants, donations, consulting, etc.) unchanged.

Meanwhile: if you do want the paperback–or any or all of my other self-published books–you can buy them today and tomorrow (November 19, 2015) for 20% off using the coupon code PRESALE20.

[Any time you do buy stuff at Lulu, check the home page: it should show current offers.]

And then there’s the Cites & Insights Annual edition for 2015; I’ve now received my copy (and modified the cover, since the title was a little too far down the page).

Here’s the skinny:

Volume 15 is 354 pages long (including table of contents and indices) and, as usual, $45 (or $36 today and tomorrow).

Highlights of this 11-issue volume include:

  • Three full-issue essays related to Open Access: Economics, The Gold OA Landscape 2011-2014, and Ethics
  • A fair use trilogy: Google Books, HathiTrust and miscellaneous topics
  • More pieces of the OA puzzle, mostly leading up to The Gold OA Landscape
  • The usual: Deathwatch, Ebooks & Pbooks; a eulogy to FriendFeed and some notes on Twitter; and more

And the indices that aren’t otherwise available.
The photo: the library at Ephesus–a familiar view if you own Public Library Blogs: 252 Examples, but this is a slightly different photo and a considerably larger view.

Oops: while Public Library Blogs: 252 Examples used a different picture of The Library At Ephesus, The Liblog Landscape 2007-2010 used the same picture–but much larger, with a little more touchup, and using Paint.net’s auto-equalization, which yielded a slightly different color range.

Lagniappe: The Rationales, Once Over Easy

Friday, November 13th, 2015

[This is the unexpected fourth part of PPPPredatory Article Counts: An Investigation. Before you read this, you should read the earlier posts—Part 1, Part 2 and Part 3—and, of course, the December 2015 Cites & Insights.]

Yes, I know, it’s hard to call it lagniappe when it’s free in any case. Still, I did spend some time doing a first-cut version of the third bullet just above: that is, did I find clear, cogent, convincing explanations as to why publishers were questionable?

I only looked at 223 multijournal publishers responsible for 6,429 journals and “journals” (3,529 of them actual gold OA journals actually publishing articles at some point 2011-2014) from my trimmed dataset (excluding DOAJ journals and some others). I did not look at the singleton journals; that would have more than doubled the time spent on this.

Basically, I searched Scholarly Open Access for each publisher’s name and read the commentary carefully—if there was a commentary. If there was one, I gauged whether it constituted a reasonable case for considering all of that publisher’s journals sketchy at the time the commentary was written, or if it fell short of being conclusive but made a semi-plausible case. (Note the second italicized clause above: journals and publishers do change, but they’re only removed from the list after a mysterious appeals process.)

But I also looked at my own annotations for publishers—did I flag them as definitely sketchy or somewhat questionable, independently of Beall’s comments? I’m fairly tough: if a publisher doesn’t state its APCs or its policy or makes clearly-false statements or promises absurdly short peer review turnaround, those are all red flags.

Beall Results

For an astonishing 65% of the publishers checked there was no commentary. The only occurrences of the publishers’ names were in the lists themselves.

The reason for this is fairly clear. Beall’s blog changed platforms in January 2012, and Beall did not choose to migrate earlier posts. These publishers—which account for 41% of the journals and “journals” in my analysis and 38% of the active Gold OA journals—were presumably earlier additions to the list.

This puts the lie to the claims of some Beall fans that he clearly explains why each publisher or journal is on the list, including comments from those who might disagree. That claim is simply not true for most of the publishers I looked at, representing 38% of the active journals, 23% of the 2014 articles, and 20% of the projected 2014 revenues.

My guess is that it’s worse than this. I didn’t attempt to find rationales for individual journals, but although those journals only represent 5% of the active journals I studied, they’re extremely prolific journals, accounting for 38% of 2014 articles (and 13% of 2014 potential revenue).

If Beall were serious about his list being a legitimate tool rather than a personal hobbyhorse, of course, there would be two ongoing lists (one for publishers, one for standalone journals) rather than an annual compilation—and each entry would have two portions: the publisher or journal name (with hyperlink) and a “Rationale” link pointing to Beall’s explanation of why the publisher or journal is there. (Those lists should be pages on the blog, not posts, and I think the latest ones are.) Adding such links to the rationale posts would be relatively trivial compared to the overall effort of evaluating publishers, and it would add considerable accountability.

In another 7% of cases, I couldn’t locate the rationale but can’t be sure there isn’t one: some publishers have names composed of such generic words that I could never be quite sure whether I’d missed a post. (The search box doesn’t appear to support phrase searches.) That 7% represents 4% of active journals in the Beall survey, 4% of 2014 articles, but only 1.7% of potential 2014 revenue.

Then there are the others—cases where Beall’s rationale is available. As I read the rationales, I conclude that Beall made a sufficiently strong case for 9% of the publishers, a questionable but plausible case for 11%–and, in my opinion, no real case for 9% of the publishers.

Those figures break out to active journals, articles and revenues as follows:

  • Case made—definitely questionable publishers: 22% of active journals, 11% of 2014 articles, 41% of 2014 potential revenues. (That final figure is particularly interesting.)
  • Questionable—possibly questionable publishers: 16% of active journals, 16% of 2014 articles, 18% of 2014 potential revenues.
  • No case: 14% of active journals, 7% of 2014 articles, 6% of 2014 potential revenues.

If I wanted to suggest an extreme version, I could say that I was able to establish a strong case for definitely questionable publishing for fewer than 12,000 published articles in 2014—in other words, less than 3% of the activity in DOAJ-listed journals.

But that’s an extreme version and, in my opinion, dead wrong, even without noting that it doesn’t allow for any of the independent journals (which accounted for nearly 40,000 articles in 2014) being demonstrably sketchy.

Combined Results

Here’s what I find when I combine Beall’s rationales with my own findings when looking at publishers, ignoring independent journals:

  • Definitely questionable publishers: Roughly 19% of 2014 articles, or about 19,000 within the subset studied, and 44% of potential 2014 revenue, or about $11.4 million. (Note that the article count is still only about 4% of serious OA activity—but if you add in all independent journals, that could go as high as 59,000, or 12%.) Putting it another way, about 31% of articles from multijournal publishers in Beall’s list were in questionable journals.
  • Possibly questionable publishers: Roughly 21% of 2014 articles (34% excluding independent journals) and 21% of 2014 potential revenues.
  • Case not made: Roughly 22% of 2014 articles (36% excluding independent journals) and 22% of 2014 potential revenues.

It’s possible that some portion of that 22% is sketchy but in ways that I didn’t catch—but note that the combined score is the worst of Beall’s rationale or my independent observations.

So What?

I’ve said before that the worst thing about the Shen/Björk study is that it’s based on a fatally flawed foundation, a junk list of one man’s opinions—a man who, it’s increasingly clear, dislikes all open access.

My attempts to determine Beall’s cases confirmed that opinion. In far too many cases, the only available case is “trust me: I’m Jeffrey Beall and I say this is ppppredatory.” Now, of course, I’ve agreed that every journal is ppppredatory, so it’s hard to argue with that—but easy to argue with his advice to avoid all such journals, except as a call to abandon journal publishing entirely.

Which, if you look at it that way, makes Jeffrey Beall a compatriot to Björn Brembs. Well, why not? In his opposition to all Gold OA, he’s already a compatriot to Stevan Harnad: the politics of access makes strange alliances.

Otherwise, I think I’d conclude that perhaps a quarter of articles in non-DOAJ journals are from publishers that are just…not in DOAJ. The journals may be serious OA, but the publishers haven’t taken the necessary steps to validate that seriousness. They’re in a gray area.

Monitoring the Field

Maybe this also says something about the desirability of ongoing independent monitoring of the state of gold OA publishing. When it comes to DOAJ-listed journals, my approach has been “trust but verify”: I checked to make sure the journals actually did make APC policies and levels clear, for example, and that they really were gold OA journals. When it comes to Beall’s lists, my approach was “doubt but verify”: I didn’t automatically assume the worst, but I’ll admit that I started out with a somewhat jaundiced eye when looking at these publishers and journals.

I also think this exercise says something about the need for full monitoring, rather than sampling. The difference between even well-done sampling (and I believe Shen/Björk did a proper job) and full monitoring, in a field as wildly heterogeneous as scholarly journals, is just too large: about three to one, as far as I can tell.

As I’ve made clear, I’d be delighted to continue such monitoring of serious gold OA (as represented by DOAJ), but only if there’s at least a modest level of fiscal support. The door’s still open, whether for hired consultation, part-time employment, direct grants or indirect support through buying my books or contributing to Cites & Insights. But I won’t begin another cycle on spec: that barely-two-digit sales figure (10 copies at this writing) after two full months, with no apparent likelihood of any other support, makes it foolhardy to do so. (waltcrawford@gmail.com)

As for the rest of gold OA, the gray area and the questionable publishers, this might be worth monitoring, but I’ve said above that I’m not willing to sign up for another round based on Beall’s lists, and I don’t know of any other good way to do this.

PPPPredatory Article Counts: An Investigation Part 3

Wednesday, November 11th, 2015

If you haven’t read Part 1 and Part 2—and, to be sure, Cites & Insights December 2015—none of this will make much sense.

What would happen if I replicated the sampling techniques actually used in the study (to the extent that I understand the article)?

I couldn’t precisely replicate the sampling. My working dataset had already been stripped of several thousand “journals” and quite a few “publishers,” and I took Beall’s lists a few months before Shen/Björk did. (In the end, the number of journals and “journals” in their study was less than 20% larger than in my earlier analysis, although there’s no way of knowing how many of those journals and “journals” actually published anything. In any case, if the Shen/Björk numbers had been 20% or 25% larger than mine, I would have said “sounds reasonable” and let it go at that.)

For each tier in the Shen/Björk article, I took two samples, both using random techniques, and for all but Tier 4, I used two projection techniques—one based on the number of active true gold OA journals in the tier, one based on all journals in the tier. (For Tier 4, singleton journals, there’s not enough difference between the two to matter much.) In each tier, I used a sample size and technique that followed the description in the Shen/Björk article.
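For readers who think in code rather than in spreadsheet steps, here’s a minimal sketch of that per-tier projection. It is not my actual worksheet: the field names, the tier boundaries and the sample sizes are placeholders, and the real tiers were defined by publisher size.

    import random

    def project_tier_articles(tier_journals, sample_size, basis="all"):
        # tier_journals: list of dicts like {"articles_2014": 12, "active_gold_oa": True}
        # basis="all" scales the sample up by every journal in the tier;
        # basis="active" scales only by the active true gold OA journals.
        sample = random.sample(tier_journals, sample_size)
        if basis == "active":
            universe = [j for j in tier_journals if j["active_gold_oa"]]
            picked = [j for j in sample if j["active_gold_oa"]]
        else:
            universe, picked = tier_journals, sample
        if not picked:
            return 0.0
        multiplier = len(universe) / len(picked)
        return multiplier * sum(j["articles_2014"] for j in picked)

    # One stratified estimate: sum the per-tier projections.
    # total_2014 = sum(project_tier_articles(t, size_for_tier(t)) for t in tiers)

Each tier contributes its own multiplier, which is one reason a single prolific tier can swing the combined estimate so widely.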

The results were interesting. Extreme differences between the lowest sample and the highest sample include 2014 article counts for Tier 2 (publishers with 10 to 99 journals), the largest group of journals and articles, where the high sample was 97,856 and the low—actually, in this case, the actual counted figure—was 46,770: that’s a 2.09 to 1 range. There’s also maximum revenue, where the high sample for Tier 2 was $30,327,882 while the low sample (once again the counted figure) was $9,574,648: a 3.17 to 1 range—in other words, a range wide enough to explain the difference between my figures and the Shen/Björk figures purely on the basis of sample deviation. (It could be worse: the 2013 projected revenue figures for Tier 2 range from a high of $41,630,771 to a low of $8,644,820, a range of 4.82 to 1! In this case, the actual sum was just a bit higher than the low sample, at $8,797,861.)

Once you add the tiers together, the extremes narrow somewhat. Table 7 shows the low, actual, and high total article projections, noting that the 2013, 2012, and 2011 low and high might not be the actual extremes (I took the lowest and highest 2014 figures for each tier, using the other figures from that sample.) It’s still a broad range for each year, but not quite as broad. (The actual numbers are higher than in earlier tables largely because journals in DOAJ had not been excluded at the time this dataset was captured.)

Projection 2014 2013 2012 2011
Low 134,980 130,931 92,020 45,605
Actual 135,294 115,698 85,601 54,545
High 208,325 172,371 136,256 84,282

Table 7. Article projections by year, stratified sample

The range for 2014 is 1.54 to 1: broad, but narrower than in the first two attempts. On the other hand, the range for maximum revenues is larger than in the first two attempts: 2.18 to 1 for 2014 and a very broad 2.46 to 1 for 2013, as in Table 8.

Projection 2014 2013
Low $30,651,963 $29,145,954
Actual $37,375,352 $34,460,968
High $66,945,855 $71,589,249

Table 8. Maximum revenue projections, stratified sample

Note that the high figures here are pretty close to those offered by Shen/Björk, whereas the high mark for projected article count is still less than half that suggested by Shen/Björk. (Note also that in Table 7, the actual counts for 2013 and 2012 are actually lower than the lowest combined samples!)

For the graphically inclined, Figure 4 shows the low, actual and high projections for the third sample. This graph is not comparable to the earlier ones, since the horizontal axis is years rather than samples.

Figure 4. Estimated article counts by year, stratified

It’s probably worth noting that, even after removing thousands of “journals” and quite a few publishers in earlier steps, it’s still the case that only 57% of the apparent journals were actual, active gold OA journals—a percentage ranging from 55% for Tier 1 publishers to 61% for Tier 3.

Conclusion

It does appear that, for projected articles, the stratified sampling methodology used by Shen/Björk may work better than using a pure random sample across all journals—but for projected revenues, it’s considerably worse.

This attempt could account for the revenue discrepancy, which is in any case a much smaller one (as noted, my average APC per article is considerably higher than Shen/Björk’s)—but it doesn’t fully explain the huge difference in article counts.

Overall Conclusions

I do not doubt that Shen/Björk followed sound statistical methodologies, which is quite different from agreeing that the Beall lists make a proper subject for study. The article didn’t identify the number of worthless articles or the amount spent on them; it attempted to identify the number of articles published by publishers Beall disapproved of in late summer 2014, which is an entirely different matter.

That set aside, how did the Shen/Björk sampling and my nearly-complete survey wind up so far apart? I see four likely reasons:

  • While Shen/Björk accounted for empty journals (but didn’t encounter as many as I did), they did not control for journals that have articles but are not gold OA journals. That makes a significant difference.
  • Sampling is not the same as counting, and the more heterogeneous the universe, the more that’s true. That explains most of the differences, I believe (on the revenue side, it can explain all of them).
  • The first two reasons, enhanced by two or three months of additional listings, combined to yield a much higher estimate of active journals than my survey: more than twice as many.
  • The second reason resulted in a much higher average number of articles per journal than in my survey (53 as compared to 36), which, combined with the doubled number of journals, neatly explains the huge difference in article counts.

The net result is that, while Shen/Björk carried out a plausible sampling project, the final numbers raise needless alarm about the extent of “bad” articles. Even if we accept that all articles in these projections are somehow defective, which I do not, the total of such articles in 2014 appears to be considerably less than one-third of the number of articles published in serious gold OA journals (that is, those in DOAJ)—not the “nearly as many” the study might lead one to assume.

No, I do not plan to do a followup survey of publishers and journals in the Beall lists. It’s tempting in some ways, but it’s not a good use of my time (or anybody else’s time, I suggest). A much better investigation of the lists would focus on three more fundamental issues:

  • Is each publisher on the primary list so fundamentally flawed that every journal in its list should be regarded as ppppredatory?
  • Is each journal on the standalone-journal list actually ppppredatory?
  • In both cases, has Beall made a clear and cogent case for such labeling?

The first two issues are far beyond my ken; as to the first, there’s a huge difference between a publisher having some bad journals and it making sense to dismiss all of that publisher’s journals. (See my longer PPPPredatory piece for a discussion of that.)

Then there’s that final bullet…

[In closing: for this and the last three posts—yes, including the Gunslingers one—may I once again say how nice Word’s post-to-blog feature is? It’s a template in Word 2013, but it works the same way, and works very well.]

PPPPredatory Article Counts: An Investigation Part 2

Monday, November 9th, 2015

If you haven’t already done so, please read Part 1—otherwise, this second part of an eventual C&I article may not make much sense.

Second Attempt: Untrimmed List

The first five samples in Part 1 showed that even a 20% sample could yield extreme results over a heterogeneous universe, especially if the randomization was less than ideal.

Given that the most obvious explanation for the data discrepancies is sampling, I thought it might be worth doing a second set of samples, this time each one being a considerably smaller portion of the universe. I decided to use the same sample size as in the Shen/Björk study, 613 journals—and this time the universe was the full figshare dataset Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. I assigned RAND() on each row, froze the results, then sorted by that column. Each sample was 613 journals; I took 11 samples (leaving 205 journals unsampled but included in the total figures). I adjusted the multipliers.
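In case the mechanics aren’t clear from the spreadsheet description, here is roughly what that amounts to in code. This is a hedged sketch, not my actual workflow: the file and column names are assumptions, and Excel’s RAND() is stood in for by NumPy’s generator.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("beall_not_in_doaj_subset.csv")   # the 6,948-row figshare dataset (file name assumed)

    rng = np.random.default_rng()                      # stand-in for RAND(): assign, freeze, sort
    df = df.assign(key=rng.random(len(df))).sort_values("key").reset_index(drop=True)

    SAMPLE = 613                                       # same sample size as Shen/Björk
    multiplier = len(df) / SAMPLE                      # projection factor for each slice

    for i in range(11):                                # 11 samples; 205 rows left unsampled
        chunk = df.iloc[i * SAMPLE:(i + 1) * SAMPLE]
        projected = chunk["articles_2014"].sum() * multiplier
        print(f"sample {i + 1}: {projected:,.0f} projected 2014 articles")

The “Total” rows in the tables below are the full-dataset counts, so no multiplier applies to them.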

More than half of the rows in the full dataset have no articles (and no revenue). You could reasonably expect extremely varied results—e.g., it wouldn’t be improbable for a sample to consist entirely of no-article journals or of all journals with articles (thus yielding numbers more than twice as high as one might expect).

In this case, the results have a “dog that did not bark in the night” feel to them. Table 3 shows the 11 sample projections and the total article counts.

Sample 2014 2013 2012 2011
6 88,165 72,034 40,801 20,473
10 91,186 75,025 50,820 31,523
5 95,338 93,886 56,047 27,893
4 97,313 80,978 51,343 36,039
1 99,956 97,153 83,606 52,983
2 105,967 87,468 50,617 20,880
7 106,693 72,658 40,119 29,055
Total 121,311 99,994 64,325 34,543
9 127,747 100,653 73,326 32,075
3 140,292 122,128 77,958 36,634
8 154,754 114,360 79,323 35,632
11 160,591 143,312 91,011 53,579

Table 3. Article projections by year, 9% samples

Although these are much smaller samples (percentagewise) over a much more heterogeneous dataset, the range of results is, while certainly wider than for samples 6-10 in the first attempt, not dramatically so. Figure 3 shows the same data in graphic form (using the same formatting as Figure 1 for easy comparison).

Figure 3. Estimated article counts by year, 9% sample

The maximum revenue samples show a slightly wider range than the article count projections: 2.01 to 1, as compared to 1.82 to 1. That’s still a fairly narrow range. Table 4 shows the figures, with samples in the same order as for article projections (Table 3).

Sample 2014 2013
6 $27,904,972 $24,277,062
10 $32,666,922 $27,451,802
5 $19,479,393 $20,980,689
4 $24,975,329 $25,507,720
1 $30,434,762 $30,221,463
2 $30,793,406 $25,461,851
7 $30,725,482 $21,497,760
Total $31,863,087 $28,537,554
9 $29,642,696 $24,386,137
3 $39,104,335 $41,415,454
8 $36,654,201 $29,382,149
11 $35,420,001 $34,710,583

Table 4. Estimated Maximum Revenue, 9% samples

As with maximum revenue, so with cost per article: a broader range than for the last five samples (and total) in the first attempt, but a fairly narrow range, at 1.75 to 1, as shown in Table 5.

Sample 2014 2013
6 $316.51 $337.02
10 $358.25 $365.90
5 $204.32 $223.47
4 $256.65 $315.00
1 $304.48 $311.07
2 $290.59 $291.10
7 $287.98 $295.88
Total $262.66 $285.39
9 $232.04 $242.28
3 $278.73 $339.12
8 $236.85 $256.93
11 $220.56 $242.20

Table 5. APC per article, 9% samples and total

Rather than providing redundant graphs, I’ll provide one more table: the average (mean) articles per journal (ignoring empty journals), in Table 6.

Sample 2014 2013 2012 2011
6 27.85 20.59 20.66 16.79
10 29.35 20.75 22.73 23.10
1 30.06 25.54 38.13 38.41
5 30.26 27.63 27.18 20.88
4 31.46 22.86 23.42 29.90
2 33.94 24.79 25.08 15.14
7 34.66 20.68 20.17 22.48
Total 36.80 27.47 30.08 25.51
3 42.01 34.90 38.63 27.13
9 42.10 29.75 35.82 26.30
8 43.86 31.25 38.20 26.39
11 47.88 40.12 47.13 38.04

Table 6. Average articles per journal, 9% samples

Note that Table 6 is arranged from lowest average in 2014 to highest average; the rows are not (quite) in the same order as in Tables 3-5. The range here is 1.72 to 1, an even narrower range. On the other hand, sample 11 does show an average articles per journal figure that’s not much below the Shen/Björk estimate.

One More Try

What would happen if I assigned a new random number (again using RAND()) in each row and reran the eleven samples?

The results do begin to suggest that the difference between my nearly-full survey and the Shen/Björk study could be due to sample variation. To wit, this time the article totals range from 64,933 to 169,739, a range of 2.61 to 1. The lowest figure is less than half the actual figure, so it’s not entirely implausible that a sample could yield a number three times as high.

The total revenue range is also wider, from $22.7 million to $41.3 million, a range of 1.82 to 1. It’s still a stretch to get to $74 million, but not as much of a stretch. And in this set of samples, the cost per article ranges from $169.22 to $402.89, a range of 2.38 to 1. I should also note that at least one sample shows a mean articles-per-journal figure of 51.5, essentially identical to the Shen/Björk figure, and that $169.22 is similar to the Shen/Björk figure.

Conclusion

Sampling variation with 9% samples could yield numbers as far from the full-survey numbers as those in the Shen/Björk article, although for total article count it’s still a pretty big stretch.

But that article was using closer to 5% samples—and they weren’t actually random samples. Could that explain the differences?

[More to come? Maybe, maybe not.]

PPPPredatory Article Counts: An Investigation, Part 1

Monday, November 9th, 2015

If you read all the way through the December 2015 essay Ethics and Access 2015 (and if you didn’t, you really should!), you may remember a trio of items in The Lists! section relating to “‘Predatory’ open access: a longitudinal study of article volumes and market characteristics” (by Cenyu Shen and Bo-Christer Björk in BMC Medicine). Briefly, the two scholars took Beall’s lists, looked at 613 journals out of nearly 12,000, and concluded that “predatory” journals published 420,000 articles in 2014, a “stunning” increase from 50,000 articles in 2010—and that there were around 8,000 “active” journals that seemed to meet Jeffrey Beall’s criteria for being PPPPredatory (I’m using the short form).

I was indeed stunned by the article—because I had completed a full survey of the Beall lists and found far fewer articles: less than half as many. Indeed, I didn’t think there were anywhere near 8,000 active journals either—if “active” means “actually publishing Gold OA articles,” I’d put the number at roughly half that.

The authors admitted that the article estimate was just that—that it could be off by as much as 90,000. Of course, news reports didn’t focus on that: they focused on the Big Number.

Lars Bjørnshauge at DOAJ questioned the numbers and, in commenting on one report, quoted some of my own work. I looked at that work more carefully and concluded that a good estimate for 2014 was around 135,000 articles, or less than one-third of the Shen/Björk number—and my estimate was based on a nearly 100% actual count, not an estimate from around 6% of the journals.

As you may also remember, Björk dismissed these full-survey numbers with this statement:

“Our research has been carefully done using standard scientific techniques and has been peer reviewed by three substance editors and a statistical editor. We have no wish to engage in a possibly heated discussion within the OA community, particularly around the controversial subject of Beall’s list. Others are free to comment on our article and publish alternative results, we have explained our methods and reasoning quite carefully in the article itself and leave it there.”

I found that response unsatisfying (and find that I’ll approach Björk’s work with a much more jaundiced eye in the future). As I expected, the small-sample report continued (continues?) to get wider publicity, while my near-complete survey got very little.

The situation continued to bother me: I don’t doubt that the authors followed appropriate methodology, yet I wondered how the results could be so wrong. How could they come up with more than twice as many active OA PPPPredatory journals and more than three times as many articles?

So I thought I’d look at my own work a little more, to see whether sampling could account for the wild deviation.

First Attempt: The Trimmed List

I began by taking my own copy of Crawford, Walt (2015): Open Access Journals 2014, Beall-list (not in DOAJ) subset. figshare. The keys on each row of that 6,948-row spreadsheet are designed to be random. The spreadsheet includes not only the active Gold OA journals but also 3,673 others, to wit:

  • 2,045 that had not published any articles between 2011 and 2014, including eight that had explicitly ceased.
  • 183 that were hybrid journals, not gold OA.
  • 413 that weren’t really OA by my standards.
  • 279 that were difficult to count (more on those later).
  • 753 that were either unreachable or wholly unworkable.

There were two additional exclusions: I deleted around 1,100 journals (at least 300 of them empty) from publishers that wouldn’t provide hyperlinked lists of their journal titles—and I deleted journals that are in DOAJ because there were even more reasons than usual to doubt the PPPPredatory label. (Note that the biggest group of that double-listed category, MDPI, has more recently been removed from Beall’s list.)

I wound up with 3,275 active gold OA journals, what I’ll call “secondary OA journals,” since I think of the DOAJ members as “serious OA journals” and don’t have a good alternative term.

As I started reworking the numbers, I thought there should be some accounting for the opaque publishers and journals. In practice, I knew from some extended sampling that most journals from opaque publishers were either empty or very small—and my sense is that most opaque journals (usually opaque because there are no online tables of contents, only downloadable PDF issues, but sometimes because there really aren’t streams of articles as such) are also fairly small. But still, they should be included. Since these two groups (excluding the 300-odd journals from opaque publishers that I knew were empty) added up to 32% of the count of active journals, I multiplied article and revenue counts by 1.32. (I think this is too high, but feel it’s better to err on the side that will get closer to the Shen/Björk numbers.)
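In arithmetic terms, the adjustment is nothing more than a scale factor (the numbers below are hypothetical, just to show the step):

    counted_articles = 100_000                           # hypothetical year's tally from countable journals
    opaque_share = 0.32                                  # opaque/hard-to-count journals as a share of active journals
    adjusted = counted_articles * (1 + opaque_share)     # 132,000: the 1.32 multiplier at work

This implicitly treats the uncounted journals as being as prolific as the counted ones, which is why I say the 1.32 factor probably errs on the high side.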

I did not factor in the DOAJ-included numbers, but the total of those and other already-counted additional articles (doubling 2014 since I only counted January-June) is around 43,000 for 2014; around 39,000 for 2013; around 37,000 for 2012; and around 28,000 for 2011. You can add them to the counts below if you wish—although I don’t believe these represent questionable articles.

Methodology

Since 613 was the sample size in the Shen/Björk article, I took a similar size sample as a starting point, then adjusted it so I could take five samples that would, among them, include everything: that is, a sample size of 655 journals.

For each sample (sorting by the pseudorandom key, then starting from the beginning and working my way down), I took the article count for each year, multiplying by appropriate factors, and the revenue counts for 2013 and 2014 (determined by multiplying the 2014 APC by the annual article counts, then applying the appropriate multipliers—I didn’t go back before 2013 because APCs were too likely to have changed). I calculated average APC per article for 2014 and 2013 by straight division—and also calculated the average article count (not including zero-count journals because the cells were blank rather than zero) and median article count (also excluding zero-count journals). I also calculated standard deviation just for amusement.

“Zero-count journals? Didn’t you eliminate zero-count journals?” I eliminated journals that had no articles in any year 2011-2014, but quite a few journals have articles in some years and not in others—including, of course, newish journals. For example, there were only 2,393 journals with articles in the first half of 2014; 2,714 in 2013; 1,557 in 2012 and 996 in 2011.

I also calculated the same figures for the full set.
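Here is the same methodology as a rough Python sketch, for anyone who wants to replicate the idea. It is only a sketch under assumptions: the column names are invented, the 2014 figures are first-half counts that get doubled, and the 1.32 opaque-journal multiplier from above is folded into the projection factor.

    import pandas as pd

    SAMPLE = 655                       # five samples of 655 cover all 3,275 journals
    FACTOR = (3275 / SAMPLE) * 1.32    # scale one sample to the universe, plus the opaque adjustment

    def summarize(sample: pd.DataFrame) -> dict:
        # blank cells (journals with no articles that year) load as NaN, so they drop out automatically
        articles_2014 = sample["articles_h1_2014"].sum() * 2            # first-half count, doubled
        revenue_2014 = (sample["apc_2014"] * sample["articles_h1_2014"] * 2).sum()
        with_articles = sample["articles_h1_2014"].dropna()
        return {
            "projected_articles_2014": articles_2014 * FACTOR,
            "projected_revenue_2014": revenue_2014 * FACTOR,
            "apc_per_article_2014": revenue_2014 / articles_2014 if articles_2014 else None,
            "mean_articles_per_journal": with_articles.mean() * 2,
            "median_articles_per_journal": with_articles.median() * 2,
        }

    # df sorted by the pseudorandom key, then split into five consecutive slices:
    # results = [summarize(df.iloc[i * SAMPLE:(i + 1) * SAMPLE]) for i in range(5)]

The APC-per-article figure comes out the same whether or not you apply the multiplier, since the factor cancels in the division.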

Looking at the results, I was a little startled by the wide range, given that these samples were 20% of the whole: the 2014 projected article totals (doubling actual article counts, of course) ranged from 5,755 to 180,229! Now, of course, even that highest count is still much less than half of the Shen/Björk count—and just a bit over half if you add in the DOAJ-listed count.

So I added another column and assigned a random number to each row, using Excel’s RAND function, then froze the results and took a new set of five samples. The results were much narrower in range: 99,713 to 136,660. The actual total: 121,311 (including the 1.32 multiplier but not DOAJ numbers).

Table 1 shows the projected (or actual) article totals year-by-year and sample-by-sample, sorted so the lowest 2014 projection appears first. Note that samples 1-5 use the assigned pseudorandom keys, while samples 6-10 use Excel’s RAND function for randomization. Clearly, the latter yields more plausible results.

Sample 2014 2013 2012 2011
4 5,755 21,734 15,959 10,223
5 91,067 85,734 66,594 51,473
8 99,713 84,797 55,209 33,733
7 115,368 91,964 57,664 27,595
Total 121,311 99,994 64,325 34,543
6 123,050 104,808 57,295 22,605
9 131,762 106,181 82,790 53,869
10 136,660 112,220 68,666 34,914
3 159,284 121,097 75,933 27,628
1 170,148 138,890 87,371 56,027
2 180,299 132,515 75,768 27,364

Table 1. Estimated article counts by year

Adding the 43,000-odd articles from DOAJ-listed journals would bring these totals (ignoring samples 1-5) to around 143,000 to around 180,000 articles, with the most likely value around 165,000 articles: more than one-third of the Shen/Björk estimate but a lot less than half.

Note that “120,000 plus or minus 25,000” as an estimate actually covers all five samples that used the RAND-function randomization. Figure 1 shows the same data as Table 1, but in graphic form.

Figure 1. Estimated article counts by year

How much revenue might those articles have brought in, and what’s the APC per article? Keeping the order of samples the same as for Table 1 and Figure 1, Table 2 and Figure 2 show the maximum revenue (not allowing for waivers and discounts).

Sample 2014 2013
4 $2,952,893 $10,473,269
5 $1,677,496 $3,322,988
8 $30,184,480 $23,906,771
7 $35,939,416 $35,825,909
Total $31,863,087 $28,537,554
6 $31,010,206 $27,926,897
9 $31,165,754 $29,071,218
10 $31,015,578 $25,956,975
3 $82,610,167 $65,930,614
1 $34,247,360 $32,892,328
2 $37,827,517 $30,068,570

Table 2. Estimated maximum revenue, 2014 and 2013

This time there are two extremely low figures and one extremely high figure—with samples 6 through 10 all within $4.1 million of the actual maximum figure (for 2014: for 2013, the deviation is $7.3 million). Compare the $31.86 million calculated costs here with the $74 million estimated by Shen/Björk: the full-survey number is less than half as much.

Figure 2 shows the same information in graphical form.

Figure 2. Estimated maximum revenue, 2014 and 2013

Looking at APC per article, we run into an anomaly: where the Shen/Björk estimate is $178 for 2014, the calculated average for the full survey is considerably higher, $262.66. The range of the ten samples is from a low of $18.42 to a high of $513.08, but the five “good” samples range from $226.95 to $302.71, a reasonably narrow range.

Finally, consider the mean (average) number of articles per journal in 2014, in journals that had articles. The Shen/Björk figure is around 50; my survey yields 36.8. In fact, I show only 327 journals with at least 25 articles in the first half of 2014 (and only 267 with at least 50 articles in all of 2013).

The median is even lower—12 articles, or six in the first half—and that’s not too surprising. The standard deviation in most years was at least twice the average: as usual, these journals are very heterogeneous. How heterogeneous? In the first half of 2014, three journals had more than 1,000 articles each (but fewer than 1,300); six more had at least 500 articles; 16 had 250 to 499 articles—but at the same time, only 819 of the total had at least 11 articles in the first half of 2014, and only 1,544 had at least five articles in those six months.

Conclusion

I could find no way to get from these samples to the Shen/Björk figures. Not even close. They show too many active journals by roughly a factor of two, too many articles by a factor of close to three, and too much revenue by a factor of two—and too many articles per journal as well.

[Part 1 of 2 or 3…]

Note: This and following posts will also appear, probably in somewhat revised form, in the January 2016 issue of Cites & Insights.

Gunslinger Classics Disc 12

Saturday, November 7th, 2015

As usual for these 12-disc fifty-movie sets, one disc has six short movies: this one. These are all oaters, B-movie programmers of an hour or less, mostly low-budget short-plot flicks. Four with John Wayne; one each with Bob Steele and Crash Corrigan.

Texas Terror, 1935, b&w. Robert N. Bradbury (dir. & screenplay), John Wayne, Lucile Browne, LeRoy Mason, Fern Emmett, George Hayes. 0:51.

Wayne’s the newly-elected sheriff. The man who pretty much raised him comes by the office, shows the wad of cash he’s withdrawn from Wells Fargo to restock his ranch now that his daughter’s coming home in a few months, notes that he’d tied his horse up behind Wells Fargo, and rides off. Almost immediately thereafter, three gunmen rob Wells Fargo; in chasing them, Wayne winds up in a shootout with results that make him believe (a) that he—Wayne—shot the old man (we know it was one of the gunmen) and (b) that the old man might have been one of the bandits, since they dumped the money bag and one wad of bills on his corpse. After the town (jury?) concludes that the old man had to have been a bandit—after all, people saw him tie up his horse behind Wells Fargo—Wayne resigns his position, turning it back over to the old sheriff (George Hayes, not in the Gabby persona). Wayne goes off, grows a beard, and becomes…well, that’s not clear.

Lots’o’plot, much of it involving the daughter, and most of it makes just as much sense as the idea that Wayne wouldn’t mention during the court hearing that the old man had told him his horse was tied up where it was. But hey, if you like lots of riding, some shooting, and a band of friendly Indians saving the day, I guess it’s OK. Generously, $0.75.

Wildfire, 1945, color. Robert Tansey (dir.), Bob Steele, Sterling Holloway, John Miljan, Eddie Dean. 0:59

An unusual entry: late (1945) and in color, but still a one-hour flick with lots of riding, lots of shooting, a couple of good fights—and a singing cowboy (actually sheriff in this case, Eddie Dean) who gets the girl. The plot, not in the order it unfolds: a gang is rustling all the horses from ranches in one valley and blaming it on Wildfire, a wild stallion—and it turns out horse theft is a sideline: the motivation is for one gang member to buy up the ranches cheap, since he already has a contract to sell them to a big ranch for a big profit. Two itinerant horse-traders with a tendency to stay on the right side of the law wind up in the middle of this and expose it.

The color’s a little faded, but the whole thing’s good enough that I’d probably give it six bits—except for one thing: however they “digitized” this, at several points it looks like a projector losing its grip on film sprockets, losing chunks of the action and disrupting continuity. With that, it goes down to $0.50.

Paradise Canyon, 1935, b&w. Carl Pierson (dir.), John Wayne, Marion Burns, Reed Howes, Earle Hodgins, Gino Corrado, Yakima Canutt. 0:53.

John Wayne again, this time as a government agent sent to investigate counterfeit traffic that may be connected to a medicine show. (One person went to jail for ten years for counterfeiting, and may be running such a show.) He finds the show—which has a habit of leaving towns suddenly, either for not paying debts or because the proprietor tends to drink his own tonic, go to town, bust things up and not pay for them (his tonic is “90% alcohol,” which is 180 proof and should make it flammable). For that matter, he helps the show evade arrest by getting them across the Arizona/New Mexico border just ahead of the law, and joins the show as a sharpshooter.

The next town is a New Mexico/Mexico border town—and it turns out the medicine show’s not really involved any more: instead, the counterfeiter, who framed the medicine man, is now operating out of a saloon on the Mexican side. One thing leads to another with lots of riding, lots of shooting and some true sharpshooting; of course the good guys win and John Wayne gets the girl—with a mildly cute surprise ending.

The highlight is probably the medicine man’s pitch, a truly loopy piece of speechifying, including his assurance that he once knew a man without a tooth in his head…and that man became the best bass drum player he ever knew! All it takes is determination, and Doc Carter’s Famous Indian Remedy.

Not great, not terrible. Once again we have Yakima Canutt doing something more than trick riding—he’s the villain in the piece. (Wayne does not sing; the two singing entertainers in the medicine show are…well, that’s six minutes I’ll never get back again.) I’ll give it $0.75.

The Lucky Texan, 1934, b&w. Robert N. Bradbury (dir. & writer), John Wayne, Barbara Sheldon, Lloyd Whitlock, George Hayes, Yakima Canutt. 0:55.

This time, John Wayne’s Jerry Mason is just out of college and back at the ranch of old geezer Jake Benson, who more or less brought him up—and he finds that the ranch’s cattle have all been rustled, but Benson’s opening up a blacksmith shop in town. Wayne immediately starts working there, and an early customer’s horse had picked up a stone—a stone that, when Wayne looks at it, seems to have gold in it. (It must have been a thriving smithy, since the geezer refuses payment for dealing with the horse’s problem…) Oh, and Benson’s pretty young granddaughter’s about to finish college (thanks in part to the geezer’s monthly checks) and return soon.

One thing leads to another, and we have Wayne and Benson (not a TV series, but it could be) getting really good pure gold out of the site where they figured the horse had been; when they go to sell it, the assayer pays them…and then notes to his sidekick that he now “owned” most of Benson’s cattle.

More plot; the villains trick the geezer into signing a deed to the ranch; the sheriff’s son shoots the banker in a holdup just after Benson pays off the loan for the blacksmith shop (and Benson seems like a likely culprit until John Wayne Saves the Day)…and more. As always, it all works out in the end, which involves the usual Wayne-and-the-girl wedding. No singing; lots of fist fights (with no phony sounds—lots of grunting, but not much more); oddly enough, although two men are shot (and two others are shot at), there’s not a single death in the movie. There is, on the other hand, Wayne surfing down a sluice riding on a tree branch—and a chase scene involving Hayes semi-driving a car (he’d never driven before) and the villains on a powered railway car, in an almost slapsticky sequence. (That long chase is also the only time in an old Western I’ve ever seen The Hero, Wayne in this case, jump from his horse to tackle the villain on his horse…and miss, tumbling down a hill.)

George Hayes gets to show his dramatic abilities pretending to be his sister (you’d have to see it—he’d played the lead in Charley’s Aunt many years before, and does a good job in drag), and although he now has Gabby Hayes’ intonation and look, he’s not playing the fool by any means, and not even the sidekick—after all, it’s his ranch and his blacksmith shop. Another one with Yakima Canutt doing more than stunt riding (although he did plenty of that—apparently chasing himself at one point), once again playing a bad guy (something he was very good at). (I would note that many of the reviews at IMDB call George Hayes “Gabby” or “Gaby” Hayes—but he didn’t become Gabby Hayes until later on in his career.)

Maybe I’m getting soft as I near the end of this marathon, but this one seemed pretty good; I’ll give it $1.

Riders of the Whistling Skull, 1937, b&w. Mack V. Wright (dir.), Robert Livingston, Ray Corrigan, Max Terhune, Mary Russell, Roger Williams, Yakima Canutt, Fern Emmett, Chief Thundercloud. 0:58 [0:53]

A few archaeologists and a trio of cowboys known as The Three Mesquiteers are out to plunder a lost Indian city, or as they put it, rediscover it and recover all the golden treasure. A bunch of Native Americans don’t like this idea, and attempt to discourage them. One half-Native American, who passes himself off as one of the party, had previously kidnapped the father of the beautiful young (female) anthropologist and has been torturing him to reveal the location of the treasure.

Of course, this being a B Western from the 1930s, the plunderers are the heroes and it’s a great thing that they manage to shoot at least half a dozen Native Americans and bury more of them under a wildly implausible collapse of half a mountain. Naturally, it all ends “well,” with the most handsome of the Mesquiteers getting the girl and an older and plainer woman (another sort-of archaeologist) getting the less handsome of the Mesquiteers. (In this one, Yakima Canutt plays the American Indian guide who’s in cahoots with the half-Native American.)

Reasonably well staged and with continuous action, but it’s also blatantly offensive. If you can ignore that, maybe $0.75.

Randy Rides Alone, 1934, b&w. Harry L. Fraser (dir.), John Wayne, Alberta Vaughn, George Hayes, Yakima Canutt, Earl Dwire. 0:53.

This cowboy riding along tops a ridge and spots the roof of a building—a halfway house saloon. He hears the honky-tonk piano and goes in…only to discover that everybody’s dead and the piano is a player piano. As he looks over the situation, including an open safe, the sheriff and his posse show up…and, naturally enough, arrest the cowboy. But we saw eyes moving in a painting on the wall…and after they’ve gone, a young woman steps out and inspects the scene.

Thus begins a story involving a hearing mute who runs a local store, the young woman breaking the cowboy out of jail so he can find the real killers, a gang hideaway for a gang run by…oh, let’s not give it all away. Lots of riding, a fistfight or two, some shooting, and of course all ends well. This time, George Hayes (not at all in the “Gabby” persona) plays the lead villain (and the—spoiler—mute shopkeeper) and Yakima Canutt plays the chief henchman.

The flick seems padded at 53 minutes, and Wayne is notable mostly for his young good looks. Generously, $0.75.

Double digits!

Friday, November 6th, 2015

I am delighted to say that The Gold OA Landscape 2011-2014 is now in the double digits, with two Ingram paperback sales and one Amazon paperback sale reported. (I’m guessing that I only see Ingram and Amazon numbers once a month. In terms of progress toward $ goals, three Ingram/Amazon sales equal about 1.3 Lulu sales, but I’m nonetheless delighted to see them.)

The balance still heavily favors print: ten paperback copies, two PDF site-licensed ebooks (http://www.lulu.com/shop/walt-crawford/the-gold-oa-landscape-2011-2014/ebook/product-22353903.html). (The ebooks are only available through Lulu because the global marketing channel will only accept ePub ebooks. Don’t ask me.)

Added a bit later: And thanks to worldcat.org, I see that five universities have the book–and that it’s available from Barnes & Noble as well. I think Ingram, B&N and Amazon are the totality of Lulu’s global marketing arrangements…