GOA7 Pass 2: Updating

Friday, April 22nd, 2022

I’ll add to this post as I progress through Pass 2…

April 22: Parts 1 and 2

Part 1 (adding possible 2021 articles to journals that had none) and Part 2 (adding later 2021 issues to journals that seemed as though they should have more) have both been completed, adding around 3,500 2021 articles and increasing the count of journals with 2021 articles by around 100.

The key numbers now, excluding Parts 3 and 4 of Pass 2, are:

Journals that won’t be scanned further: 14,472.

Journals with 2021 articles: 14,675.

2021 articles: 1,233,706.

April 22 (2): Journals removed

I reconsidered when journals no longer in DOAJ should be removed, and have just done the removals for Part 3 and Part 4. (I went back to June 2021 for removal dates, but in fact all of the removed journals left DOAJ in 2022.)

In all, 62 journals were removed from Part 3 (xx), leaving 1,044 to be rechecked, and 15 were removed from Part 4 (xm), leaving 659 to be rechecked. These 77 journals–marked “xo”–will not be included in the study, although they may be included in one table in Chapter 2 (Exclusions and Special Cases).

April 30: Part 3 complete

The 1,044 xx journals have been rechecked, with reasonably good success. Although there were 312 more cases than in last year’s scan, the number that couldn’t be resolved only increased from 126 to 147. Of the remainder, 177 were fine when retested (which usually means temporary server problems); 41 were either fine or found on an alternate path but hadn’t published since 2019; 651 were found and counted using alternative routes; 10 were dead/duplicates; three aren’t OA journals (two now require login); and 35 were no longer in DOAJ. This was a “lumpy” pass: 223 xx cases came from Sciendo, 80 arose from the DergiPark move from .gov to .org, and 55 came from SciELO instances where URLs hadn’t been updated.

The current totals for non-problematic journals: 16,374 total; 15,446 with 2021 articles; 1,268,018 2021 articles; 385 with no post-2019 articles; 89 dead/duplicate cases (no articles since 2015). Of all these, 2,162 appear to be new to DOAJ and 14,212 are continuing.
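As a quick sanity check on those totals, here is a minimal Python sketch that confirms the new/continuing split adds up and derives two ratios. The variable names are mine, and the two derived ratios are not figures from the scan itself; they are simple arithmetic on the numbers above.

```python
# Figures quoted in the April 30 update (non-problematic journals only).
total_journals = 16_374
new_to_doaj = 2_162
continuing = 14_212
journals_with_2021 = 15_446
articles_2021 = 1_268_018

# The new/continuing split should account for every journal.
split_matches = (new_to_doaj + continuing == total_journals)

# Derived ratios (illustrative arithmetic only, not reported figures).
share_with_2021 = journals_with_2021 / total_journals
articles_per_active_journal = articles_2021 / journals_with_2021

print(f"new + continuing equals total: {split_matches}")
print(f"share of journals with 2021 articles: {share_with_2021:.1%}")
print(f"2021 articles per journal that has any: {articles_per_active_journal:.1f}")
```

The averages hide a very skewed distribution, so treat the per-journal figure as a rough scale indicator only.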

Next step: Part 4 (xm), and a quick recheck on some items. Yes, I’m about 10 days ahead of last year. Cross fingers.

May 3: Part 4(a)

I’ve gone through the 658 xm (malware) journals, with–as expected–modest results: 50 now active, 5 OK but code bi (inactive since 2019), one bx (found through a different URL), one xd (dead/duplicate), and one x0 (no longer in DOAJ).

At the moment, 16,431 journals are ready for processing, 15,496 of which have 2021 articles; there are 1,269,933 2021 articles.
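For anyone who wants to see how the Part 4(a) results move the running totals, here is a rough Python sketch. It rests on two assumptions of mine: that the journal no longer in DOAJ is left out of the ready-for-processing count, and that only the 50 newly active journals add to the count of journals with 2021 articles. Under those assumptions the numbers reconcile.

```python
# Totals after Part 3 (April 30) and after Part 4(a) (May 3).
before = {"journals": 16_374, "with_2021": 15_446, "articles": 1_268_018}
after = {"journals": 16_431, "with_2021": 15_496, "articles": 1_269_933}

# Part 4(a) outcomes that are assumed to join the ready-for-processing set.
# The one journal no longer in DOAJ is assumed to be excluded.
resolved = {"active": 50, "bi": 5, "bx": 1, "xd": 1}

added_journals = sum(resolved.values())
added_with_2021 = resolved["active"]  # assumption: only these have 2021 articles
added_articles = after["articles"] - before["articles"]

journals_reconcile = before["journals"] + added_journals == after["journals"]
with_2021_reconcile = before["with_2021"] + added_with_2021 == after["with_2021"]

print(f"journals added: {added_journals}, reconciles: {journals_reconcile}")
print(f"journals with 2021 articles added: {added_with_2021}, reconciles: {with_2021_reconcile}")
print(f"2021 articles added: {added_articles}")
```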

The remaining 595 xm (malware) journals will have codes compared with those in last year’s study, as will the xx/xm journals previously double-checked: last year’s codes help inform whether journals are included in the full study or held out as exclusions (which appear on a separate page of the eventual Figshare spreadsheet). Then, 601 xm journals that weren’t also xm/xx last year will be checked for the possibility of an alternate route. I’d be pleasantly surprised to find many, but it’s worth the two or three days required.

A couple of notes about the current malware group (excluding 10 that changed from xx to xm in the last phase):

Big clusters by publisher include 37 from Universitas Udayana; 29 from Conselho Nacional de Pesquisa e Pós-graduação em Direito (CONPEDI); 28 from Universitas Negeri Malang; 19 from Diponegoro University; and 14 from Universitas Pendidikan Indonesia.

You may notice something about all but one of those names–and, indeed, breaking down the malware by country shows 403 from Indonesia–just over two-thirds of the total–plus 69 from Brazil and 28 from Ukraine. No other country has more than eight.
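The country shares can be worked out with a few lines of Python. This is only a sketch: I am assuming the denominator is the 595 remaining xm journals mentioned above, which is the figure that makes "just over two-thirds" come out right, and the variable names are mine.

```python
# Assumed denominator: the 595 remaining xm (malware) journals.
malware_total = 595

# Country counts quoted in the post.
by_country = {"Indonesia": 403, "Brazil": 69, "Ukraine": 28}

# Share of the malware group for each named country.
shares = {country: count / malware_total for country, count in by_country.items()}

# Everything not attributed to the three named countries.
all_other_countries = malware_total - sum(by_country.values())

for country, share in shares.items():
    print(f"{country}: {share:.1%}")
print(f"all other countries combined: {all_other_countries} journals")
```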

May 5: Completion of online scans

I’ve now rechecked xm journals looking for possible alternate URLs and a few other checks, with some success–and gone through the complete dataset getting rid of pure duplicates (either from downloading issues or otherwise).

While these numbers may change very slightly as I do consistency checks in the next day or two, I’d guess such changes will be very small–probably less than 1%. I also have a more nuanced understanding of the malware and problematic issues, and it’s encouraging. I’ll lay that out in a separate post–but if you just want the biggest numbers, the final report is likely to include around 16,726 journals (with another 440 on an exclusions page), of which around 15,643 have 2021 articles, with a total of around 1,275,080 2021 articles. All figures subject to change.

GOA7: First pass complete

Wednesday, April 20th, 2022

I’ve finished scanning the 17,302 journals for Gold Open Access 2016-2021.

At the end of that pass, there are 14,572 journals with 2021 articles recorded, for a total of 1,231,397 2021 articles.

But, as usual, there were a lot of problematic journals–1,106 that were either unavailable or not working properly, and 674 with malware or security-certificate issues. These will all be revisited, as will 445 journals that showed no 2021 articles (but no signs of problems) and 225 where at least one 2021 article appeared but it seemed likely that there should have been more.

Oddly enough, I completed Pass 1 on the same date (April 20) as last year, despite checking just under 2,000 more journals. I credit that to fewer emergencies (so far), consistently good broadband and computer performance, restoring use of a direct Excel-to-browser function that had stopped working, and more consistency in many journal webpages. (I also tweeted every day on progress, mostly as a personal goad. It worked.)

Comparing this year’s Pass 2 to last year’s:

  • Last year, I checked 2,211 journals for possible added articles; this year, I’ll be checking 445 that had no 2021 articles and 225 that seem likely to have more. That really compares to the 946 flagged during last year’s pass: I was keeping track of comparable numbers and saw no reason to do another algorithmic pass. (See pp. 230-231 of GOA6 if you want to know what that’s about.) This scan goes rapidly; I’d hope for considerably less than a week.
  • I’ll be deleting problematic journals removed from DOAJ since 1/1/2022 after the remaining pieces rather than before: that should not affect more than 30-40 journals, and since the intent is to be an “end of 2021” snapshot, it seems reasonable.
  • The scan for “xx”–unavailable or unworkable–will involve 1,106 journals, much worse than last year’s 732. Quite a few of these are 404s because DergiPark (Turkey) stopped autoforwarding from its old domain to its new .org domain; a few more are because of an oddity with one SciELO instance: if you already have browser tabs open for two SciELO journals, it rejects any other attempts. Those can all be fixed, and I hope to clear up several hundred of the xx cases (some clear themselves up–e.g., one university’s server was apparently down on one day). This may be a slow process (the 732 took a week).
  • The scan for “xm” (malware and certificate issues) will involve 674 journals, slightly better than last year’s 781, but still about 674 too high. That process, and additional checking for recalcitrant “xx” cases, may take a while. Last year, I completed the final scan on May 19; I’ll be delighted if I do as well this year. After that come a few days of data normalization and about a month to prepare the book and mount the dataset at figshare.

So, well, no real target date, but if emergencies continue to be few and mild, the data and main book might–might–be ready in June. (The country book, which will be very different this year because it will focus on the “long tail,” journals not published by one of the Big 9 or 10, would be ready a few weeks later.)

I may continue to tweet progress reports (I’m always waltcrawford), probably not every day. And if you or your institution want to encourage the continuation of this series, consider buying one or all of the trade paperbacks. I won’t get rich (they’re priced by rounding production cost up to the next 50-cent mark), but I spend a lot of care on making sense of the data and think the print book is a good way to see what I’ve found. But, of course, there will also be a free PDF of each book at my website and a free dataset at figshare, both CC-BY.

Now, to start Pass 2.

Oh: my prediction for overall article count is “probably around 1.3 million”–that is, somewhere between 1.23 million (no articles added in Pass 2: very unlikely) and a few tens of thousands more than that.

Incidentally: this scan included 15,055 journals that continued from previous years and 2,247 added to DOAJ in 2021 (most of them not new that year). As always, a few hundred journals disappeared–and all but about 120 were explicitly removed from DOAJ during 2021.