GOA7: Quick update

May 20th, 2022

I’m making good progress on GOA7–the book. Barring huge disruptions in the next few weeks–not a safe bet–it should be ready (and the dataset published) in early June. Possibly even very late May, but don’t quote me on that. (Then, after resting for a couple of days, comes the subject book.)

I don’t include political commentary in the book, but not because I don’t have feelings, politics in this case being the politics of OA publishing and funding. So there won’t be any notes on the subversion of the OA vision by Big Publishers, even if the potential revenue from author-side fees did increase by nearly half a billion dollars from 2020 to 2021 (to roughly one and three quarters billion).

And “Big Publishers” is a tricky term in this case: MDPI, not a traditional publisher at all, appears to have taken in around $540 million in 2021–up more than $200 million from 2020 (partly by publishing a lot more articles, partly by a $255 increase in average cost per article). MDPI now publishes more DOAJ-listed OA articles than all of the Holtzbrinck Group (Springer, Nature, Frontiers, BMC).

But the book is, as usual, mostly lots of tables and graphs with limited commentary–describing what is, not what I think it should be.

Enough for now. Back to the book.

GOA7: Preliminary baseline

May 5th, 2022

I believe I’ve now completed the online work for Gold Open Access 2016-2021 (GOA7), to be followed by a day or three of consistency/typo checking, a few days of adding data (persistent DOAJ urls for ongoing work, GOA6 fees and status for comparisons, and various columns of derived data), and several weeks of massaging data and preparing the book. Current hope is mid- to late June for the main book and figshare dataset, a few weeks later for the new “long tail” country book. I’m nearly certain the main book will not be ready in May, and it’s possible that emergencies and problems could push it into July, but “sometime in June” is probable.

So where do things stand, with the understanding that consistency checks may cause numbers to shift very slightly?

Refining Problematic-Journal Coding

Last year, the xm (malware) and xx (unavailable/unworkable) codes included journals with the same problem for two or more years, which were excluded, and those where it was new, which were included.

This year, I refined the coding–adding a few new codes, all of which result in exclusion from the overall study:

  • x2: xm in one year, xx in another. One journal, no 2021 articles.
  • xm2: Malware this year and last. 383 journals (of which 47 come from Brazil, 276 from Indonesia, and 23 from Ukraine), of which DOAJ says 193 had 2021 articles, a total of 5,367 2021 articles.
  • xmi: Malware this year and no articles later than 2019. Nine journals.
  • xo: No longer in DOAJ and problematic in some other way. 119 journals.
  • xx2: unavailable/unworkable this year and last. Twenty journals, two with 2021 articles (22 articles).
  • xxi: unavailable and with no DOAJ-listed articles since 2019. 27 journals.

So the excluded page in the eventual Figshare spreadsheet will include 658 journals (including 89 xd and 10 non-OA journals)–about 160 more than last year, but 119 of those are no longer in DOAJ, so this is actually an improvement.
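For anyone who wants to check that 658, here is a minimal Python sketch that tallies the per-code counts quoted above. The dictionary name and layout are only illustrative; the numbers are the ones in this post.

```python
# Journal counts per exclusion code, as quoted in this post.
# "non-OA" is just a label chosen here for the 10 non-OA journals.
excluded = {
    "x2": 1,
    "xm2": 383,
    "xmi": 9,
    "xo": 119,
    "xx2": 20,
    "xxi": 27,
    "xd": 89,
    "non-OA": 10,
}

total_excluded = sum(excluded.values())
print(total_excluded)  # 658
```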

The most encouraging thing is that there are relatively few new malware cases: 142 in all, compared to 260 last year. Of the 142, 96 are from Indonesia; no other country has more than five. There are slightly more unavailable/unworkable cases (90 compared to 75), but that’s not bad.

The Baseline

Subject to small further refinement, here’s what I see, by code:

Code     Journals   With 2021 content   2021 articles
a          15,305              14,876       1,242,250
bi            391
bx            699                 666          29,096
xm            142                  85           2,600
xx             90                  16           1,124
Total      16,627              15,643       1,275,070

Again, subject to refinement…but probably not major changes. Compares to last year’s 15,128 fully analyzed journals and 1,061,256 2020 articles.
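The totals row can be rechecked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the blank cells in the bi row are treated as zero, which is an assumption based on bi meaning "inactive since 2019."

```python
# (journals, journals with 2021 content, 2021 articles) per code, from the table above.
baseline = {
    "a": (15_305, 14_876, 1_242_250),
    "bi": (391, 0, 0),  # blank cells treated as zero (assumption)
    "bx": (699, 666, 29_096),
    "xm": (142, 85, 2_600),
    "xx": (90, 16, 1_124),
}

# Sum each column across all codes.
totals = tuple(sum(column) for column in zip(*baseline.values()))
print(totals)  # (16627, 15643, 1275070)

# Share of journals in the baseline that have 2021 content.
share_with_2021 = totals[1] / totals[0]
print(f"{share_with_2021:.1%}")
```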

GOA7 Pass 2: Updating

April 22nd, 2022

I’ll add to this post as I progress through Pass 2…

April 22: Parts 1 and 2

Part 1 (adding possible 2021 articles to journals that had none) and Part 2 (adding later 2021 issues to journals that seemed as though they should have more) have both been completed, adding around 3,500 2021 articles and increasing the count of journals with 2021 articles by around 100.

The key numbers now, excluding Parts 3 and 4 of Pass 2, are:

Journals that won’t be scanned further: 14,472.

Journals with 2021 articles: 14,675

2021 articles: 1,233,706.

April 22 (2): Journals removed

I reconsidered when journals no longer in DOAJ should be removed, doing this for Part 3 and Part 4 just now. (I went back to June 2021 for removal dates, but in fact all of the removals took place in 2022.)

In all, 62 journals were removed from Part 3 (xx), leaving 1,044 to be rechecked, and 15 were removed from Part 4 (xm), leaving 659 to be rechecked. These 77 journals–marked “xo”–will not be included in the study, although they may be included in one table in Chapter 2 (Exclusions and Special Cases).

April 30: Part 3 complete

The 1,044 xx journals have been rechecked, with reasonably good success. Although there were 312 more cases than in last year’s scan, the number that couldn’t be resolved only increased from 126 to 147. Of the remainder, 177 were fine when retested (which usually means temporary server problems); 41 were either fine or found on an alternate path but hadn’t published since 2019; 651 were found and counted using alternative routes; 10 were dead/duplicates; three aren’t OA journals (two now require login); and 35 were no longer in DOAJ. This was a “lumpy” pass: 223 xx cases came from Sciendo, 80 arose from the DergiPark move from .gov to .org, and 55 came from SciELO instances where URLs hadn’t been updated.

The current totals for non-problematic journals: 16,374 total; 15,446 with 2021 articles; 1,268,018 2021 articles; 385 with no post-2019 articles; 89 dead/duplicate cases (no articles since 2015). Of all these, 2,162 appear to be new to DOAJ and 14,212 are continuing.

Next step: Part 4 (xm), and a quick recheck on some items. Yes, I’m about 10 days ahead of last year. Cross fingers.

May 3: Part 4(a)

I’ve gone through the 658 xm (malware) journals, with–as expected–modest results: 50 now active, 5 OK but code bi (inactive since 2019), one bx (found through a different url), one xd (dead/duplicate), and one xo (no longer in DOAJ).

At the moment, 16,431 journals are ready for processing, 15,496 of which have 2021 articles; there are 1,269,933 2021 articles.

The remaining 595 xm (malware) journals will have codes compared with those in last year’s study, as will the xx/xm journals previously double-checked: last year’s codes help inform whether journals are included in the full study or held out as exclusions (which appear on a separate page of the eventual Figshare spreadsheet). Then, 601 xm journals that weren’t also xm/xx last year will be checked for the possibility of an alternate route. I’d be pleasantly surprised to find many, but it’s worth the two or three days required.

A couple of notes about the current malware group (excluding 10 that changed from xx to xm in the last phase):

Big clusters by publisher include 37 from Universitas Udayana; 29 from Conselho Nacional de Pesquisa e Pós-graduação em Direito (CONPEDI); 28 from Universitas Negeri Malang; 19 from Diponegoro University; and 14 from Universitas Pendidikan Indonesia.

You may notice something about all but one of those names–and, indeed, breaking down the malware by country shows 403 from Indonesia–just over two-thirds of the total–plus 69 from Brazil and 28 from Ukraine. No other country has more than eight.

May 5: Completion of online scans

I’ve now rechecked xm journals looking for possible alternate URLs and a few other checks, with some success–and gone through the complete dataset getting rid of pure duplicates (either from downloading issues or otherwise).

While these numbers may change very slightly as I do consistency checks in the next day or two, I’d guess such changes will be very small–probably less than 1%. I also have a more nuanced understanding of the malware and problematic issues, and it’s encouraging. I’ll lay that out in a separate post–but if you just want the biggest numbers, the final report is likely to include around 16,726 journals (with another 440 on an exclusions page), of which around 15,643 have 2021 articles, with a total of around 1,275,080 2021 articles. All figures subject to change.

GOA7: First pass complete

April 20th, 2022

I’ve finished scanning the 17,302 journals for Gold Open Access 2016-2021.

At the end of that pass, there are 14,572 journals with 2021 articles recorded, for a total of 1,231,397 2021 articles.

But, as usual, there were a lot of problematic journals–1,106 that were either unavailable or not working properly, and 674 with malware or security-certificate issues. These will all be revisited, as will 445 journals that showed no 2021 articles (but no signs of problems) and 225 where at least one 2021 article appeared but it seemed likely that there should have been more.

Oddly enough, I completed Pass 1 on the same date (April 20) as last year, despite checking just under 2,000 more journals. I credit that to fewer emergencies (so far), consistently good broadband and computer performance, restoring use of a direct Excel-to-browser function that had stopped working, and more consistency in many journal webpages. (I also tweeted every day on progress, mostly as a personal goad. It worked.)

Comparing this year’s Pass 2 to last year’s:

  • Last year, I checked 2,211 journals for possible added articles; this year, I’ll be checking 445 that had no 2021 articles and 225 that seem likely to have more. That really compares to the 946 last year flagged during the pass: I was keeping track of comparable numbers and saw no reason to do another algorithmic pass. (See pp. 230-231 of GOA6 if you want to know what that’s about.) This scan goes rapidly; I’d hope for considerably less than a week.
  • I’ll be deleting problematic journals removed from DOAJ since 1/1/2022 after the remaining pieces rather than before: that should not affect more than 30-40 journals, and since the intent is to be an “end of 2021” snapshot, it seems reasonable.
  • The scan for “xx”–unavailable or unworkable–will involve 1,106 journals, much worse than last year’s 732. Quite a few of these are 404s because DergiPark (Turkey) stopped autoforwarding from its old .gov.tr domain to its new .org domain; a few more are because of an oddity with one SciELO instance that means if you already have browser tabs open for two SciELO journals, it rejects any other attempts. Those can all be fixed, and I hope to clear up several hundred of the xx cases (some clear themselves up–e.g., one university’s server was apparently down on one day). This may be a slow process (the 732 took a week).
  • The scan for “xm” (malware and certificate issues) will involve 674 journals, slightly better than last year’s 781, but still about 674 too high. That process, and additional checking for recalcitrant “xx” cases, may take a while. Last year, I completed the final scan on May 19; I’ll be delighted if I do as well this year. After that come a few days of data normalization and about a month to prepare the book and mount the dataset at figshare.

So, well, no real target date, but if emergencies continue to be few and mild, the data and main book might–might–be ready in June. (The country book, which will be very different this year because it will focus on the “long tail,” journals not published by one of the Big 9 or 10, would be ready a few weeks later.)

I may continue to tweet progress reports (I’m always waltcrawford), probably not every day. And if you or your institution want to encourage the continuation of this series, consider buying one or all of the trade paperbacks at lulu.com. I won’t get rich (they’re priced by rounding production cost up to the next 50 cent mark), but I spend a lot of care on making sense of the data and think the print book is a good way to see what I’ve found. But, of course, there will also be a free PDF of each book at my website and a free dataset at figshare, both CC-BY.

Now, to start Pass 2.

Oh: my prediction for overall article count is “probably around 1.3 million”–that is, somewhere between 1.23 million (no articles added in Pass 2: very unlikely) and a few tens of thousands more.

Incidentally: this scan included 15,055 journals that continued from previous years and 2,247 added to DOAJ in 2021 (most of them not new that year). As always, a few hundred journals disappeared–and all but about 120 were explicitly removed from DOAJ during 2021.

GOA7: Three-quarters progress report

March 26th, 2022

I’ve scanned 13,000 journals so far–just over three-quarters of them all. (There are 4,302 left to do. Stopping at 12,900 would have left 4,402, just over one-quarter.)

At a similar point in last year’s scan (arranged by publisher and journal), there were 11,386 journals. I show 1,790 newly-added journals so far, so that suggests about 176 removed or missing. [Journals change publishers and thus locations in the spreadsheet, so removed/added figures can change either way.]
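The "removed or missing" estimate is simple arithmetic: last year's count at this point, plus newly added journals, minus the number actually scanned. A minimal Python sketch follows; the function name and the variable names are only illustrative, and the total of 17,302 is the sum of the scanned and remaining counts given in this post.

```python
def estimated_removed(last_year_count: int, newly_added: int, scanned: int) -> int:
    """Journals expected at this point (last year's count plus additions)
    minus the journals actually scanned so far."""
    return last_year_count + newly_added - scanned


total_journals = 13_000 + 4_302  # scanned so far plus left to do
removed = estimated_removed(11_386, 1_790, 13_000)
fraction_done = 13_000 / total_journals

print(removed)  # 176
print(f"{fraction_done:.1%}")
```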

The 2021 article count to date is 1,069,316. The last quarter of journals tends to have many fewer articles: there were just under 157,000 2020 articles in the remaining group. So it’s fair to assume that there will be more than 1.2 million 2021 articles, but maybe not a lot more–I’ll stick with 1.3 to 1.4 million as a vague guesstimate.

I’ve been providing daily summaries of journals counted, total to date, total with 2021 articles, and 2021 article count on my Twitter account–not difficult to find! (I’m boring: pretty much walt crawford everywhere…)

I’ve also been providing weekly summaries including counts of problematic or special cases, which so far include 251 inactive (no articles since 2019 or earlier, but at least one since 2015); 15 found at different URLs (there will be a lot more of these in the second pass); 66 dead or duplicate; 379 malware cases; seven that I don’t believe are OA journals; and 856 that were unreachable or unworkable. I would anticipate clearing most of the unreachable/unworkable–and so far, cross fingers, malware cases aren’t as numerous as last year (but still precisely 379 cases too high, since there’s no excuse for any of them). My optimistic target was to reach 12,900 or 13,000 by the end of March, and I’ve done better than that. Tomorrow will be devoted to other things, and there’s still enormous uncertainty about outside factors–but if things continue to go well, I could start the second pass before May.

 

GOA7: First half

February 24th, 2022

I’ve scanned 8,600 journals so far–just less than one-half of them all. (There are 8,702 left to do.)

At a similar point in last year’s scan (arranged by publisher and journal), there were 7,358 journals. I show 1,314 newly-added journals so far, so that suggests about 72 removed or missing. [Journals change publishers and thus locations in the spreadsheet, so removed/added figures can change either way.]

Newly-added journals aren’t mostly brand-new, so the 2020-2017 comparative figures are reasonable:

        2020      2019      2018      2017
GOA7    730,902   576,015   499,897   425,273
GOA6    707,832   566,270   487,810   410,182

The 2021 article count to date is 870,382: A couple of publishers in the second quarter show rapid article growth in 2021. Note that the 2017-2020 counts for GOA7 are all probably low, but not by a lot.

I’ve been providing daily summaries of journals counted, total to date, total with 2021 articles, and 2021 article count on my Twitter account–not difficult to find! (I’m boring: pretty much walt crawford everywhere…)

I’ve also been providing weekly summaries including counts of problematic or special cases, which so far include 161 inactive journals (no articles since 2019); 48 dead/duplicate titles; 243 journals with malware; 4 items I don’t consider to be OA journals; and 402 that couldn’t be reached for one reason or another. (Most of the latter count will probably be rectified in the follow-up pass, including the journals left behind when DergiPark in Turkey changed domains and stopped forwarding.)

I’m still ahead of schedule, and of course hope to continue that, but things can happen. There will be an immediate slowdown over the next few days as I deal with other stuff–arranging papers, gathering tax receipts, weeding (real weeding), etc. I’ll probably do updates every two or three days for a week or so, as I’ll be doing less journal scanning each day.

Still no anticipated finish date (barring major family health issues, early May is possible for the first pass) or total article count (but I’d guess somewhere between 1.3 and 1.4 million).

GOA7: Progress Report, First Quarter

January 29th, 2022

I’ve scanned 4,320 journals so far–just less than one-quarter of them all. (There are 12,983 left to do.)

At a similar point in last year’s scan (arranged by publisher and journal), there were 3,752 journals. I show 699 newly-added journals so far, so that suggests about 131 removed or missing.

Newly-added journals aren’t mostly brand-new, so the 2020-2017 comparative figures are reasonable:

        2020      2019      2018      2017
GOA7    289,125   238,532   216,973   197,314
GOA6    274,902   229,221   212,892   186,547

The 2021 article count to date is 312,150.

I’ve been providing daily summaries of journals counted, total to date, total with 2021 articles, and 2021 article count on my Twitter account–not difficult to find! (I’m boring: pretty much walt crawford everywhere…)

I’ve also been providing weekly summaries including counts of problematic or special cases, which so far include 70 inactive journals (no articles since 2019); 20 dead/duplicate titles; 130 journals with malware; 3 items I don’t consider to be OA journals; and 176 that couldn’t be reached for one reason or another. (Most of the latter count will probably be rectified in the follow-up pass, including the journals left behind when DergiPark in Turkey changed domains and stopped forwarding.)

Yes, I’m slightly ahead of “schedule.” It’s possible that I could finish the first scan around the end of April, but it could take much longer both because of varying degrees of difficulty and because of real-life interruptions, the scheduled (taxes, for example) and the unscheduled (family health situations). I have found some slightly faster ways to do some things, which is encouraging.

GOA6: Usage through 1/5/22

January 6th, 2022

As of January 5, 2022, as far as I can tell:

GOA6:

  • Overall report: 1,303 PDF copies (no books other than my copy)
  • Countries: 134 PDF (no books)
  • Dataset: 358 views, 62 downloads

GOA5:

  • Overall report: 910 copies (two books)
  • Countries: 211 copies (no books)
  • Dataset: 946 views, 148 downloads

GOA7: a few notes

January 4th, 2022

The first note is that Walt at Random is working again…

Second, if you want to track daily progress, follow me on Twitter. Maybe not every day, but most days. And I probably won’t have loads of other tweets while the scan is going on!

Third, I’ll get back to the stats for GOA6 in a few days.

Finally, a note on why the public dataset and primary book might not happen until September (almost-worst case scenario):

Counting backwards, it takes about a month to massage the data and prepare the book once all data gathering is complete. (It’s that fast largely because the templates are already ready to go–for most tables and graphs in most chapters, preparing the tables and graphs is literally loading the appropriate data into the first page of a spreadsheet and doing one Refresh on each other page. You gotta love pivot tables, specifically the fact that a named source can be everything in a column, and that’s dynamic.) So that takes me to as late as late August to finish data gathering.

Rechecking–going back to problematic journals and ones where it seems likely that more 2021 issues/articles will have been posted between 1/1/22 and 4/1/22–is likely to take up to a month. That takes me to late July.

I’m hoping that the date will move up to mid-June or even mid-May–but life intervenes. (To reach mid-June I have to average 100 journals a day, every day. To reach mid-May I need to average 1,000 journals a week. The former is possible. The latter is increasingly unlikely.)

Fact is, I’m getting older and probably slowing down. And there are more personal and family health issues over time, issues that require time. And, well, you can’t keep doing the scans all day without lots of breaks, even when there are no other issues. At least I can’t.

I thought for a while before proposing to do this seventh version. But decided to do it.

Miracles can happen–some health issues could subside, malware and other problematic cases could subside, and many more journals could fall into the dead-easy category. Based on the first 400+ journals, I’m not expecting loads of miracles.

OK, so this has also been a break… Back to the scan and afternoon coffee break.

GOA7: A note on currency fluctuation

December 27th, 2021

In this final week of December, I’ve already done the preliminary DOAJ download and assigned subjects (and normalized long country names) for 2,219 added journals (and 548 gone). The subject process went faster this year, such that all normalization was done by Christmas. So I did an interim update, adding 30 more and removing six. A final update will happen later Friday afternoon (after midnight GMT on 1/1/2022), probably adding another 10-20 and removing a few. Then the slog will begin, probably January 2. (Given time needed for family and personal health and other issues, and the considerably larger dataset, this process is likely to continue well into summer.)

In the meantime, I wondered about a probably-minor issue: to what extent might apparent fee changes be affected or masked by changes in currency strengths (since I convert all fees to USD)?

I prepared the conversion spreadsheet for GOA7 on December 24. As with last year, in most cases–32 currencies–the conversion rate was the 2021 annual average (from OFX). In seven cases, where OFX could not provide figures, I used the 12/24 daily rate from exchange-rates.org. The seven daily rates account for 361 journals; the 32 yearly averages account for 2,608.

So to what extent do fluctuations between 2020 and 2021 conversion rates matter?

  • Ten currencies weakened by 5% or more from 2020 to 2021 (one by just over 10%), representing a total of 844 journals–but that’s predominantly the 696 in GBP (pound sterling), since the pound did weaken significantly (6.78%). (The Euro also weakened against the dollar, but by only 3.74%–and journals designated in Euros and only Euros account for another 628.)
  • Seven currencies strengthened by more than 5% (four by more than 10%), but those seven only account for 50 journals.

Conclusion: overall, currency fluctuation is a relatively minor factor in fee fluctuation.
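To make the comparison concrete, here is a rough Python sketch of one way the year-over-year check could be done: compute the percentage change in each currency's USD conversion rate, bucket it against the 5% threshold, and count the journals in each bucket. The currency codes, rates, and journal counts below are hypothetical placeholders, not the actual OFX figures or the actual spreadsheet, and the function names are made up for illustration.

```python
def pct_change(rate_2020: float, rate_2021: float) -> float:
    """Percentage change in a currency's USD value from 2020 to 2021."""
    return (rate_2021 - rate_2020) / rate_2020 * 100


def bucket(change: float, threshold: float = 5.0) -> str:
    """Classify a change: weakened by threshold or more, strengthened by more than threshold."""
    if change <= -threshold:
        return "weakened"
    if change > threshold:
        return "strengthened"
    return "stable"


# Hypothetical data: code -> (2020 USD rate, 2021 USD rate, journals charging in that currency)
currencies = {
    "AAA": (1.00, 0.93, 40),
    "BBB": (0.50, 0.56, 5),
    "CCC": (2.00, 2.04, 55),
}

# Count journals affected in each bucket.
journals_by_bucket = {"weakened": 0, "strengthened": 0, "stable": 0}
for code, (r2020, r2021, journals) in currencies.items():
    journals_by_bucket[bucket(pct_change(r2020, r2021))] += journals

print(journals_by_bucket)
```

With real rates and journal counts loaded in place of the placeholders, the bucket totals would correspond to the journal counts in the two bullets above.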

Now, off to start writing the Appendix (the part on preliminary steps) and do some non-GOA stuff for the rest of the week.