Open Data, Crowdsourcing, Independent Research and Misgivings

Posted in Cites & Insights, open access on September 1st, 2014

or Why Some Spreadsheets Probably Won’t Become Public

If you think that title is a mouthful, here’s the real title:

Why I’m exceedingly unlikely to make the spreadsheet(s) for my OA journals investigations public, and why I believe it’s reasonable not to do so.

For those of you on Friendfeed, there was a discussion on specifically this issue beginning August 26, 2014. The discussion was inconclusive (not surprisingly, partly because I was being a stubborn old goat), and I continued to think about the issues…even as I continued to build the new spreadsheet(s) for the project I hope to publish in the November and December 2014 Cites & Insights, if all goes well, cross several fingers and toes.

Consider this a public rethinking. Comments are most definitely open for this post (if I didn’t check the box, let me know and I’ll fix it), or you’re welcome to send me email, start a new thread on one of the social media I frequent (for this topic, Friendfeed or the OA community within Google+ seem most plausible), whatever…

Starting point: open data is generally a good idea

There may be some legitimate arguments against open datasets in general, but I’m not planning to make them here. And as you know (I suspect), I’m generally a supporter of open access; otherwise, I wouldn’t be spending hundreds of unpaid hours doing these investigations and writing them up.

All else being equal, I think I’d probably make the spreadsheet(s) available. I’ve done that in the past (the liblog projects, at least some of them).

But all else is rarely equal.

For example:

  • If a medical researcher released the dataset for a clinical trial in a manner that made it possible to determine the identities of the patients, even indirectly, that would be at best a bad thing and more likely actionable malpractice. Such datasets must be thoroughly scrubbed of identifying data before being released.

But of course, the spreadsheets behind Journals, “Journals” and Wannabes: Investigating The List have nothing to do with clinical trials; the explicitly named rows are journals, not people.

That will also be true of the larger spreadsheets in The Current Project.

How much larger? The primary worksheets in the previous project have, respectively, 9,219 [Beall’s Lists] and 1,531 [OASPA] data rows. The new spreadsheets will have somewhere around 6,779 [the subset of Beall’s Lists that was worth rechecking, but not including MDPI journals], exactly 1,378 [the subset of OASPA journals I rechecked, including MDPI journals], and probably slightly fewer than 3,386 [the new “control group,” consisting of non-medicine/non-biology/non-biomed journals in DOAJ that have enough English in the interface for me to analyze them and that aren’t in one of the other sets] rows—a total of somewhere around 11,543. But I’m checking them more deeply; it feels like a much bigger project.

So what’s the problem?

The spreadsheets I’ve built or am building are designed to allow me to look at patterns and counts.

They are not designed for “naming and shaming,” calling out specific journals in any way.

Yes, I did point out a few specific publishers in the July article, but only by quoting portions of their home pages. It was mostly cheap humor. I don’t plan to do it in the new project—especially since most of the journals in the new control group are from institutions with only one or a handful of journals; I think there are some 2,200 publisher names for 3,386 journals.

This is an important point: The July study did not name individual journals and say “stay away from this one, but this one’s OK.” Neither will the November/December study. That’s not something I’m interested in doing on a journal-by-journal or publisher-by-publisher basis. I lack the omniscience and universal subject expertise to even begin to consider such a task. (I question that anybody has such omniscience and expertise; I know that I don’t.) I offered possible approaches to drawing your own judgment, but that’s about it.

Nor do I much want to be the subject of “reanalysis” with regard to the grades I assigned. (I don’t want angry publishers emailing me saying “You gave us a C! We’re going to sue you!” either—such suits may be idiotic, but I don’t need the tsuris.)

Releasing the full spreadsheets would be doing something I explicitly do not want to do: spreading a new set of journal grades. There is no Crawford’s List, and there won’t be one.

For that matter, I’m not sure I much want to see my numbers revalidated: for both projects, I use approximation in some cases, on the basis that approximation will yield good results for the kind of analysis I’m doing. (I’ll explain most of the approximation and shortcuts when I write the articles; I try to be as transparent as possible about methodology.)

For those reasons and others, I would not be willing to release the raw spreadsheets.

Could you randomize or redact the spreadsheets to eliminate these problems?

Well, yes, I could—but (a) that’s more unpaid labor and, more important, (b) I’m not sure the results would be worth much.

Here, for example, are the data label rows and one (modified) data row from part of the current project:

Pub  Journal      2014  2013  2012  2011  Start  Peak  Sum  Gr  GrF  APC   Note
pos  POS Physics  15    34    14    1     2011   34    64   B        $600

The columns, respectively, show: the publisher code (in this case, Pacific Open Science, a nonexistent—I think—publisher I may use to offer hypothetical examples in the discussion. Their slogan: If an article is in our journals, it’s a POS!); the journal name; the number of articles in January-June 2014, all of 2013, all of 2012, all of 2011; the starting year; the peak annual articles; the sum of the four years; the letter grade; a new “GrF”—the letter grade that journals with fewer than 20 articles per year would get if they had more; the article processing charge for a 10-page article; and any note I feel is needed. (If this was the new DOAJ control group, there would be another column, because hyperlinks were stored separately in DOAJ’s spreadsheet; for the one I chose, “POS Physics” is itself a hyperlink—but, of course, there’s no such journal. Don’t try to guess—the actual journal’s not remotely related to physics.)

I’ll probably add a column or two during analysis—e.g., the maximum annual APCs a given journal could have collected, in this case 34×600 or $20,400, and for the new DOAJ group the subject entry to do some further breakdowns.
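The arithmetic behind those derived columns can be sketched in a few lines of Python. This is only an illustration using the hypothetical POS Physics row; the field names and the `derived_columns` helper are assumptions made for the example, not anything taken from the actual spreadsheets.

```python
# Hypothetical row based on the (nonexistent) POS Physics example.
# Field names are illustrative assumptions, not the real column layout.
row = {"2014": 15, "2013": 34, "2012": 14, "2011": 1, "apc": 600}

def derived_columns(row):
    """Return the peak annual count, the four-period sum, and the
    maximum APC revenue the journal could have collected in one year."""
    counts = [row[year] for year in ("2014", "2013", "2012", "2011")]
    peak = max(counts)
    return {
        "peak": peak,
        "sum": sum(counts),
        "max_annual_apc": peak * row["apc"],
    }

print(derived_columns(row))
# → {'peak': 34, 'sum': 64, 'max_annual_apc': 20400}
```

Note that the 2014 figure covers only January through June, so the peak here comes from a full year (2013).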

I could certainly randomize/redact this spreadsheet in such a way that it could be fully re-analyzed—that is, sort the rows on some combination that yields a semi-random output, delete the Pub column, and change the Journal column to a serial number equal to the row. Recipients would have all the data—but not the journal or publisher names. That wouldn’t even take very long (I’d guess ten minutes on a bad day).
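For anyone curious what that redaction amounts to, here is a minimal sketch in Python. It assumes the sheet has been read into a list of dictionaries keyed by column label, and it swaps in a plain random shuffle for the "sort on some combination" step. The `redact` function and all the sample rows except POS Physics are made up for illustration.

```python
import random

def redact(rows, seed=None):
    """Shuffle the rows, drop the Pub column, and replace each Journal
    name with a serial number equal to its new row position."""
    rng = random.Random(seed)
    shuffled = [dict(r) for r in rows]   # copy so the originals stay intact
    rng.shuffle(shuffled)                # assumption: a shuffle stands in for the semi-random sort
    for serial, r in enumerate(shuffled, start=1):
        r.pop("Pub", None)               # remove the publisher code
        r["Journal"] = serial            # journal name becomes the row number
    return shuffled

# Hypothetical sample data; only the first row echoes the example in the post.
sample = [
    {"Pub": "pos", "Journal": "POS Physics", "Sum": 64, "Gr": "B", "APC": 600},
    {"Pub": "aaa", "Journal": "Example Journal A", "Sum": 12, "Gr": "C", "APC": 0},
    {"Pub": "bbb", "Journal": "Example Journal B", "Sum": 230, "Gr": "A", "APC": 150},
]

for r in redact(sample, seed=1):
    print(r)
```

Every count, grade and APC survives, so the totals and patterns could still be re-analyzed. Nothing in the output ties a row back to a journal or publisher name.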

Would anybody actually want a spreadsheet like that? Really?

Alternatively, I could delete the Gr and GrF columns and leave the others—but the fact is, people will arrive at slightly different article counts in some significant percentage of cases, depending on how they define “article” and whether they take shortcuts. I don’t believe most journals would be off by more than a few percentage points (and it’s mostly an issue for journals with lots of articles), but that would still be troublesome.

Or, of course, I could delete all the columns except the first two—but in the case of DOAJ, anyone wanting to do that research can download the full spreadsheet directly. If I was adding any value at all, it would be in expanding Beall’s publisher entries.

What am I missing, and do you have great counter-arguments?

As you’ll see in the Friendfeed discussion, I got a little panicky about some potential Moral Imperative to release these spreadsheets—panicky enough that I pondered shutting down the new project, even though I was already about two-thirds of the way through. If I had had these requests when I began the project or was, say, less than 2,000 rows into it, I might have just shut it down to avoid the issues.

At this point, I believe I’m justified in not wanting to release the spreadsheets. I will not do so without some level of randomizing or redaction, and I don’t believe that redacted spreadsheets would be useful to anybody else.

But there are the questions above. Responses explicitly invited.

[Caveat: I wrote this in the Blog Post portion of Word, but it’s barely been edited at all. It’s probably very rough. A slightly revised version may—or may not—appear in the October 2014 Cites & Insights. If there is an October 2014 Cites & Insights.]

Now, back to the spreadsheets and looking at journals, ten at a time…


Added September 3, 2014:

Two people have asked–in different ways–whether I’d be willing to release a spreadsheet including only the journal names (and publishers) and, possibly, URLs.

Easy answer: Yes, if anybody thought it was worthwhile.

There are three possible sheets:

  • The Beall list, with publishers and the publisher codes I assigned on one page, the journals (with “xxind” as a publisher code for Beall’s separate journal list) and publisher codes on another page. All (I believe) publisher names and most but not all journal names have hyperlinks. (Some publishers didn’t have hyperlinked lists I could figure out how to download.) That one might be mildly useful as an expansion of Beall’s publisher list. (This would be the original Beall list, including MDPI, not the new one I’m using for the new study.)
  • The OASPA list, similarly structured and same comments, lacking MDPI (which is in the new one I’m using for the new study).
  • The new “partial DOAJ” list–DOAJ entries that aren’t in medicine, biology or biomed, that have English as a language code and that aren’t–if I got it right–in the other lists. I don’t honestly see how this could save anybody any time, since all it is is a portion of what’s downloadable directly from DOAJ, albeit in May 2014 rather than now.

If someone wants one of these, let me know–waltcrawford@gmail.com. I may not respond immediately, but I’ll either return the sheet you want as an email attachment or, if there’s more than one request, possibly load it at waltcrawford.name or in Dropbox and send you a link.

 

 

Graphic honesty

Posted in Stuff on August 27th, 2014


Walt Crawford, August 20, 2014, Morgan Territory Regional Preserve

That’s me. By now, some of you may have seen smaller versions of that picture in various social media (Friendfeed, Facebook, Google+, Twitter), or the same version on my personal web page.

Technically, “Morgan Territory Regional Preserve” may be wrong–the picture may have been taken in the Los Vaqueros Watershed. We were hiking on the Whipsnake Trail, which is in both areas. It’s where the hiking group I usually spend Wednesday mornings with was a week ago.

When my wife saw the picture (one among several dozen posted as a “report” on the hike) she said it was a good one. I requested a copy from the photographer (Bill Leach, another hiker) and have now replaced my older picture with this one wherever I’m aware of an icon, avatar or other picture appearing. (I’m sure I’ve missed one or two and will get to them when I see them.)

The previous picture was also from a hike, oddly enough also in Morgan Territory, but from two or three years ago. It replaced a considerably older picture.

I like using a current picture because it feels honest. (That this one is a really good picture doesn’t hurt.) It’s how I really look at very nearly 69 years old. I suppose I should have a snazzy younger picture ready for an eventual obituary (and actually we may have the perfect picture–oddly enough, not all that old), but I hope that’s a long ways away. I’ve seen enough authors and others who somehow never age in their publicity pictures; I’m not them, although I understand the urge.

Why am I posting this on a Wednesday morning when I should be on a hike? I just didn’t feel like it today; I probably skip one hike out of every four or five, either because of location (there’s one area I just don’t care for) or other reasons. (For those who know the East Bay, today’s hike is also partly in Morgan Territory, but in a very different part of it–it’s a Finley Road hike, partly in Mount Diablo State Park, partly in Morgan Territory, with a little too much walking to get to and from the trailhead because there’s no parking anywhere nearby.)

One other note: Yes, that is a cheap floppy gardening hat rather than a snazzy Panama hat or other hiking hat. Why? Because I have a fat head, and this gardening hat is big enough to fit it. Most hats don’t.

No deeper meaning here.

 

Correction in Cites & Insights 14:9

Posted in Cites & Insights on August 26th, 2014

Thanks to the eagle eye of an early reader, I was alerted to an error on page 15, column 2, of Cites & Insights 14:9. While there are almost certainly grammatical and spelling errors in every issue, this one was a math error that changed the significance of the paragraph–and since it was caught so early, I did something I normally never do: I fixed the paragraph, added a “[Corrected 8/26/14]” flag, and reissued the publication.

If you’ve already read it or downloaded it and don’t wish to do so again, here’s the change:

In the paragraph beginning “Most of the university libraries…” (in the subsection “Elsevier journals–some facts”), I managed to reverse the British pounds to dollars calculation. Doing it properly means changing the last two sentences in the paragraph.

What was there originally:

Notably, assuming that a pound is worth $1.70, JISC struck a much harder bargain than American public universities in general: the range is from $7.36 to $49.27, with a mean of $18.45, less than half the mean for U.S. institutions. Of course, the package may very well be different.

I was somehow dividing pounds by $1.70 rather than multiplying them. Fixing that yields this text:

Converting to dollars, the range is $21.27 to $142.39 with a mean of $53.37—higher than the U.S. figures except at the low end. [Corrected 8/26/14.]
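For the arithmetically inclined, the fix can be checked with a short Python sketch. It assumes only the $1.70 rate stated in the original paragraph; the function names are made up for illustration. Since the erroneous figures were pounds divided by 1.70, multiplying them by 1.70 recovers the pound amounts, and multiplying again gives the proper dollar figures.

```python
RATE = 1.70  # dollars per pound, the rate assumed in the article

def to_dollars(pounds):
    """Correct conversion: multiply pounds by the dollar value of a pound."""
    return pounds * RATE

def repair(wrong_dollars):
    """Undo the mistaken division to recover pounds, then convert properly."""
    pounds = wrong_dollars * RATE
    return to_dollars(pounds)

# Range endpoints from the original (erroneous) paragraph.
for wrong in (7.36, 49.27):
    print(wrong, "->", round(repair(wrong), 2))
# → 7.36 -> 21.27
# → 49.27 -> 142.39
```

The same shortcut applied to the rounded $18.45 mean lands a few cents away from the corrected $53.37, so only the range endpoints are checked here.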

My apologies for the error.

Cites & Insights 14:9 (September 2014) available

Posted in Cites & Insights on August 25th, 2014

Cites & Insights 14:9 (September 2014) is now available for downloading at http://citesandinsights.info/civ14i9.pdf

This two-column print-oriented version is 18 pages.

For those reading C&I online or on an ereader, the single-column 35-page 6×9″ edition is available at http://citesandinsights.info/civ14i9on.pdf

This issue includes:

The Front: Toward 15 and 200: The Report    pp. 1-2

I promised a list of supporters and sponsors and an overall report on the outcome of the spring 2014 fundraising campaign for C&I. Here it is. Oh, there’s also “A Word to the Easily Confused” about the definition of “journal,” the change in the masthead to “periodical” because some folks are easily confused, and the need for consistency when choosing to regard gray literature as worthless.

Intersections: Some Notes on Elsevier  pp. 2-16

A half-dozen subtopics (actually five subtopics and some miscellanea) involving Elsevier that haven’t been covered recently elsewhere in C&I.

The Back  pp. 16-18

Four mini-essays.

 

NOTE: One paragraph on page 15 of this issue includes erroneous (reversed) pounds-to-dollars calculations. Those have been fixed and the issue has been replaced. The net change: JISC did *not* apparently strike a much harder bargain with Elsevier; the UK prices are higher except at the low end, where they’re about the same.

More on the damage done

Posted in Libraries on August 19th, 2014

I’d like to call your attention to Wayne Bivens-Tatum’s latest “Peer to Peer Review” at Library Journal.

Go read it. Follow his advice.

You may also find it worthwhile to add to my LTR report–which is readily available for direct purchase–by looking at some other academic library patterns in Beyond the Damage: Circulation, Coverage and Staffing.

(I’d missed the “bilge” comment on WBT’s column on my 2013 book. But I do consider the source. Maybe it’s a form of praise…although I guess that source now has to condemn ALA for publishing further bilge!)

 

Clarifications

Posted in Cites & Insights on August 14th, 2014

Body of post deleted on the grounds of pointless semi-blind item and why bother?

Another meaningless musical post

Posted in Media on August 9th, 2014

Not sure why this stuck with me, but it did: a remarkable six-word lyric, to wit:

My heart cries out in desperation

which immediately precedes the chorus for the song (Ronnie Milsap, “Don’t Take It Tonight” if you’re too young to know it).

As far as I can tell, Milsap originated the phrase; Bing and Google show up other uses (mostly religion-related), but I’m guessing those authors copped it from Milsap, quite possibly without even realizing it.

In the original it is part of a love song–naturally, a lost love song.

Which brings me to the companion item: a remarkable little disquisition on happiness and love songs. This time the author is Harry Nilsson and it’s a bit long to quote without possibly being a copyright violation. So, instead, I’ll point you to the YouTube video:

I think the song’s hilarious in general (if you don’t know Nilsson, this is not, shall we say, his usual singing voice or accent). But the disquisition in particular comes as the spoken interlude in an otherwise-sung song, beginning right around 1:27 and running to 2:16.

For some of us, “…if everyone was happy…” is enough to trigger the whole sequence.

Have a nice weekend. (If you’re wondering, still happily married after 36.5 years. Tom Paxton wrote great lost-love/losing-love songs that didn’t refer to him either.)

Natureally, I’m delighted

Posted in Cites & Insights, open access on August 6th, 2014

My name appeared in a Nature news article today (August 6, 2014). Specifically:

The DOAJ, which receives around 600,000 page views a month, according to Bjørnshauge, is already supposed to be filtered for quality. But a study by Walt Crawford, a retired library systems analyst in Livermore, California, last month (see go.nature.com/z524co) found that the DOAJ currently includes some 900 titles that are mentioned in a blacklist of 9,200 potential predatory journals compiled by librarian Jeffrey Beall at the University of Colorado Denver (see Nature 495, 433–435; 2013).

and, later in the piece:

Bjørnshauge says that a small cohort of some 30 voluntary associate editors — mainly librarians and PhD students — will check the information submitted in reapplications with the publishers, and there will be a second layer of checks from managing editors. He also finds it “extremely questionable to run blacklists of open-access publishers”, as Beall has done. (Crawford’s study found that Beall’s apparently voluminous list includes many journals that are empty, dormant or publish fewer than 20 articles each year, suggesting that the problem is not as bad as Beall says.)

Naturally (or Natureally), I’m delighted to have my name show up, and a C&I issue linked to, in Nature. (It didn’t come as a complete surprise: the journalist sent me email asking about my affiliation–none–and, later, where I live.)

I’m not quite as delighted with the slant of that first paragraph (quite apart from the fact that Beall’s lists do not list some 9,200 “potential predatory journals”; they include publishers that publish or “publish” that number of journal names). Namely, I think the story is not that 900 “potentially predatory” journals appear in DOAJ with the loose listing criteria that site formerly used. I think the story is that more than 90% of the journals in DOAJ are not reflected in Beall’s list, given his seeming zeal to target OA journals.

But, of course, it’s the journalist’s story, not mine, and I do not feel I was quoted incorrectly or unfairly. (Incidentally, I don’t have nits to pick with the second paragraph.)

I agree with Bjørnshauge that a blacklist is itself questionable.

Do I believe the much improved DOAJ will constitute a real whitelist? I’m not sure; I think it will be a great starting point. If a journal’s in the new DOAJ, and especially has the DOAJplus listing, it’s fair to assume that it’s probably a reasonably good place to be. (But then, I’m no more an expert in what journals are Good or Bad than Beall is.)

Anyway: thanks, Richard Van Noorden, for mentioning me. I hope the mention leads more people to read more about questionable journals than just Beall’s list. I strongly believe that the vast majority of Gold OA journals are as reputable as the vast majority of subscription journals, and I believe I’ve demonstrated that there aren’t any 9,200 “predatory” journals out there that are actual journals researchers with actual brains and a modicum of common sense would ever submit articles to.

A few readers may know that I’ve embarked on a related but even more ambitious (or idiotic) project, having to do with volume of articles and adding a new and very different control group. Dunno when (if?) I’ll finish the huge amount of desk work involved and produce some results. I do believe that, among other things, the results may shed some light on the apparent controversy over how prevalent APCs are among Gold OA journals… (And, incidentally, more financial support for C&I wouldn’t hurt this process.)

 

Cites & Insights 14:8 (August 2014) available

Posted in Cites & Insights on July 15th, 2014

Cites & Insights 14:8 (August 2014) is now available for downloading at http://citesandinsights.info/civ14i8.pdf

The two-column print-oriented issue is 32 pages long. A single-column 6×9″ version designed for online/tablet reading is also available, at http://citesandinsights.info/civ14i8on.pdf   (The single-column version is 61 pages long.)

This issue includes the following:

The Front: Once More with [Big] Dealing   pp. 1-2

If you read the June 2014 issue, you may be aware that “Big-Deal Serial Purchasing: Tracking the Damage” wasn’t available when I thought it would be.

It’s available now; this brief essay offers the link to the ALA Store page for the Library Technology Reports issue and notes the complementary book for those academic librarians with deeper interests.

I believe every academic library should pay attention to this issue of LTR. If your library subscribes, it should be available now (electronically) or in a few days (in print form). If it doesn’t, you should buy the issue as a separate. Some of you really would find Beyond the Damage: Circulation, Coverage and Staffing useful as well.

Words: Doing It Yourself  pp. 2-18

Notes on self-publishing and whether or not it makes sense for you (or for your library to assist with).

Intersections: Access and Ethics 3  pp. 18-32

A range of commentaries having to do with open access and ethics over the past 18 months or so–and a couple of brief followups on previous essays. (You may notice that one Very Large Journal Publisher doesn’t show up much in this essay. Its time will come.)

What’s not here: the list of C&I supporters and sponsors. I’ll add the three names (yes, three) in a later issue.

The Final Economist

Posted in Media on July 10th, 2014

It arrived on Monday–two days later than the cover date, but that happens sometimes.

It’s sitting in the special throne room plexiglass stand used to hold magazines being read in the throne room.

For the last year, it’s been the only magazine there–because it takes more than a week of throne room visits to get through an issue.

I never actually paid for The Economist; it was a Magazines-for-Miles deal using airline miles from one of several airlines I never plan to use again. Even at the absurd $0.02/mile exchange rate (which most people now think grossly exaggerates the worth of airline miles), the “price” was nowhere near $160, the one-year subscription price; I think it was around $60.

I’m one of those readers: I read most magazines cover to cover, and we subscribe to a lot of magazines. (Including ones that come with various other arrangements–e.g., VIA, On Investing, AARP The Magazine, Nature Conservancy, World Wildlife, and now the new ACLU magazine–it’s something over two dozen.)

So next week I’ll go back to having a mix of magazines in the throne room stand–Fast Company (well suited to the location), some of the infrequent “comes because you do something” magazines, maybe Fortune if I’m ahead on other things.

I decided not to renew some months ago–quite apart from the $160/year, which is more than we spend on any four magazines, much less one.

A few of the reasons why:

What I Won’t Miss

The strained British/slang/invented language the “newspaper” uses.

The feeling that the only difference between “leaders” (editorials) and other articles is that the leaders are explicitly slanted.

The constant slagging of the U.S. and especially Obama.

Added 7/11: I especially won’t miss the frequent admonitions for the U.S. to get into another shooting war.

The special definition of “liberal” used when business or markets are involved.

The sheer volume of it all.

What I Will Miss A Little

Being better informed (to the extent that you can filter out the slant) about a range of nations and economic issues.

Some of the special sections.

I might say “The World in 2014”–but I never received that special issue, and by the time I realized I should have received it, it was far too late.

What I Will Miss The Most

I’ll miss this enough that I’ll probably start extending my library visits so I can catch up with recent issues (I’m assuming they keep at least four back; if not, I’ll have to start going more often).

The final page, especially when there’s no obvious candidate for the obituary of the week.

I find the final page superb. I plan to keep reading it.

[By the way, in case any silly person thinks the only reason I’m dropping The Economist is the price and thinking of giving it to me: Please don’t. Contribute a third of the cost, or a little less, say $50, to Cites & Insights.]

In some ways, I’ve liked having a weekly magazine. Time is such a shadow of its former self that I’d find it sad to take (I read it for years, back when there was some substance to it). I might look at The Week or, less probably, Bloomberg BusinessWeek. Most likely, I’ll get used to not having a weekly–after all, I do still read the daily, even if via Kindle Fire 8.9.

