I was working on the draft of a future Online column (based, as they mostly are, on edited & updated material from previous Cites & Insights) on the uniqueness of everyday language: taking the test I ran two years ago, running a new, slightly smaller test, and updating the commentary.
One piece of commentary had to do with the likelihood that people would use the same actual words to talk about the same thing–as someone commented, “after all, how many ways are there to discuss Hamlet’s ambivalence?” Two years ago, when I checked Google for the two key words, I came up with “around 57,000,” and didn’t see that any of the first 100 seemed to be the same text.
This time, I came up with more than ten times as many results (which I regard as having more to do with Google’s increasingly silly initial result numbers than anything else–yes, the database continues to grow, but by >10x in 18 months?)–and, in looking through the first 100 results, I found two pairs that sounded an awful lot alike.
In one case, a Yahoo! Answer was almost identical to a paragraph in a Wikipedia article…but split into three paragraphs and with one or two word changes. Checking dates, it was pretty easy to conclude that the Yahoo! Answer was, shall we say, an innocent failure to attribute text to Wikipedia (text which was considerably older there). (Note: I’m not accusing Wikipedia of plagiarism–the text was pretty clearly copied from Wikipedia, not to it.)
The other was odder–a fairly long commentary on a scene from the play. One copy was from a signed, nicely formatted set of discussions on Hamlet’s scenes (or, rather, on scenes from Act One, with the full set available as an inexpensive ebook). The other copy was from an ad-supported multiple-blog site, with no apparent authorship or signature and with a bunch of HTML-like code appearing at the top of the “post.” I’m pretty sure I can guess which was copied from which–and in that case, since the original doesn’t include a copyright waiver, there’s more at stake than failure to attribute.
All of which is somewhat tangential to the original story. That story continues to be that everyday language is a lot less “common” than we may think–that, by and large, sentences at least 10 words long are likely to be unique even within a corpus as large as Google’s database. (The first test had relied mostly on my own writing and on first sentences of paragraphs; the new test uses the second sentence of the second paragraph of a post from each of 150 different blogs. Very similar results…) Basically, one identical sentence (that is, a sentence in one work that’s also found in another) is absolutely, positively insufficient grounds to assert plagiarism, but it may be enough to suggest that further checking is warranted.
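For anyone who wants to try a similar sampling, here is a rough Python sketch of the selection step: pull the second sentence of the second paragraph from a post, check that it has at least 10 words, and wrap it in quotes for an exact-phrase search. This is not the procedure I actually used (I did it by hand); the function names are made up for illustration, the sentence splitting is deliberately naive, and it assumes paragraphs are separated by blank lines.

```python
import re

MIN_WORDS = 10  # threshold from the post: sentences "at least 10 words long"


def second_sentence_of_second_paragraph(post_text):
    """Return the second sentence of the second paragraph, or None if absent.

    Assumes paragraphs are separated by blank lines (an assumption,
    not something every blog export guarantees).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", post_text) if p.strip()]
    if len(paragraphs) < 2:
        return None
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "Dr." will fool it.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraphs[1]) if s.strip()]
    if len(sentences) < 2:
        return None
    return sentences[1]


def is_testable(sentence, min_words=MIN_WORDS):
    """True if the sentence exists and has at least min_words words."""
    return sentence is not None and len(sentence.split()) >= min_words


def exact_phrase_query(sentence):
    """Wrap the sentence in double quotes for an exact-phrase search.

    Embedded straight double quotes are dropped so the phrase stays intact.
    """
    return '"' + sentence.replace('"', "") + '"'


if __name__ == "__main__":
    sample_post = (
        "First paragraph. It has two sentences.\n\n"
        "Opening sentence here. This second sentence has more than ten "
        "words in it, I think. Third."
    )
    sentence = second_sentence_of_second_paragraph(sample_post)
    if is_testable(sentence):
        print(exact_phrase_query(sentence))
```

The resulting quoted string is what you would paste into the search box; deciding whether any hits are the same text, a quotation, or something worth a closer look is still a job for a human reader.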