Ego and similarity: A silly little post

I was doing my quarterly trio of “ego searches” just for fun (Yahoo! shows the largest ‘result count’ this time around for “Walt Crawford” and “Cites & Insights,” Google for “Walt at Random”), and decided to explore the first page or two of Google’s absurdly large “Walt at Random” result set (40,700).

Very informative. It showed me 89 items. Everything else–presumably, more than 40,600–was “very similar” or whatever the term is. (I didn’t redo the search.)

[Yahoo! finds 346 out of its claimed 27,300. That’s more plausible, although a lot of those links make no sense at all. The ways of spam sites are passing strange…]

As I finish this coffee break, one word of advice to the spammers who don’t actually read this stuff anyway:

Telling me how wonderful my blog is and/or what a great grade you’re going to get for finding this blog on (whatever, usually a topic I’ve never commented on, and certainly not in the post it’s attempting to comment on) and/or “asking for help” in setting up your blog doesn’t cause me to turn gushing idiot and approve the post. Those posts still get reported as spam based on the domain name and links. As with most spammers, you’re wasting my time and yours–and, when it gets extreme, adding to the list of topics that can’t be commented on because I add absolute word blocks.

4 Responses to “Ego and similarity: A silly little post”

  1. Laura says:

    Sometime, if you feel so inclined, I’d be interested in hearing about your system for setting up absolute word blocks and such. I, too, am getting tired of the “great reading/how do i set up a blog?” spam, but at the moment, I just report them as spam and then delete them. I’m using Spam Karma 2, but I haven’t poked around under the hood at all.

  2. As you see, the Google results number is highly misleading – it’s enormously distorted by similar pages (i.e., different URLs which point to the same page).

  3. walt says:

    Laura: Unless things have changed in newer versions of WordPress, just click on Options in the dashboard, then Discussion from the Options screen.

    In addition to general controls–e.g., turning on overall moderation, determining the number of links that can get in without moderation–there are two panels in which words can be entered.

    The upper panel, Comment Moderation, will force any comment containing those words or word portions–anywhere in the message or header–into moderation. I’m using it to prevent one particular poster from posting directly.

    The lower panel, Comment Blacklist, is prepopulated with a bunch of spamment words (medication names, for example). You can add words that will cause comments to be blocked entirely–with caution, since partial words can match and you won’t even know that comments have been blocked.

    I don’t currently use any spam plugin; WordPress’ regular controls have been good enough to date. (As with spam in work mail, which is now 99% blocked by Postini, spammers seem to come in waves–when nothing works, the attempts decline for a while.) I’d been thinking about adding a CAPTCHA-like utility, but actually would prefer not to add such a bar. (Not as bad as requiring site registration, to be sure.)

    Seth: Oddly, while the overall results number is misleading as all get out (which I already knew), the set of displayed results is, I believe, misleadingly low. I think Google’s definition of “similarity” is too inclusive in some cases. But the numbers problem is similar across the board, with the possible exceptions of AskJeeves/Teoma (which look to be plausible in my “ego searches”) and Alexa’s Google-derived numbers (which appear to be too low). MSN’s numbers are a whole lot lower than Yahoogle! and may or may not be more plausible.

    It’s useful to be reminded, as you keep reminding people, that treating any Yagooms number greater than 1,000 as significant of much of anything is a misuse of “statistics.” Basically, it’s a more refined version of the old “one, two, many” counting scheme: one to 999 and Many.
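
For anyone curious why the Comment Blacklist caution above matters, here is a rough sketch of substring-style word blocking. It is not WordPress's actual code: the function names (`find_blocked_terms`, `classify_comment`) and the sample word lists are made up for illustration, and it simply assumes the behavior described in the comment above, where a listed word or word portion matches anywhere in the text, with case ignored.

```python
def find_blocked_terms(comment_text, terms):
    """Return every term that appears anywhere in the comment.

    Assumption: matching is case-insensitive and on substrings, so a
    term can match in the middle of a longer, innocent word.
    """
    lowered = comment_text.lower()
    return [term for term in terms if term.lower() in lowered]


def classify_comment(comment_text, moderation_terms, blacklist_terms):
    """Hypothetical three-way decision: blocked, moderation, or approved."""
    # Blacklist wins: the comment disappears and nobody is told.
    if find_blocked_terms(comment_text, blacklist_terms):
        return "blocked"
    # Moderation terms only hold the comment for review.
    if find_blocked_terms(comment_text, moderation_terms):
        return "moderation"
    return "approved"


if __name__ == "__main__":
    blacklist = ["cialis"]          # a medication name, as in the prepopulated list
    moderation = ["free blog"]      # made-up moderation phrase

    # "specialist" contains "cialis", so a harmless comment gets blocked.
    print(classify_comment("Ask a specialist about cataloging.", moderation, blacklist))
    print(classify_comment("How do I set up a FREE BLOG?", moderation, blacklist))
    print(classify_comment("Nice post about ego searches.", moderation, blacklist))
```

The first call is the case to worry about: the commenter never mentions a medication, but a partial-word match still blocks the comment silently. Sending doubtful words to the moderation list instead at least leaves the comment where it can be seen.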