Spam and Danny
Posted: August 20th, 2002
There are many spam-filtering systems being discussed at the moment. Some are popular. Some are new and interesting. Some are well-intentioned but harmfully flawed.
And some are, frankly, brilliant.
I have a couple of reservations, though: there’s still a blacklist underneath, which may be prone to the same problems that hit Prof. Felten (and all the previous victims of MAPS, ORBS etc.). And what’s with all the patents? Are they there as a vital part of the legal mechanism, or simply to stop others jumping in on the business model? Talking of which, does anyone else have the little nagging worry that a single company could end up holding email to ransom? Such is the problem of a protocol that relies on being proprietary.
Incidentally, the piece linked above is the first of a series of articles by Danny that he’s writing in order to learn how to write like a journo again because he needs the money to support a pregnant wife who needs a job or she’ll just sit around and irritate people. Given that he already proves he’s one of the best writers on the net on a weekly basis, justice demands that he doesn’t go hungry.
Danny and I were discussing spam filtering on the way to Dorkbot SF last week. He gave some convincing arguments against the particulars of the SpamAssassin approach, especially the way that it screws up HTML mail; while most of us consider HTML mail to be a bad thing, messing with the contents of mail is worse. (There’s also a nasty bug that screws up whitelisting, but I can’t remember the full details.) One of the biggest problems is that despite having a wicked-nifty genetic algorithm for determining rule scores, this algorithm is run over mailboxes belonging to the developers, and so is tuned to the kind of email they receive (very little HTML mail, apparently), which is not necessarily the same as what yer average user gets. Paul Graham’s system solves this problem by training its filters, Bayesian-style, on a per-user basis; the trouble with this is that it requires a fair degree of integration with the user’s mail system.
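To make the per-user point concrete, here is a rough sketch of what Bayesian-style training on each user's own mail might look like. This is not Paul Graham's actual algorithm and has nothing to do with SpamAssassin's code; it is a plain naive Bayes classifier with add-one smoothing, and the names `UserFilter`, `tokenize`, `filters`, and the two example users are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class UserFilter:
    """Hypothetical naive Bayes filter trained on one user's mail only."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message the user has marked as 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) from this user's training data."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so an untrained filter does not divide by zero.
            log_p = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                # Add-one smoothing so unseen words do not zero out the score.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Normalise the two log scores into a probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# One independent filter per user, created on first use.
filters = defaultdict(UserFilter)

# A developer who only ever sees HTML newsletters as junk...
filters["developer"].train("html newsletter special offer", "spam")
filters["developer"].train("patch for the parser bug", "ham")

# ...and a designer who actually wants them.
filters["designer"].train("html newsletter layout review", "ham")
filters["designer"].train("cheap pills special offer", "spam")

message = "html newsletter"
developer_score = filters["developer"].spam_probability(message)
designer_score = filters["designer"].spam_probability(message)
```

The same message scores as likely spam for the developer and likely legitimate for the designer, which is exactly what a single set of developer-tuned scores cannot give you. The catch is also visible: something has to call `train` with each user's own verdicts, and that is where the integration with the mail system comes in.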