[H-GEN] Spam Assassin
Jason Parker-Burlingham
jasonp at uq.net.au
Tue Oct 22 16:20:08 EDT 2002
Cor. This is a pretty nice bit of software. I installed it a few
weeks ago, got around to configuring it about two days ago, and I'm
reasonably impressed: it catches more spam than I do on my own, _and_
I get to be amused at what it deems "spam" from time to time (some
job-hunting reports I've signed up for, for example).
I thought perhaps some list members may find it illuminating to see
how I went about putting it through its paces, and the stumbling
blocks I found along the way.
First, I worked out how to have my mail client export a subset of my
personal messages to an mbox-format[2] mailbox. I selected the last
few weeks of my personal mail, as well as all the spam I ever
received[1].
Next I set up a small directory in /var/tmp (not in home! I had no
idea if something was going to be `clever' and start trashing stuff in
parent directories) with a copy of the spam mailbox (spambox) and the
personal mail (personalbox).
I copied the procmailrc that the SA documentation suggests people use
to capture spam and modified it after looking at Chris Biggs's
procmailrc (thanks Chris) and the Fine Manual.
:0fw
| spamassassin -P
:0:
* ^X-Spam-Status: Yes
caughtspam
Basically, the first rule pipes the mail through spamassassin as a
filter (the "f" flag; "w" makes procmail wait for it to finish), and
the (modified) message is then tested for a header matching
"^X-Spam-Status: Yes". If that test is positive, the message is
delivered to the "caughtspam" box.
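A minimal sketch of the test the second recipe performs, written as a shell function so it can be tried on saved messages without procmail. The name has_spam_header is a hypothetical helper, not part of procmail or SpamAssassin, and it only approximates procmail's matching (case-insensitive, headers only).

```shell
# has_spam_header FILE   (hypothetical helper, for illustration only)
# Succeeds if the header block of the single message in FILE contains a
# line starting with "X-Spam-Status: Yes". The body is never examined.
has_spam_header() {
    # sed prints everything up to and including the first empty line
    # (the end of the headers) and then quits.
    sed '/^$/q' "$1" | grep -qi '^X-Spam-Status: Yes'
}
```

Something like `has_spam_header saved.msg && echo spam` would then report on one saved message; a message that merely quotes the header in its body is not matched.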
I pointed procmail's MAILDIR at somewhere safe, added the
LOGFILE= option to the procmailrc, and added a definition for DEFAULT= to
ensure mail not caught by the spam-status-yes rule would go somewhere,
too (instead of to my spool, which I'd thoughtfully emptied
beforehand, of course).
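The settings described above (a safe MAILDIR, a log file, and a DEFAULT mailbox) might be put together roughly as follows. This is a sketch under assumptions: write_test_procmailrc is a hypothetical helper, and the names "maildata", "LOG" and "inbox" are illustrative choices rather than the ones actually used for the test run.

```shell
# write_test_procmailrc DIR   (hypothetical helper, for illustration only)
# Creates DIR/maildata and writes DIR/procmailrc so that everything
# procmail delivers or logs stays underneath DIR.
write_test_procmailrc() {
    dir=$1
    mkdir -p "$dir/maildata" || return 1
    # \$MAILDIR is escaped so that procmail, not the shell, expands it.
    cat > "$dir/procmailrc" <<EOF
MAILDIR=$dir/maildata
LOGFILE=\$MAILDIR/LOG
DEFAULT=\$MAILDIR/inbox

:0fw
| spamassassin -P

:0:
* ^X-Spam-Status: Yes
caughtspam
EOF
}
```

Calling `write_test_procmailrc /var/tmp/satest` (a made-up path) leaves a procmailrc whose relative mailbox names, such as caughtspam, resolve inside the scratch MAILDIR rather than anywhere near the real spool.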
I also stopped mail fetching (a system-wide fetchmail) by stopping the
daemon, and disconnected from the net, unplugging the modem for good
measure, figuring that extracting valuable email that decided to come
in while I was testing would be, uh, a pain.
Next, I worked out how to use the "formail" program (which comes with
procmail) to process messages one-by-one as though they'd just arrived
from Exim/fetchmail. The basic idea ran like this:
formail -s procmail -Y ./procmailrc < spambox
(The -Y is an option to *procmail*, not formail. See the manual. It
indicates that procmail should not trust content-length headers.)
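To picture what "one-by-one" means here, the following sketch splits an mbox into separate files at each "From " separator line. It is an illustration only, not how formail is implemented: split_mbox and the msg.NNNN file names are hypothetical, Content-Length headers are ignored, and it assumes body lines beginning with "From " have been escaped as ">From ", as mbox writers normally do.

```shell
# split_mbox MBOX OUTDIR   (hypothetical helper, for illustration only)
# Writes each message of MBOX to its own file, OUTDIR/msg.0001 and so on.
split_mbox() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        # A "From " line starts a new message: close the previous output
        # file and pick the next file name.
        /^From / { if (out) close(out); n++; out = sprintf("%s/msg.%04d", dir, n) }
        # Copy every line (including the "From " line) to the current file.
        out      { print > out }
    ' "$1"
}
```

With formail -s, each of those chunks is instead fed straight to a fresh procmail process on standard input, which is what makes the batch look like freshly arrived mail.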
At this point I could make the run, and go to the maildata directory
(where the caughtspam mbox was put, along with the procmail logfile)
and use the mailstat program (with the -k (no clobber) and -l (long
format) options) to see how many messages had been sorted to the "spam"
box, and how many to my "standard" inbox.
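As a cross-check on the mailstat figures, the mailboxes themselves can be counted. The sketch below counts "From " separator lines, so it is only as accurate as the mbox escaping; count_messages and tally are hypothetical helper names, and the mailbox paths in the usage note are assumptions.

```shell
# count_messages MBOX   (hypothetical helper)
# Prints the number of "From " separator lines in MBOX, or 0 if the
# file does not exist.
count_messages() {
    if [ -f "$1" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches.
        grep -c '^From ' "$1" || true
    else
        echo 0
    fi
}

# tally INPUT BOX...   (hypothetical helper)
# Prints a count per output mailbox and succeeds only if the counts add
# up to the number of messages in INPUT, i.e. nothing went missing.
tally() {
    input=$1
    shift
    expected=$(count_messages "$input")
    total=0
    for box in "$@"; do
        n=$(count_messages "$box")
        printf '%s: %s\n' "$box" "$n"
        total=$((total + n))
    done
    printf 'input %s, delivered %s\n' "$expected" "$total"
    [ "$expected" -eq "$total" ]
}
```

After a run, something like `tally spambox maildata/caughtspam maildata/inbox` should agree with what `mailstat -k -l` reports from the log; a mismatch suggests mail was delivered somewhere unexpected.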
Some quick tests of speed and SA's hit-rate helped me work out any
kinks in my tests, play with SA options[3], and test writing back to
my spool file, before I decided to jump online again and set up a
small procmail entry on my shell account, to trap mail coming in to
use for a more-or-less live test before letting it all rip.
Bzzzt! You lose! My system helpfully tells fetchmail to get new mail
whenever the network comes up. Oh dear. All this mail is being sent
through procmail and spamassassin *right* *now*. Fortunately, things
had been sufficiently tested that I wasn't _too_ concerned, and I had
re-added Chris's procmail backup rules in case anything went
disastrously wrong.
Ultimately, my procmailrc looks like this (less some comments):
SHELL=/bin/bash
PATH=$HOME/bin:/usr/bin:/bin:/usr/local/bin
MAILDIR=$HOME/.maildata
LOGFILE=$MAILDIR/LOG
# Two safety-net rules so that if any of my new rules go haywire, mail will
# not be lost.
#
# Put a copy of all messages in the directory folder "backup"
# You must create this directory ($MAILDIR/backup) manually!
#
:0 c
backup
# Whenever a message is received, after it is added to the backup folder
# above, cd into that directory and delete all but the 250 most recent
# messages.
:0 ic
| cd backup && /bin/rm -f dummy `ls -t msg.* | sed -e 1,250d`
# pipe all mail through spamassassin, then check it for the
# X-Spam-Status: header.
:0fw
| spamassassin -P -a -L
:0:
* ^X-Spam-Status: Yes
caughtspam
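The backup-pruning recipe in the procmailrc above leans on a small shell idiom: list the files newest first, drop the first N names from the listing, and remove whatever is left. A sketch of the same idea as a standalone function, where prune_backup is a hypothetical name and KEEP is assumed to be at least 1:

```shell
# prune_backup DIR KEEP   (hypothetical helper, for illustration only)
# Removes all but the KEEP most recently modified msg.* files in DIR.
prune_backup() {
    (
        cd "$1" || exit 1
        # ls -t lists newest first; sed deletes the first KEEP lines of
        # that listing, so only the older names reach rm. The "dummy"
        # argument keeps rm from being run with no file names at all.
        # Word splitting is fine here because msg.* names have no spaces.
        rm -f dummy $(ls -t msg.* 2>/dev/null | sed -e "1,${2}d")
    )
}
```

`prune_backup "$HOME/.maildata/backup" 250` would do by hand what the recipe does on every delivery. The number of messages kept is whatever the sed range says.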
Note that I no longer reset DEFAULT=, since my mail client
expects to read mail from /var/spool/mail/$USER. I could ask it to
fetch mail from ~/.maildata/whatever, but in the interests of making
the fewest changes possible, I decided to leave it the way it was.
Playing with procmail options while off-line helped do that.
Time taken was about two hours or so.
jason
[1] : My mail client `expires' (read: deletes) messages from my spam
inbox as it grows, so I had only a few hundred messages.
[2] : This is the format of /var/spool/mail/$USER on most systems
(qmail is a notable counter-example, I think). Not the world's
best format, but it *is* well-understood by more programs than
other formats.
If your mail client can't export to and import from mbox, then I
think that's probably a bug.
[3] : I made sure I turned on the SA -L (local tests only) option,
then realized I'm online whenever mail is delivered (so -L seemed
unnecessary), and then re-realized that this is not in fact true:
late-night cron jobs and mail enqueued by exim could be delivered
while I was offline, preventing SA's network tests from working
properly.
--
||----|---|------------|--|-------|------|-----------|-#---|-|--|------||
| ``Ooooaah! |
| I'm getting so excited about cheese-making I can't stand it!'' |
||--|--------|--------------|----|-------------|------|---------|-----|-|