[H-GEN] Spam Assassin

Tue Oct 22 22:11:38 EDT 2002

[ Humbug *General* list - semi-serious discussions about Humbug and     ]
[ Unix-related topics. Posts from non-subscribed addresses will vanish. ]

On Tue Oct 22 2002 at 16:20, Jason Parker-Burlingham wrote:

> Cor.  This is a pretty nice bit of software.  I installed it a few

SpamAssassin?  Yes it is, very nice.  It is also easy to install and
keep updated ("perl -MCPAN -e 'install Mail::SpamAssassin'"; 2.43 is
the current version).

I'm also using it as part of a mail filtering subsystem that plugs
right into sendmail itself (called mimedefang), and this filter is
catching spam very, very nicely (along with viruses and other
"undesirable" emails).

> I copied the procmailrc that the SA documentation suggests people use
> to capture spam and modified it after looking at Chris Biggs's
> procmailrc (thanks Chris) and the Fine Manual.
> 
> 	:0fw
> 	| spamassassin -P
> 	
> 	:0:
> 	* ^X-Spam-Status: Yes
> 	caughtspam

Two things here... the latest versions of spamassassin no longer need
the -P switch; what it did is now the default behaviour.

And calling spamassassin itself directly for each message is very
expensive on system resources... I've seen peak CPU load hit 100%
when doing it like this.  The drain is especially evident (even on
high-end boxes) if you try it with your formail trick (see below) or
when pulling in a lot of email at one time with a POP client.

A much better way to do this is to have spamd running as a daemon
(it listens on a local socket), then do the filtering with the spamc
client.  When processing a lot of incoming email, you'll find that
using spamc/spamd is much faster (the load is hardly noticeable).

I do it like this in my ~/.procmailrc:

# do SpamAssassin check only if the email is < 100k
:0 fw
* < 100000
| spamc -p 2222
# if spamc failed, make procmail exit with the same status
:0 e
{
  EXITCODE=$?
}
# deliver here if it is tagged as spam...
:0H :spammer/$LOCKEXT
* ^X-Spam-Status:.*Yes
| rcvstore +spam

  The first test is a sanity check: emails over ~100k are rarely
  spam, and checking them is both resource-expensive and pointless.

  I already have spamd running on port 2222, started from my
  ~/.bash_profile script (if it isn't already running).
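
  A minimal sketch of what such a ~/.bash_profile fragment might look
  like.  The helper name start_if_absent is made up for illustration,
  and the sketch assumes that pgrep is available and that the daemon
  appears as "spamd" in the process table; adjust to taste:

```sh
# Run a command only when this user has no process with the given name.
# start_if_absent is a hypothetical helper, not part of SpamAssassin.
start_if_absent() {
    name=$1
    shift
    # pgrep -x matches the exact process name; this assumes the daemon
    # shows up as "spamd" in the process table.
    if pgrep -u "$(id -u)" -x "$name" >/dev/null 2>&1; then
        return 0
    fi
    "$@"
}

# Only try when spamd is actually installed.
# -d detaches spamd into the background, -p sets the port it listens on.
if command -v spamd >/dev/null 2>&1; then
    start_if_absent spamd spamd -d -p 2222
fi
```

  The port has to match the one given to "spamc -p" in the recipe.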

  Procmail delivers my email as individual messages into (n)mh style
  folders via rcvstore; spam goes into a ~/Mail/spam/ folder.  (It's
  just as easy to have it all put into a single "caughtspam" file
  like the first recipe.)

  When using SA with sendmail/mimedefang, if a spam score over 9 is
  detected then I have the filter delete the original recipient list
  and divert the message into a local "spammer" mailbox.

  False positives don't happen often, but when they do, this makes
  it possible to find the legitimate emails that got caught as spam.
  This works well in a situation like an office where several people
  can have local IMAP access to that account to review what gets put
  into the spammer mailbox.  For an ISP, other mechanisms could be
  used to auto-whitelist and "seamlessly" re-post any emails that
  were erroneously quarantined.  (Heck, you don't want to be blamed
  for losing people's emails! :)

The spam-hit level is determined by the "required_hits" value in
~/.spamassassin/user_prefs.  The default is 5 (iirc), and I have found
that 7 is a good all-round cut-off point: a few odd spams get
through, with very rare false hits.

I also quickly found that I needed to use whitelists (in the same
file) to specify email from subscriptions and certain mailing lists
which otherwise tend to get tagged as spam.
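
As a rough illustration of the two settings just mentioned, here is a
small shell function that appends them to a user_prefs file.  The
function name add_sa_prefs and the example addresses are hypothetical;
required_hits and whitelist_from are the option names as I understand
them for the 2.x series, so check the docs for your version:

```sh
# Append illustrative settings to a SpamAssassin user_prefs file.
# add_sa_prefs is a hypothetical helper; the optional argument is the
# file to write, defaulting to the per-user file named in the text.
add_sa_prefs() {
    prefs=${1:-"$HOME/.spamassassin/user_prefs"}
    mkdir -p "$(dirname "$prefs")" || return 1
    cat >> "$prefs" <<'EOF'
# score at or above which a message is tagged as spam
required_hits 7
# hypothetical addresses: replace with the lists you subscribe to
whitelist_from *@lists.example.org
whitelist_from newsletter@example.com
EOF
}
```

Running "add_sa_prefs" with no argument appends to
~/.spamassassin/user_prefs; pass another path to try it on a scratch
file first.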

>         formail -s procmail -Y ./procmailrc < spambox

"formail -s" is a very useful thing that I've used myself on many
occasions to split apart mailbox files and re-sort their contents :)

For added coolness you can include virus scanning.

Have a look at File::Scan, which is a perl-based virus scanner.

  It has a smaller virus database than many of the other linux-based
  scanners.  Some of them, like NAI's uvscan, are commercial; others,
  like clamav and openvirus, are not (searches at sourceforge.net
  will find these for you).

  But File::Scan works for most purposes, it is perl and therefore
  very flexible, and it is freely available.

  To keep it updated regularly, put this into /etc/crontab as a
  weekly cron job: "perl -MCPAN -e 'install File::Scan'".

  As it comes as a perl module, File::Scan has no front-end: you
  have to DIY to actually put it to work, and it can't scan
  attachments in email unless they are first mime-decoded.

By using a back-end like MIME::Parser (from MIME::Tools) to pull
apart an email message into its various components, it is possible
to write a small perl script to act as a stdin->stdout filter for
virus filtering, suitable for use with procmail recipes (or
command-line use) in a very similar way to how spamassassin works.

  I'm working on such a script right now.  I want to plug it into a
  procmailrc recipe to work much like this, and I'm not far from
  achieving this.  It can already read in a mailbox file with one or
  multiple messages, pull them apart one by one, scan the
  attachments and check that they are safe... I'll be happy when it
  can produce a cleaned version of the mailbox, free from spam and
  viruses (which are put into another file).  If I get this far with
  it, I'm happy to post the result here for all to enjoy :)

And then there's Anomy::HTMLCleaner, which can do more magic... :)

Cheers
Tony
---*#*=-=*#*=-=*#*=-=*#*=-=*#*=-=*#*=---
  Tony Nugent <Tony at linuxworks.com.au>
  LinuxWorks - Gold Coast Qld Australia
