The MTA/SpamAssassin Interface

Because SpamAssassin limits itself to analyzing e-mail and assessing its spamminess, the problem of what to do with spam is left up to the individual system administrator. In my opinion, the best thing to do with spam is reject it at SMTP time, so the burden of handling spam is placed squarely where it belongs: on the spam-friendly ISPs where it originates, and the incompetent open relays that allow it to propagate.

Unfortunately, I haven't yet figured out how to do that. So I have implemented three Python scripts that, I think, are the second-best way to deal with spam: deliver it to a special spam folder on the mail server, periodically mail spam reports to interested parties, and allow those interested parties an easy way to rescue false positives.

routespam

The first of these scripts is responsible for handling incoming email. routespam does the following:

Details

routespam's command line syntax is one of the following:

routespam sender_addr recip ...
routespam sender_addr recipients

In the first case, each recipient address is specified as a separate command-line argument, eg.

routespam sender@some.domain joe@your.domain jane@your.domain

In the second case, all recipients are specified together in a single comma-separated string:

routespam sender@some.domain "joe@your.domain, jane@your.domain"

(This is necessary to support running routespam from Exim's system filter, where the envelope recipients are only available in this form.)
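Both calling conventions can be handled by a single parser. The following is only a minimal sketch, not necessarily how routespam itself does it; the function name parse_args is hypothetical. It accepts either separate recipient arguments or one comma-separated string:

```python
def parse_args(argv):
    """Split routespam-style arguments into (sender, recipients).

    argv excludes the program name.  Each recipient argument may hold
    one address or several comma-separated addresses, so both calling
    conventions are accepted.  (Hypothetical helper, for illustration.)
    """
    if len(argv) < 2:
        raise ValueError("usage: routespam sender_addr recip ...")
    sender = argv[0]
    recipients = []
    for arg in argv[1:]:
        # Split on commas and drop surrounding whitespace and empty pieces.
        recipients.extend(p.strip() for p in arg.split(",") if p.strip())
    return sender, recipients
```

Called with sys.argv[1:], this returns the same recipient list for either form shown above.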

Thus, your MTA must make the envelope sender and recipients available when you pipe the message to routespam. Additionally, your MTA should add Return-path and Envelope-to headers to the message before piping it to routespam; this will make life much easier for the other two scripts, reportspam and rescue_nonspam. (Actually, rescue_nonspam is useless without an Envelope-to header.)

There are also some subtle issues that you must pay careful attention to in order to avoid mail loops when SpamAssassin says a message is not spam and routespam re-sends it. In particular, routespam re-sends the message like this:

/usr/sbin/sendmail -f sender@some.domain -oMr spamc -- joe@your.domain jane@your.domain
Thus:

I have no idea how all this will work with MTAs other than Exim. I suspect routespam will need some tweaking to get it working with qmail, postfix, or Sendmail -- patches are welcome. With Exim, there are two ways to do it.
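For anyone adapting the re-send step to another MTA, here is a rough Python sketch of how the sendmail command shown earlier could be assembled. The function name build_resend_command is hypothetical, and the real routespam may do this differently. The -oMr spamc argument matters because the Exim router and system filter in this document skip messages whose $received_protocol is spamc.

```python
import subprocess


def build_resend_command(sender, recipients, sendmail="/usr/sbin/sendmail"):
    """Return the argument list for re-sending a non-spam message.

    Mirrors the command line shown in this document; the "--" keeps
    recipient addresses from being parsed as options.
    (Hypothetical helper, for illustration only.)
    """
    return [sendmail, "-f", sender, "-oMr", "spamc", "--"] + list(recipients)


def resend(message_bytes, sender, recipients):
    """Pipe the message to sendmail; raises CalledProcessError on failure."""
    cmd = build_resend_command(sender, recipients)
    subprocess.run(cmd, input=message_bytes, check=True)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with unusual sender or recipient addresses.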

Using routespam with Exim

First, you can invoke routespam per-recipient. This has the advantage that you can pick and choose which addresses handled by your MTA are subject to spam filtering, and you have some latitude in the spam-detection options for each user. (Ie. if an address has a user account on the server, that user account can define ~/.spamassassin/user_prefs.) The disadvantage is that messages with multiple recipients will be processed multiple times. If a message is spam, it will be saved to the spam folder once for each recipient; if it is not spam, it will be resent as a separate message once for each recipient.

To configure Exim for per-recipient spam filtering, you need to add one router (director for Exim 3) and a corresponding transport. Here's a router for Exim 4:

# Mail for any local-part listed in /etc/exim/spamcheck_users is
# piped through my spamcheck script, which uses SpamAssassin to
# assess spamminess and routes messages accordingly.
spamcheck:
  driver = accept
  transport = spamcheck
  local_parts = lsearch;/etc/exim/spamcheck_users

  # Translated, this reads:
  #   if !(defined X-Spam-Flag) and
  #      !($received_protocol eq "spamc") and
  #      !($received_protocol eq "local") and
  #      !($sender_host_address eq 127.0.0.1) then:
  #      run this router
  condition = "\
    ${if and { {!def:h_X-Spam-Flag:} \
               {!eq {$received_protocol}{spamc}} \
               {!eq {$received_protocol}{local}} \
               {!eq {$sender_host_address}{127.0.0.1}} } \
         {1}{0}}"

  # No point sending mail back to the spammer, or a legit sender,
  # if my spamcheck script dies.
  errors_to = postmaster
  retry_use_local_part

  user = ${lookup{$local_part} lsearch {/etc/exim/spamcheck_users} {$value}}
  group = mail

The /etc/exim/spamcheck_users file determines which addresses are subject to spam-filtering, and at the same time it maps addresses to user IDs. For example:

greg : gward
gward : gward

says that any mail with a local part of greg or gward is subject to spam-filtering, and that routespam will be run as user gward for either address. And here's the corresponding transport:

# Pipe message through my routespam script, which uses spamc
# to determine if the message is spam.  If so, it's saved
# to /var/mail/spam.$local_part.  Otherwise, it reinvokes
# exim to send the message on to its intended recipient.
spamcheck:
  driver = pipe
  command = "/etc/exim/routespam $sender_address $local_part@$domain"
  delivery_date_add
  envelope_to_add
  log_output
  path = "/usr/bin:/usr/sbin:/bin:/usr/local/bin"
  return_output
  return_path_add
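
When debugging, it can help to check which user a given local part maps to in spamcheck_users. The following is a simplified Python sketch of that lookup; the function name parse_spamcheck_users is hypothetical, and it deliberately ignores parts of Exim's lsearch format (continuation lines, quoted keys) that the example file above does not use.

```python
def parse_spamcheck_users(text):
    """Parse "local_part : user" lines into a dict.

    Simplified reading of an lsearch-style file: blank lines and
    lines starting with "#" are skipped, and the first entry for a
    key wins.  (Hypothetical helper, for illustration only.)
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            # No colon: treat the first whitespace-separated word as the key.
            parts = line.split(None, 1)
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
        mapping.setdefault(key.strip(), value.strip())
    return mapping
```

With the example file above, both greg and gward map to the user gward, and any other local part is absent from the mapping (so the router would not apply).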

Second, you can invoke routespam on all incoming mail, using Exim's system filter. The advantage of this is that each incoming spam is saved to the spam folder only once, and non-spam messages that come in with multiple recipients are resent only once, with all their original recipients. The disadvantage is that you can't have per-address customization of SpamAssassin's filtering rules.

Here's how I invoke routespam from the system filter (should work with both Exim 3 and 4):

# Let error messages through.
if error_message then
  finish
endif

# If this message originated locally, or if it has already been
# processed by spamcheck, then stop processing now.  Send it on as-is.
if ($received_protocol is "local" or
    $received_protocol is "spamc" or
    $sender_host_address is "127.0.0.1" or
    $h_X-Spam-Flag: is not "") then
  finish
endif

# Send all other mail to routespam.
pipe "/etc/exim/routespam $sender_address $recipients"
(Note that in real life, I use a much longer system filter that includes rudimentary virus-filtering. In other words, the above should work, but I have not tested it as a complete system filter.)
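The loop-avoidance tests in the system filter (and in the router condition earlier) boil down to one decision: does this message still need to go through routespam? Here is a small Python sketch of that logic, useful for reasoning about mail loops. The function name and parameters are hypothetical, and it uses exact string comparison as a simplification.

```python
def needs_spam_check(received_protocol, sender_host_address, headers,
                     is_error_message=False):
    """Return True if a message should be piped to routespam.

    Mirrors the system filter above: error messages, locally
    originated mail, mail re-sent by routespam (protocol "spamc"),
    mail from 127.0.0.1, and mail already carrying an X-Spam-Flag
    header are all passed through untouched.
    (Hypothetical helper; headers is a plain dict here.)
    """
    if is_error_message:
        return False
    if received_protocol in ("local", "spamc"):
        return False
    if sender_host_address == "127.0.0.1":
        return False
    if headers.get("X-Spam-Flag"):
        return False
    return True
```

A message that routespam re-sends arrives with the protocol "spamc", so it fails the second test and is delivered normally instead of being scanned again.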

reportspam

Catching spam and saving it to a special folder is only the first step. Like any content-based filter, SpamAssassin has false positives: messages that look like spam, but aren't. Thus, you need actual human eyeballs and brainpower to periodically scan the caught spam for false positives. Of course, you could always log in to the mail server and run your favourite MUA on /var/mail/spam, but there's an easier way: run reportspam periodically (eg. nightly).

reportspam's main purpose is to generate a summary of all new messages in the spam folder and mail the summary to interested parties (currently hard-coded to "postmaster"). Once it has done so, it marks the new messages as "old" (ie. unread but no longer new) by moving them from the "new" directory of the Maildir to the "cur" directory. That way, future runs of reportspam won't report the same messages twice. Finally, reportspam can optionally delete old messages: if you run it with -d N, it will delete any message received more than N days ago. If it deletes any old messages, it sends a separate mail (also to postmaster) saying how many old messages it deleted.
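The Maildir bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not reportspam's actual code: the function names are hypothetical, message age is judged by file modification time (the real script may use something else), and no flags are set when a message moves to "cur".

```python
import os
import time
from pathlib import Path


def mark_new_as_old(maildir):
    """Move every message from new/ to cur/; return the new paths."""
    maildir = Path(maildir)
    moved = []
    for path in sorted((maildir / "new").iterdir()):
        if not path.is_file():
            continue
        # Maildir convention: names in cur/ end in ":2,<flags>".
        # No flags are set here, so the message stays unread.
        target = maildir / "cur" / (path.name + ":2,")
        os.rename(path, target)
        moved.append(target)
    return moved


def delete_older_than(maildir, days, now=None):
    """Delete messages in cur/ older than `days` days; return the count.

    Assumption: age is taken from the file's modification time.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    deleted = 0
    for path in list((Path(maildir) / "cur").iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted += 1
    return deleted
```

Running mark_new_as_old before building the next report is what keeps the same message from being reported twice.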

Here's a sample crontab entry that runs reportspam twice a day (3:00am and 3:00pm) as the mail user, deleting any mail received more than five days ago:

00 03,15 * * * mail /etc/exim/reportspam -d 5 /var/mail/spam

Here's an example of reportspam's report mail:

To: postmaster
Subject: Spam report (mail.python.org): 16 new messages

[...]
 4) "Important Investor Info"
    from hotstocks@bigfoot.com
    (envelope sender <hotstocks@bigfoot.com>)
    for Zope@Zope.org
    received 2002-04-16 03:22:21

 5) "re:  If You Smoke..."
    from FreeBigDeals Newsletter <freebigdeals@reply.mb00.net>
    (envelope sender <freebigdeals@ofr.mb00.net>)
    for help@python.org
    received 2002-04-16 03:39:11
[...]
Note that reportspam relies on the presence of Return-path and Envelope-to headers in the messages in the spam folder to generate its report.
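To show why those two headers matter, here is a hedged Python sketch that builds one report entry in the layout shown above. The function name format_entry is hypothetical, the received timestamp is passed in by the caller (how reportspam determines it is not covered here), and the fallback strings for missing headers are my own placeholders.

```python
from email import message_from_string


def format_entry(index, raw_message, received):
    """Format one spam-report entry from a raw message string.

    Return-path supplies the envelope sender and Envelope-to the
    original recipient; without them those lines fall back to
    placeholders.  (Hypothetical helper, for illustration only.)
    """
    msg = message_from_string(raw_message)
    lines = [
        '%2d) "%s"' % (index, msg.get("Subject", "")),
        "    from %s" % msg.get("From", "(unknown)"),
        "    (envelope sender %s)" % msg.get("Return-path", "<>"),
        "    for %s" % msg.get("Envelope-to", "(unknown)"),
        "    received %s" % received,
    ]
    return "\n".join(lines)
```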

rescue_nonspam

Finally, now that you can shunt spam aside and easily be notified of false positives, you need to rescue those false positives. Currently, I do this with the help of my rescue_nonspam script, which takes a message on stdin, looks for the original recipient(s) in the Envelope-to header, adds an X-Rescued header, and re-sends the message to its original recipient(s).
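The header handling that rescue_nonspam performs can be sketched as follows. This is not the script itself: the function name prepare_rescue is hypothetical, the X-Rescued value "yes" is an assumption (this document does not say what value the real script writes), and the actual re-sending step is left out.

```python
from email import message_from_string
from email.utils import getaddresses


def prepare_rescue(raw_message):
    """Return (recipients, message_text) for a rescued false positive.

    Recipients come from the Envelope-to header(s); an X-Rescued
    header is added to mark the message.  Raises ValueError when no
    Envelope-to header is present, since the original recipients
    cannot be recovered without it.
    """
    msg = message_from_string(raw_message)
    envelope_to = msg.get_all("Envelope-to", [])
    recipients = [addr for _name, addr in getaddresses(envelope_to) if addr]
    if not recipients:
        raise ValueError("no Envelope-to header; cannot determine recipients")
    msg["X-Rescued"] = "yes"  # assumed value, for illustration
    return recipients, msg.as_string()
```

This also illustrates why rescue_nonspam is useless without an Envelope-to header: there is nowhere else in the saved message to find the envelope recipients.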

I also like to save false positives for future review -- I like to periodically tweak my local SpamAssassin rules and scores to do a better job of catching spam and not catching real mail, so it's handy to have a stash of false positives around. Thus, I use a mutt macro to run rescue_nonspam:

# ^R to rescue a false positive: pipe it through
# rescue_nonspam script (which sends it to the original
# recipient), and save to /var/mail/fp-spam.
macro pager \Cr <pipe-message>/etc/exim/rescue_nonspam^m<save-message>/var/mail/fp-spam^m
macro index \Cr <pipe-message>/etc/exim/rescue_nonspam^m<save-message>/var/mail/fp-spam^m

rescue_nonspam is kind of lame, but it's better than nothing. I have vague plans for a nice web interface to the spam folder. It could replace both reportspam and rescue_nonspam (although I still think nightly runs of reportspam are a good thing, to ensure that somebody has a chance of seeing false positives!)

Installation

In addition to the three scripts above, you'll need lib/satools.py, which contains common code for them. The easiest thing is to install everything in one directory, eg.

cp lib/satools.py scripts/{routespam,reportspam,rescue_nonspam} /etc/exim

The install-scripts shell script might be useful, but you'll obviously have to customize it for your own use.