[TriLUG] Bayesian filtering

Jim Ray jim at neuse.net
Mon Sep 1 08:02:44 EDT 2003


i thought some of you would appreciate and actually understand this article
from Tech Republic:

Use DSPAM to reduce spam from a Linux mail server

http://techrepublic.com.com/5102-6261-5063820.html

August 27, 2003
Scott Lowe MCSE

Spam: It’s what’s for dinner. And breakfast. And lunch. And every snack in
between. Wherever you turn these days, spam is invading inboxes everywhere,
quickly making the jump from an annoyance to a major business problem. Spam is
much more operating system-agnostic than many e-mail viruses, so you can
find a host of anti-spam solutions for a variety of products on a variety of
platforms.

One solution for UNIX and Linux mail servers is DSPAM, which acts as the
local delivery agent for the server and learns to recognize spam to ease the
administrative burden of constantly keeping up with blacklists. DSPAM uses a
Bayesian statistical analysis to improve the success rate and reduce the
percentage of false positives.

----------------------------------------------------------------------------
What's Bayesian analysis?
"Bayesian," according to Merriam-Webster Online, is “being, relating to, or
concerned with a theory (as of decision making or statistical inference)
involving the application of Bayes' theorem and the use of probabilities
based on prior knowledge and accumulated experience.” Simply put, DSPAM uses
an analysis of past results to continually improve its spam-detection rate,
resulting in a higher success rate as time goes on.
----------------------------------------------------------------------------
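
As a rough illustration of the idea in the sidebar, the sketch below applies
Bayes' theorem to a single word. It is not DSPAM's actual scoring code: the
function name token_spam_prob and the message counts are invented for this
example, and it assumes spam and innocent mail are equally likely beforehand.

```shell
#!/bin/sh
# Hypothetical helper, not part of DSPAM: estimate P(spam | word) from counts.
# Arguments: spam messages containing the word, total spam messages,
#            innocent messages containing the word, total innocent messages.
# The totals must be non-zero and the word must occur at least once.
token_spam_prob() {
    awk -v s_hit="$1" -v s_tot="$2" -v h_hit="$3" -v h_tot="$4" 'BEGIN {
        p_spam = s_hit / s_tot    # P(word | spam)
        p_ham  = h_hit / h_tot    # P(word | innocent)
        # Bayes theorem with equal priors for spam and innocent mail
        printf "%.2f\n", p_spam / (p_spam + p_ham)
    }'
}

# Example: the word appears in 40 of 100 spam and 10 of 100 innocent messages.
token_spam_prob 40 100 10 100
```

Every message a user forwards changes these counts, which is how accumulated
experience improves the estimate over time.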

System requirements
DSPAM requires a mailer agent that is capable of using a configurable local
delivery agent and the Berkeley DB4 database. The Berkeley DB4 database is
an easy installation, and full instructions are provided in its accompanying
README file. As of this writing, the current version of DSPAM is 2.6.3, and
you can download it here. Let's walk through the process of installing and
configuring DSPAM.

----------------------------------------------------------------------------
My lab configuration
For this article, I am using Red Hat 9 and my mail server is Sendmail.
----------------------------------------------------------------------------

Installing DSPAM
First, download the latest version of DSPAM from the link above. For my
example, the filename is dspam-2.6.tar.gz. From the directory where you have
saved the download, execute the following command to expand the
distribution:
gunzip -dc dspam-2.6.tar.gz | tar xvf -

Now, change to the expanded directory with the command cd dspam-2.6. You can
build the configuration for DSPAM using a typical configure command with the
options shown in Table A.


Table A: DSPAM configuration options

Parameter:   --with-local-delivery-agent=[mail program]
Description: Use the program specified as the local mail delivery agent.
Default:     Depends on your system.

Parameter:   --with-userdir=[user directory]
Description: Specify the directory where user dictionaries, signatures, etc.
             should be stored.
Default:     /etc/mail/dspam

Parameter:   --with-signature-life=[# of days]
Description: The number of days for the signature life.
Default:     14 days

Parameter:   --with-db4-includes=[Location of DB4 includes]
Description: Where to find Berkeley DB 4.1.x headers.
Default:     Depends on DB4 install.


Since I did a typical install using Sendmail, I could use the following
command to begin the installation process:
./configure --with-db4-includes=/usr/local/BerkeleyDB.4.1/include/

I included the path to the DB4 includes to make sure that the configuration
script could find them. Unfortunately, on my Red Hat Linux 9 system, the
configuration failed with an error relating to the Berkeley DB 4 libraries,
even though I provided the location to find them. After finding the source
of the error and visiting the helpful user discussion forums at the DSPAM
Web site, I issued the following command before executing the configure
script again:
export LDFLAGS='-Wl,--rpath -Wl,/usr/local/BerkeleyDB.4.1/lib -Wl,--library-path -Wl,/usr/local/BerkeleyDB.4.1/lib'

The LDFLAGS variable passes these options to the linker, so that the
configure script's tests and the subsequent build can both find the Berkeley
DB libraries.
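
If Berkeley DB is installed under a different prefix, the same flags can be
derived from that prefix. This is a minimal sketch for a POSIX shell;
db4_ldflags is a helper name made up for this example, not something DSPAM
provides.

```shell
#!/bin/sh
# Hypothetical helper: print the linker flags used in the article for a given
# Berkeley DB installation prefix (the directory that contains lib/).
db4_ldflags() {
    printf '%s\n' "-Wl,--rpath -Wl,$1/lib -Wl,--library-path -Wl,$1/lib"
}

# Example: show the flags for the prefix used in the article. To apply them:
#   export LDFLAGS="$(db4_ldflags /usr/local/BerkeleyDB.4.1)"
db4_ldflags /usr/local/BerkeleyDB.4.1
```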

Once the command prompt comes back and there are no errors, compile DSPAM
using the make command. To install the compiled binaries into their final
location, execute make install. This step needs to be performed as the root
user. After this completes successfully, DSPAM is ready to be used by your
mail program.

Changes to the Sendmail configuration
Once DSPAM is installed, you need to modify your Sendmail configuration to
use DSPAM as the local delivery agent. Doing this will force mail through
the DSPAM engine so that it can do its job.

Changing the local delivery agent to the DSPAM executable is accomplished by
modifying the Sendmail configuration file, sendmail.cf. Be sure to make a
copy of sendmail.cf before changing it.
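
One way to make that copy is a small function that writes a dated backup next
to the original. This is only a sketch; backup_config is a hypothetical name,
and the path in the example comment is an assumption you should adjust for
your system.

```shell
#!/bin/sh
# Hypothetical helper: copy a configuration file to a timestamped backup in
# the same directory and print the name of the backup.
backup_config() {
    src="$1"
    dest="$src.$(date +%Y%m%d-%H%M%S).bak"
    # -p preserves the original's mode and timestamps
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (run as root; the path is an assumption):
#   backup_config /etc/mail/sendmail.cf
```

If an edit goes wrong, copying the printed backup file over sendmail.cf
restores the previous configuration.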

To make DSPAM active, find the line at the bottom of sendmail.cf labeled
Mlocal. If you are not using procmail, the first option after Mlocal will
read something like P=/bin/mail. In this case, replace the contents of the
Mlocal line with the following:
Mlocal, P=/usr/local/bin/dspam, F=lsDFMAw5:/|@qfSmn9, S=EnvFromL/HdrFromL, R=EnvToL/HdrToL,
       T=DNS/RFC822/X-Unix,
       A=dspam -d $u

If you are using procmail, which is identifiable by looking at the original
Mlocal line, you need to use a slightly different configuration. With
procmail, the first configuration option on the Mlocal line will read
P=/usr/bin/procmail, and you will replace the contents of the Mlocal line
with the following:
Mlocal, P=/usr/local/bin/dspam, F=lsDFMAw5:/|@qSPfhn9, S=EnvFromL/HdrFromL, R=EnvToL/HdrToL,
       T=DNS/RFC822/X-Unix,
       A=dspam -t -Y -a $h -d $u

If you installed DSPAM to a different location, provide that location in
place of /usr/local/bin/dspam.
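
To see which of the two cases applies before editing, you can print the
program named on the existing Mlocal line. The sketch below assumes the line
starts with "Mlocal," and carries a P= option as described above;
mlocal_program is a name invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the P= value from the Mlocal line of the
# sendmail.cf file given as the first argument.
mlocal_program() {
    sed -n 's/^Mlocal,.*P=\([^,]*\),.*/\1/p' "$1"
}

# Example (the path is an assumption):
#   mlocal_program /etc/mail/sendmail.cf
# Output ending in "procmail" means the procmail variant is the one to use.
```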

Adding mail aliases
DSPAM works by having the user forward spam to a unique account that is just
for this purpose. For each user who will use DSPAM, you need to add a
spam alias to the aliases file, which is typically located in either /etc or
/etc/mail. On my Red Hat 9 system, it is in /etc.

Use a text editor to edit this file and add an entry similar to the
following for each user:
spam-slowe: "|/usr/local/bin/dspam -d slowe --addspam"

The first part, spam-slowe, is simply an existing user ID with spam- as the
prefix. The second part, |/usr/local/bin/dspam, will pipe mail received to
this account through the executable you named (in this case, the DSPAM
executable). The -d slowe portion indicates that the name of the dictionary
is slowe. A separate dictionary is created for each user. Finally, --addspam
tells DSPAM to treat the message as spam when it evaluates future mail.

After you have added an alias to the aliases file, run the command
newaliases to rebuild the aliases dictionary, aliases.db.
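
With more than a handful of users, typing these entries by hand gets tedious.
The sketch below prints one entry per user name in the same form as the
example above; dspam_aliases is a hypothetical helper and jdoe is a made-up
user name.

```shell
#!/bin/sh
# Hypothetical helper: print one spam-forwarding alias per user name given as
# an argument, using the DSPAM path and flags from the example entry.
dspam_aliases() {
    for user in "$@"; do
        printf 'spam-%s: "|/usr/local/bin/dspam -d %s --addspam"\n' "$user" "$user"
    done
}

# Example: review the output before appending it to the aliases file.
dspam_aliases slowe jdoe
```

After appending the output to the aliases file, run newaliases as described
above.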

DSPAM with smrsh
If you are using a Sendmail system that uses smrsh (Sendmail restricted
shell), you also need to add DSPAM's executable as a program that is allowed
to be used by Sendmail. This is as easy as placing a link to the DSPAM
executable in the smrsh configuration directory, which is typically
/etc/smrsh. The following two commands accomplish this goal:
cd /etc/smrsh
ln -s /usr/local/bin/dspam dspam

If you use smrsh and fail to do this, you will be unable to forward spam to
the spam identification accounts, and DSPAM will be unable to learn its job.

Using DSPAM
At this point, you should have a working DSPAM/Sendmail system with
appropriate aliases for your users. Now, if your users receive spam, they
should forward it to the "spam-username" alias you set up for them. As DSPAM
learns what kind of mail the user considers spam, it will eventually begin
simply blocking the spam items. In general, DSPAM can begin blocking with
fewer than 50 e-mails forwarded to the spam agent, but it takes 200 to 300
for it to be truly useful.

As a test, I sent a few e-mails to the root user's spam account on my lab
system to see what kind of statistics DSPAM compiled. I can get details on
DSPAM's statistics by executing /usr/local/bin/dspam_stats. For the root
user, I got the following statistics:
root 0 TS 7 TI 1 TM 0 FP

This indicates that seven innocent messages (TI) and one spam miss (TM) have
been recorded, while no spam messages have been caught (TS), and there have
been no false positives (FP).
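
These counters can be turned into a single figure, the share of spam that was
caught: TS / (TS + TM). The sketch below assumes every line has exactly the
layout shown above; spam_catch_rate is a name invented for this example and is
not a DSPAM tool.

```shell
#!/bin/sh
# Hypothetical helper: read lines of the form "user N TS N TI N TM N FP" on
# standard input and print each user's spam catch rate, TS / (TS + TM).
spam_catch_rate() {
    awk '{
        ts = $2; tm = $6
        # No spam seen at all yet: a rate would be meaningless
        if (ts + tm == 0) { print $1, "n/a"; next }
        printf "%s %.0f%%\n", $1, 100 * ts / (ts + tm)
    }'
}

# Example using the statistics line from my lab system:
echo "root 0 TS 7 TI 1 TM 0 FP" | spam_catch_rate
```

On a live system, the output of /usr/local/bin/dspam_stats could be piped into
the function instead of the echo line.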

Administrative tasks
You need to perform some administrative tasks to keep DSPAM running
efficiently and to keep it from gobbling up too much disk space. Each night,
you should run a cron job that runs the dspam_clean program to clean the
signature database. To do this, add the following line to the crontab so that
it runs every night at midnight:
0 0 * * * /usr/local/bin/dspam_clean

Every five days or so, you should also run the dspam_purge program to
optimize the user dictionary files. The following cron configuration will do
the trick:
0 0 5,10,15,20,25,30 * * /usr/local/bin/dspam_purge

Effective and free
DSPAM is not difficult to configure and maintain, and it can save an
organization both the administrative hassle and the financial burden that is
quickly mounting because of the massive amounts of spam that employees have
to deal with. Best of all, DSPAM is free, making it much more economical to
use than most other spam-fighting products.




