ANNOYANCE-FILTER(1)                                        ANNOYANCE-FILTER(1)
NAME
annoyance-filter - automatically detect junk mail
SYNOPSIS
annoyance-filter [ options ]
DESCRIPTION
annoyance-filter uses Bayesian statistics to determine the probability
an E-mail message is junk based on an analysis of its contents compared
to collections of known junk and legitimate E-mail.
This program is under active development; new versions are posted
frequently at:
http://www.fourmilab.ch/annoyance-filter/
Please visit this page for news about the program and to download the
latest version.
The project is hosted on SourceForge, where you will find the CVS
source code repository and release archives:
http://sourceforge.net/projects/annoyancefilter/
USAGE
annoyance-filter has a multitude of options which permit it to be used
in many different ways, but the most common application involves
training the program with collections of legitimate and junk mail in
order to create a dictionary which indicates the probability that words
identify a message as junk or non-junk (legitimate). Training must be
done before the program is used to classify incoming mail, but need be
done subsequently only when adding messages to the training
collections. As long as the overall content of the mail, junk and
legitimate, which you receive remains pretty much the same, there's no
need to retrain, but the ability to do so allows the program to
automatically adapt to evolving message content, which is particularly
characteristic of junk mail.
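As a rough illustration of what such a dictionary entry might hold, the sketch below derives a per-word junk probability from counts gathered during training. The formula is the one popularised by Paul Graham's "A Plan for Spam" and is an assumption here, not a statement of what annoyance-filter actually computes. The bias_mail and new_word defaults mirror the --biasmail and --newword defaults given under OPTIONS; the function name, the min_occurrences cutoff, and the 0.01 to 0.99 clamp are hypothetical.

```python
def word_junk_probability(mail_count, junk_count, total_mail, total_junk,
                          bias_mail=2.0, new_word=0.2, min_occurrences=5):
    """Estimate the probability that a word indicates junk (illustrative only).

    mail_count / junk_count: occurrences of the word in each collection.
    total_mail / total_junk: number of messages in each collection (> 0).
    """
    # Inflate the legitimate-mail count, as --biasmail is described as doing.
    weighted_mail = mail_count * bias_mail
    # Too few sightings to trust: fall back to the novel-word probability.
    # The cutoff of 5 is an assumption borrowed from Graham's essay.
    if weighted_mail + junk_count < min_occurrences:
        return new_word
    mail_freq = min(1.0, weighted_mail / total_mail)
    junk_freq = min(1.0, junk_count / total_junk)
    p = junk_freq / (mail_freq + junk_freq)
    # Clamp so that no single word is ever treated as absolute proof.
    return max(0.01, min(0.99, p))
```

A word seen only in junk lands at the upper clamp, a word seen only in legitimate mail at the lower clamp, and a rarely seen word receives the novel-word value.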
Suppose you have a collection of legitimate mail (in other words, mail
you wish to read) in a file named m-good and a collection of junk mail
(that which you don't wish to read) in file m-junk. These collections
may be in ``Unix mail folder'' format, which is simply the text of one
or more E-mail messages concatenated together in a single text file, or
may be the names of directories containing files, each of which may be
a single E-mail message or a Unix mail folder. In either case, if a
message file is compressed with gzip, it will be automatically
uncompressed on the fly. Directories of messages may not, however,
contain other directories of messages.
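The following minimal sketch shows how a Unix mail folder can be split into individual messages, under the two rules described for the --bsdfolder option: by default a line beginning with ``From '' starts a new message only at the start of the file or after a blank line, while under BSD rules it always does. The function name split_folder is hypothetical, and the program's own parser may well differ in detail.

```python
def split_folder(text, bsd=False):
    """Split mail-folder text into a list of message strings."""
    messages, current = [], []
    previous_blank = True  # the start of the file counts as a boundary
    for line in text.splitlines(keepends=True):
        # Default rule: "From " must follow a blank line; BSD rule: always.
        starts_message = line.startswith("From ") and (bsd or previous_blank)
        if starts_message and current:
            messages.append("".join(current))
            current = []
        current.append(line)
        previous_blank = line.strip() == ""
    if current:
        messages.append("".join(current))
    return messages


folder = (
    "From a@example.com Mon\nSubject: one\n\nbody\n"
    "From here on this is body text\n\n"
    "From b@example.com Tue\nSubject: two\n\nbody two\n"
)
default_count = len(split_folder(folder))        # unquoted body line is kept
bsd_count = len(split_folder(folder, bsd=True))  # body line splits a message
```

The unquoted ``From '' line in the first message body is what makes the two rules disagree, which is why BSD-style folders quote such lines.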
To train annoyance-filter with these collections and create a
dictionary, use a command like:
annoyance-filter --mail m-good --junk m-junk --prune --write dict.bin
where dict.bin is the name of the dictionary file you wish to create.
Now that the dictionary has been created, you can use it on subsequent
runs to compute the probability a message is junk and classify it
accordingly. Suppose you have an E-mail message in the file mail.txt.
To compute its junk probability and display it on standard output, use the
command:
annoyance-filter --read dict.bin --test mail.txt
To integrate annoyance-filter into a mail processing system such as
procmail, you'll usually want to run it as a filter which reads
incoming messages from standard input (piped there by the mail
processing system), classifies them and adds annotations to the message
header indicating the classification, then writes the message with
header annotations to standard output. The mail processing system may
then examine the header annotations and route the message accordingly.
To filter a message, again assuming the dictionary created by the
training run is in the file dict.bin, use the command:
annoyance-filter --read dict.bin --transcript - --test -
Here the --transcript option is used to request the input message be
copied to an output file, in this case standard output, specified by
``-'', with the message read from standard input, the ``-'' argument to
the --test option.
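A downstream program can then route the annotated message by reading the header items described under the --transcript option. The sketch below is one hedged way to do so in Python; the function name route_message and the folder names are hypothetical, and it assumes the X-Annoyance-Filter-Classification value is exactly one of the three words listed for --transcript.

```python
from email import message_from_string


def route_message(annotated_text):
    """Return a (hypothetical) folder name for an annotated message."""
    message = message_from_string(annotated_text)
    classification = message.get("X-Annoyance-Filter-Classification", "")
    # Folder names are placeholders; substitute whatever your setup uses.
    folders = {"junk": "junk", "mail": "inbox", "indeterminate": "review"}
    # Unannotated or unrecognised messages are delivered normally.
    return folders.get(classification.strip().lower(), "inbox")


sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Annoyance-Filter-Junk-Probability: 0.97\n"
    "X-Annoyance-Filter-Classification: Junk\n"
    "\n"
    "message body\n"
)
destination = route_message(sample)
```

Falling back to the inbox for anything unrecognised keeps a misconfigured filter from silently discarding legitimate mail.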
OPTIONS
Options are specified on the command line. Options are treated as
commands—most instruct the program to perform some specific action;
consequently, the order in which they are specified is significant;
they are processed left to right. Long options beginning with ``--''
may be abbreviated to any unambiguous prefix; single-letter options
introduced by a single ``-'' without arguments may be aggregated.
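The prefix rule for long options can be pictured with the sketch below, which resolves an abbreviation against the long option names documented on this page. It follows the usual getopt_long-style convention (an exact match wins, otherwise exactly one option must start with the prefix); that the program resolves prefixes in precisely this way is an assumption, and resolve_option is a hypothetical name.

```python
LONG_OPTIONS = [
    "--annotate", "--autoprune", "--biasmail", "--binword", "--bsdfolder",
    "--classify", "--clearjunk", "--clearmail", "--copyright", "--csvread",
    "--csvwrite", "--fread", "--fwrite", "--help", "--junk", "--list",
    "--mail", "--newword", "--pdiag", "--phraselimit", "--phrasemin",
    "--phrasemax", "--plot", "--pop3port", "--pop3server", "--pop3trace",
    "--prune", "--ptrace", "--read", "--sigwords", "--statistics", "--test",
    "--threshjunk", "--threshmail", "--transcript", "--verbose", "--version",
    "--write",
]


def resolve_option(prefix, options=LONG_OPTIONS):
    """Expand an abbreviated long option to its full name."""
    if prefix in options:
        return prefix  # an exact match is never ambiguous
    matches = [name for name in options if name.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown option: {prefix}")
    raise ValueError(f"ambiguous option {prefix}: {', '.join(matches)}")
```

Under this rule ``--tr'' is enough to select --transcript, whereas ``--thresh'' is rejected because it could mean either --threshjunk or --threshmail.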
--annotate options
Add the annotations requested by the characters in options to
the transcript generated by the --transcript option. Upper
and lower case options are treated identically. Available
annotations are:
d    Decoder diagnostics
p    Parser warnings and error messages
w    Most significant words and their probabilities
--autoprune n
As the dictionary is being built by appending mail to it with
the --mail and --junk options, unique words will automatically
be pruned from it whenever the dictionary exceeds approximately
n bytes. This is particularly handy when loading large
collections of messages with --phrasemax set greater than one,
as a very large number of unique phrases may clutter the
dictionary being built and exceed the memory capacity of your
computer. You could split the mail collection into multiple
parts and explicitly --prune after each part, but --autoprune is
much more convenient.
--biasmail n
The frequency of words appearing in legitimate mail is inflated
by the floating point factor n, which defaults to 2. This
biases the classification of messages in favour of ``false
negatives'' (junk mail deemed legitimate), while reducing the
probability of ``false positives'' (legitimate mail erroneously
classified as junk, which is bad). The higher the setting of
--biasmail, the greater the bias in favour of false negatives
will be.
--binword n
Binary character streams (for example, attachments of
application-specific files, including the executable code of
worm and virus attachments) are scanned and contiguous sequences
of alphanumeric ASCII characters n characters or longer are
added to the list of words in the message. The dollar sign
(``$'') is considered an alphanumeric character for these
purposes, and words may have embedded hyphens and apostrophes,
but may not begin or end with those characters. If --binword is
set to zero, scanning of binary attachments is disabled
entirely. The default setting is 5 characters.
--bsdfolder
The next --mail or --junk folder will be parsed using ``classic
BSD'' rules for identifying the start of individual messages in
the folder. In BSD-style folders, the text ``From '' as the
leftmost characters of a line always denotes the start of a new
message: any appearance of this text in any other context is
always quoted, often by prefixing a ``>'' character. In the
default Unix folder syntax, ``From '' only marks the start of a
new message if it appears following one or more blank lines.
Note that you must specify --bsdfolder before each folder to be
read with BSD rules; it is not a modal setting.
--classify fname
Classify mail in fname. If it equals or exceeds the junk
threshold (see --threshjunk), ``JUNK'' is written to standard
output and the program exits with status code 3. If the message
scores less than or equal to the mail threshold (see
--threshmail), ``MAIL'' is written to standard output and the
program exits with status 0. If the message's score falls
between the two thresholds, its content is deemed indeterminate;
``INDT'' is written to standard output and the program exits
with a status of 4. The output can be used to set an
environment variable in Procmail to control the disposition of
the message. If fname is ``-'' the message is read from
standard input.
--clearjunk
Clear appearances of words in junk mail from database. Used
when preparing a database of legitimate mail.
--clearmail
Clear appearances of words in legitimate mail from database.
Used when preparing a database of junk mail.
--copyright
Print copyright information.
--csvread fname
Import a dictionary from a comma-separated value (CSV) file
fname. Records are assumed to be in the format written by
--csvwrite but need not be sorted in any particular order.
Words are added to those already in memory.
--csvwrite fname
Export the dictionary to the comma-separated value (CSV) file
fname. Such files can be loaded into spreadsheet or
database programs for further processing. Words are sorted
first in ascending order of probability they denote junk mail,
then lexically.
--fread, -r fname
Load a fast dictionary (previously created with the --fwrite
option) from file fname.
--fwrite fname
Write a dictionary to the file fname in fast dictionary format.
Fast dictionaries are written in a binary format which is not
portable across machines with different byte order conventions
and cannot be added incrementally to assemble a larger
dictionary, but can be loaded in a small fraction of the time
required by the format created by the --write command. Using a
fast dictionary for routine classification of incoming mail
drastically reduces the time consumed in loading the dictionary
for each message.
--help, -u
Print how-to-call information including a list of options.
--junk, -j fname
Add the mail in folder fname to the dictionary as junk mail.
These folders may be compressed by a utility the host system can
uncompress; specify the complete file name including the
extension denoting its form of compression. If fname is ``-''
the mail folder is read from standard input.
--list List the dictionary on standard output.
--mail, -m fname
Add the mail in folder fname to the dictionary as legitimate
mail. These folders may be compressed by a utility the host
system can uncompress; specify the complete file name including
the extension denoting its form of compression. If fname is
``-'' the mail folder is read from standard input.
--newword n
The probability that a word seen in mail which does not appear
in the dictionary (or appeared too few times to assign it a
probability with acceptable confidence) is indicative of junk is
set to n. The default is 0.2—the odds are that novel words are
more likely to appear in legitimate mail than in junk.
--pdiag fname
Write a diagnostic file to the specified fname containing the
actual lines the parser processed (after decoding of MIME parts
and exclusion of data deemed unparseable). Use this option when
you suspect problems in decoding or pre-parser filtering.
--phraselimit n
Limit the length of phrases assembled according to the
--phrasemin and --phrasemax options to n characters. This
permits ignoring ``phrases'' consisting of gibberish from mail
headers and un-decoded content. In most cases these items will
be discarded by a --prune in any case, but skipping them as they
are generated keeps the dictionary from bloating in the first
place. The default value is 48 characters.
--phrasemin n
Calculate probabilities of phrases consisting of a minimum of n
words. The default of 1 calculates probabilities for single
words.
--phrasemax n
Calculate probabilities of phrases consisting of a maximum of n
words. The default of 1 calculates probabilities for single
words. If you set this too large, the dictionary may grow to an
absurd size.
--plot fname
After loading the dictionary, create a plot in fname.png of the
histogram of words, binned by their probability of appearance in
junk mail. In order to generate the histogram the gnuplot and
Netpbm utilities must be installed on the system; if they are
absent, the --plot option will not be available.
--pop3port n
The POP3 proxy server activated by a subsequent --pop3server
option will listen for connections on port n. If no --pop3port
is specified, the server will listen on the default port of
9110. On most systems, you'll have to run the program as root
if you wish the proxy server to listen on a port numbered 1023
or less.
--pop3server server[:port]
Activate a POP3 proxy server which relays requests made on the
previously specified --pop3port or the default of 9110 if no
port is specified, to the specified server, which may be given
either as an IP address in ``dotted quad'' notation such as
10.89.11.131 or a fully-qualified domain name like
pop.someisp.tld. The port on which the server listens for POP3
connections may be specified after the server prefixed by a
colon (``:''); if no port is specified, the IANA assigned POP3
port 110 will be used. The POP3 proxy server will pass each
message received on behalf of a requestor through the classifier
and return the annotated transcript to the requestor, who may
then filter it based on the classification appended to the
message header. You must load a dictionary before activating the
POP3 proxy server, and the --pop3server option must be the last
on the command line. The server continues to run and service
requests until manually terminated.
--pop3trace
Write a trace of POP3 proxy server operations to standard error.
Each trace message (apart from the dump of the body of multi-
line replies to clients) is prefixed with the label ``POP3: ''.
--prune
After loading the dictionary from --mail and --junk folders,
this option discards words which appear sufficiently
infrequently that their probability cannot be reliably
estimated. One usually prunes the dictionary with --prune before using
--write to save it for subsequent runs.
--ptrace
Include a token-by-token trace in the --pdiag output file. This
helps when adjusting the parser's criteria for recognising
tokens. Setting this option without also specifying a --pdiag
file will have no effect other than perhaps to exercise your
fingers typing it on the command line.
--read, -r fname
Load a dictionary (previously created with the --write option)
from file fname.
--sigwords n
The probability that a message is junk will be computed based on
the individual probabilities of the n words with extremal
probabilities; that is, probabilities most indicative of junk or
mail. The default is 15, but there's no obvious optimal setting
for this parameter; it depends in part on the average length of
messages you receive.
--statistics
After loading the dictionary from --mail and --junk folders,
print statistics of the distribution of junk probabilities of
words in the dictionary. The statistics are written to standard
output.
--test, -t fname
Test mail in fname and write the estimated probability it is
junk to standard output unless the --transcript option is also
specified with standard output (``-'') as the destination, in
which case the inclusion of the probability and classification
in the transcript is adjudged sufficient. If the --verbose
option is specified, the individual probabilities of the ``most
interesting'' words in the message will also be output. If
fname is ``-'' the message is read from standard input.
--threshjunk n
Set the threshold for classifying a message as junk to the
floating point probability value n. The default threshold is
0.9; messages scored above --threshjunk are deemed junk.
--threshmail n
Set the threshold for classifying a message as legitimate mail
to the floating point probability value n. The default
threshold is 0.9, with messages scored below --threshmail deemed
legitimate. Note that you may leave a gap between the
--threshmail and --threshjunk values (although it makes no sense
to set --threshmail higher). Mail scored between the two
thresholds will then be judged of uncertain status.
--transcript fname
Write an annotated transcript of the original message to the
specified fname. If fname is ``-'', the transcript is written
to standard output. At the end of the message header, the program
adds an X-Annoyance-Filter-Junk-Probability header item giving the
computed probability and an X-Annoyance-Filter-Classification
item giving the classification of the message according to
the --threshmail and --threshjunk settings; the classification
is given as ``Mail'', ``Junk'', or ``Indeterminate''.
--verbose, -v
Print diagnostic information as the program performs various
operations.
--version
Print program version information.
--write fname
Write a dictionary to the file fname. The dictionary is written
in a binary format which may be loaded on subsequent runs with
the --read option. Binary dictionary files are portable among
machines with different architectures and byte order.
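To see how the --sigwords, --threshjunk, and --threshmail settings interact, consider the following sketch. It keeps the probabilities farthest from the neutral value 0.5, combines them with the naive Bayesian formula from Paul Graham's "A Plan for Spam", and applies the thresholds as described for --classify. The combining formula is an assumption (this page does not state it), and both function names are hypothetical; the default values are those given above.

```python
from math import prod


def message_junk_probability(word_probabilities, sig_words=15):
    """Combine per-word junk probabilities into one message score.

    Assumes every probability lies strictly between 0 and 1.
    """
    # Keep the sig_words values most indicative of junk or of mail.
    extremal = sorted(word_probabilities, key=lambda p: abs(p - 0.5),
                      reverse=True)[:sig_words]
    junk = prod(extremal)
    mail = prod(1.0 - p for p in extremal)
    return junk / (junk + mail)


def classify(score, thresh_junk=0.9, thresh_mail=0.9):
    """Map a score to the labels the --classify option writes."""
    if score >= thresh_junk:
        return "JUNK"
    if score <= thresh_mail:
        return "MAIL"
    return "INDT"  # only reachable when thresh_mail < thresh_junk
```

With the default thresholds equal, every message is either JUNK or MAIL; lowering thresh_mail (for example to 0.1) opens the indeterminate band between the two.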
EXIT STATUS
The program exits with a status of 0 when processing is successfully
completed, 1 when an error (I/O or file access in most cases) occurs,
and 2 to indicate a command line syntax error. If the --classify
option is specified, an exit status of 0 identifies the message tested
as legitimate mail, 3 marks it as junk, and a status of 4 is returned
for messages which cannot be confidently classified as either mail or
junk.
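A calling script might interpret these status codes as in the sketch below. The names interpret_status and classify_file are hypothetical, and classify_file assumes that annoyance-filter is on the search path and that dictionary was written earlier with --write; the command line itself uses only the --read and --classify options documented above.

```python
import subprocess

# Status 0 doubles as "success" and, under --classify, "legitimate mail".
STATUS_MEANINGS = {
    0: "mail",
    1: "error",
    2: "usage",
    3: "junk",
    4: "indeterminate",
}


def interpret_status(code):
    """Translate an exit status from a --classify run into a label."""
    return STATUS_MEANINGS.get(code, "unknown")


def classify_file(dictionary, message_path, program="annoyance-filter"):
    """Run the classifier on one message file and label the outcome."""
    result = subprocess.run(
        [program, "--read", dictionary, "--classify", message_path],
        capture_output=True,
        text=True,
    )
    return interpret_status(result.returncode)
```

Treating statuses 1 and 2 separately from the three classifications lets the caller deliver mail normally when the filter itself fails, rather than mistaking a failure for a verdict.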
FILES
Files are read or written as requested by options on the command line;
all options which read or write files take a fname argument which gives
the file name. The --classify, --junk, --mail, --test, and
--transcript options interpret an argument of ``-'' as denoting
standard input or output.
On systems which provide the required services and utilities, arguments
to the --junk and --mail options may be compressed files or the name of
a directory containing one or more messages which will be read as if
logically concatenated. Messages in the directory may be compressed or
uncompressed.
Error messages and diagnostic output generated when the --verbose
option is specified are written to standard error.
BUGS
Millions, doubtless. This is a program which must cope with whatever
garbage is fed to it from mail folders, trying to make the best of it.
When it messes up, your efforts in identifying the message which caused
the problem and submitting a verbatim copy of it with your bug report
are much appreciated.
Please report bugs to bugs@fourmilab.ch and include annoyance-filter in
the Subject line. Thanks in advance.
AUTHOR
John Walker
http://www.fourmilab.ch/
This software is in the public domain. Permission to use, copy, modify,
and distribute this software and its documentation for any purpose and
without fee is hereby granted, without any conditions or restrictions.
This software is provided ``as is'' without express or implied
warranty.
SEE ALSO
gnuplot(1), gs(1), gzip(1), netpbm(1), procmail(1), xpdf(1)
annoyance-filter is written using the Literate Programming
http://www.literateprogramming.com/ methodology; the user manual,
program, and internal documentation are developed together, closely
interlinked. Whenever the program is modified, the documentation is
automatically updated, reducing the risk of divergence between what the
manual says and what the program does.
This man page is intended as a reference for the command line options
and most common applications of the program. For comprehensive
documentation, including details of how to integrate annoyance-filter
with the procmail mail processing system, please refer to the complete
documentation published in PDF format, available on the Web at:
http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf
If you have downloaded the annoyance-filter source distribution, the
corresponding version of annoyance-filter.pdf is included in the
archive. You can read PDF files with Acrobat Reader (a free download
from http://www.adobe.com/acrobat/readstep.html) or the xpdf or
Ghostscript (gs) utilities.
4th Berkeley Distribution 19 FEB 2003 ANNOYANCE-FILTER(1)