How Emmesmail Fights Spam

Emmesmail utilizes a multi-faceted approach to junk mail that achieves a spam rejection rate > 98%, with fewer than 1% false positives (each rejection of valid email being considered a false positive). Emmesmail uses a whitelist, a blacklist, a Bayesian filter and an "appropriateness" filter. The whitelist and blacklist are user-specific locally generated files; Emmesmail does not import any global lists of spammers.

Here is an outline of how it works:

1) The first thing Emmesmail does is to check if the sender is included in the local whitelist, a list of senders previously determined not to be spammers. If so, the email is immediately delivered.

2) The next thing Emmesmail does is to check if the sender is included in the local blacklist, a list of senders previously found to be spammers. If so, the email is re-directed from the recipient's mailbox to the spam mailbox.

3) If the sender is not included in either the whitelist or blacklist, the entire email, including the header, is examined by a Bayesian filter modeled after that of Paul Graham.

4) If the Bayesian filter reports that the email is likely spam, the email is re-directed to the spam mailbox and appended to the database of spam emails (see information on Bayesian filtering below), and the sender is added to the blacklist. If the Bayesian analysis indicates that the email is not spam, it is examined by a proprietary "appropriateness" filter, which examines the appropriateness of the words used in the email. This catches a small number of additional spam emails in which the inclusion of random innocuous words has fooled the Bayesian filter.

5) If an email passes all the filtering, it is forwarded to the recipient's mailbox. Just having an email passed by the filtering process is not sufficient to add that email's sender to the whitelist. This only occurs once the email is saved by the user/recipient.
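The steps above can be sketched in Python as follows. This is only an illustration of the ordering of the checks, not Emmesmail's code: the names `SenderLists`, `route_email` and `on_user_saved` are hypothetical, the two filters are passed in as plain functions, and the step of appending rejected email to the spam database is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class SenderLists:
    """User-specific, locally generated sender lists (hypothetical structure)."""
    whitelist: Set[str] = field(default_factory=set)
    blacklist: Set[str] = field(default_factory=set)


def route_email(sender: str,
                text: str,
                lists: SenderLists,
                bayes_is_spam: Callable[[str], bool],
                is_appropriate: Callable[[str], bool]) -> str:
    """Return 'inbox' or 'spam', following steps 1 to 5."""
    # Step 1: senders on the whitelist are delivered immediately.
    if sender in lists.whitelist:
        return "inbox"
    # Step 2: senders on the blacklist go to the spam mailbox.
    if sender in lists.blacklist:
        return "spam"
    # Steps 3 and 4: Bayesian filter over the entire email, header included.
    if bayes_is_spam(text):
        lists.blacklist.add(sender)  # a Bayesian hit also blacklists the sender
        return "spam"
    # Step 4, second half: appropriateness filter for email that passed.
    if not is_appropriate(text):
        return "spam"
    # Step 5: delivered, but the sender is not whitelisted at this point.
    return "inbox"


def on_user_saved(sender: str, lists: SenderLists) -> None:
    """The sender is whitelisted only once the recipient saves the email."""
    lists.whitelist.add(sender)
```

Passing the filters in as functions keeps the ordering of the checks separate from how each filter is implemented.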

If, upon checking the spam mailbox, the user finds that a valid email has been diverted there by mistake, a single click will correct the mistake, deliver the mail to the intended recipient, and correct the databases.
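One plausible reading of this correction step is sketched below. The text does not say exactly which databases are corrected, so everything here is an assumption: the function name `correct_false_positive`, the use of plain lists and a set, and the choice to move the email from the spam corpus to the non-spam corpus and to remove the sender from the blacklist.

```python
from typing import List, Set


def correct_false_positive(email: str,
                           sender: str,
                           spam_mailbox: List[str],
                           inbox: List[str],
                           spam_corpus: List[str],
                           nonspam_corpus: List[str],
                           blacklist: Set[str]) -> None:
    """Undo a mistaken spam classification (hypothetical sketch)."""
    # Deliver the mail to the intended recipient.
    spam_mailbox.remove(email)
    inbox.append(email)
    # Assumed database correction: the email no longer counts as spam ...
    if email in spam_corpus:
        spam_corpus.remove(email)
    # ... and is counted as valid email instead.
    nonspam_corpus.append(email)
    # Assumed: the sender should no longer be blocked outright.
    blacklist.discard(sender)
```

The sender is not added to the whitelist here, since whitelisting is described as happening only when the recipient saves the email.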

Initially, Emmesmail rejects spam based upon Emmes Technologies' databases that come with the product, until such time as the user's databases become large enough to use.

When Emmesmail has determined that an email is spam, it can, if configured to do so, return the email to the spammer with a customizable "failure-to-deliver" message. Most authorities recommend that this feature not be used.

Emmesmail's Implementation of a Bayesian Filter

We found that in implementing the Bayesian filter described by Paul Graham, the following parameters needed to be defined.

Parameter  Definition                                                       Value chosen
---------  ---------------------------------------------------------------  -------------------------
MAXW       Maximum number of tokens allowed in the hash table               250000
MWDS       Maximum number of words considered when calculating weights      9000
WMIN       Minimum length of a hash table token                             2
WMAX       Maximum length of a hash table token                             40
PMIN       Minimum probability of a token                                   0.0001
PMAX       Maximum probability of a token                                   0.9999
PUNK       Probability given a token not seen previously                    0.5
MINO       Minimum number of times a token must appear in corpora to count  4
MNUM       Maximum number of emails in each corpus before thinning          350
RNUM       Number of emails remaining after thinning                        250
CUT        Likelihood above which an email is considered spam               0.5
NTW        Number of words to weigh in likelihood calculation               15
AFPB       Anti false-positive bias factor                                  0.5
-          Characters which act as token separators                         \040, \011, \012, @, ?, /


WMIN: Was set to 2 to avoid examining single letters.

WMAX: This eliminates long, undecipherable tokens such as those that occur in PDF documents.
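A minimal tokenizer consistent with the separator characters, WMIN and WMAX from the table might look like the sketch below. The function name `tokenize` is hypothetical, and two details are assumptions: tokens outside the length limits are dropped rather than truncated, and no case folding is applied.

```python
import re
from typing import List

WMIN = 2   # minimum token length
WMAX = 40  # maximum token length

# Separators from the table: space (\040), tab (\011), newline (\012), @, ?, /
SEPARATORS = re.compile(r"[ \t\n@?/]+")


def tokenize(text: str) -> List[str]:
    """Split on the separator characters and keep tokens of WMIN..WMAX characters."""
    return [tok for tok in SEPARATORS.split(text) if WMIN <= len(tok) <= WMAX]
```

For example, `tokenize("Hi a bob@example.com/path?q")` keeps `Hi`, `bob`, `example.com` and `path`, and discards the single-letter tokens `a` and `q`.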

PMIN, PMAX: Not 0 or 1, in order to avoid division by zero in the calculations. Also, if PMIN is too close to 0 or PMAX too close to 1, a single word can carry too much weight.

MINO: A word must occur at least four times in our corpora to be significant with regard to determining whether an email is spam. Graham used five, but we felt four might allow one fewer spam email to be passed during the filter's training period.

MNUM, RNUM: When one of our corpora reaches MNUM emails, we reduce it to include only the RNUM most recent and then add new ones until the total number is again MNUM.
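The thinning rule can be sketched as follows, assuming each corpus is kept as a list ordered from oldest to newest. The function name `add_to_corpus` is hypothetical, and the defaults are the MNUM and RNUM values from the parameter table.

```python
from typing import List

MNUM = 350  # maximum number of emails in a corpus before thinning
RNUM = 250  # number of emails remaining after thinning


def add_to_corpus(corpus: List[str], email: str,
                  mnum: int = MNUM, rnum: int = RNUM) -> None:
    """Append an email; once the corpus reaches mnum, keep only the rnum newest."""
    corpus.append(email)
    if len(corpus) >= mnum:
        # Drop everything except the most recent rnum emails.
        del corpus[:-rnum]
```

Thinning in blocks rather than one email at a time means the corpus is rebuilt only occasionally, while still favoring recent mail.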

CUT, NTW: Like the original Paul Graham filter, we calculate the likelihood of an email being spam according to the formula

Likelihood = pspam/(pspam + pnspam)

where pspam = w1*w2*w3*...*wn, pnspam = (1-w1)*(1-w2)*...*(1-wn), and the wn are the weights of the tokens in the email. Like the original Graham protocol, we arbitrarily consider only the NTW (15) most significant weights (those closest to 0 or 1) in the calculation of likelihood, and we reject emails whose likelihood of spam is greater than CUT. We set CUT to 0.5, a logical choice. Setting CUT to 0.9, as in Graham's formulation, gives the same results since, as he points out, the probabilities tend to be close to 0 or 1, with hardly any falling between 0.5 and 0.9.
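The likelihood formula translates directly into a few lines of Python. The sketch below assumes the token weights have already been computed and clamped between PMIN and PMAX, so the denominator cannot be zero; the function names `spam_likelihood` and `is_spam` are hypothetical.

```python
import math
from typing import Iterable

NTW = 15   # number of most significant weights to use
CUT = 0.5  # likelihood above which an email is considered spam


def spam_likelihood(weights: Iterable[float], ntw: int = NTW) -> float:
    """Likelihood = pspam / (pspam + pnspam) over the ntw most significant weights."""
    # "Most significant" means furthest from the neutral value 0.5.
    top = sorted(weights, key=lambda w: abs(w - 0.5), reverse=True)[:ntw]
    pspam = math.prod(top)
    pnspam = math.prod(1.0 - w for w in top)
    return pspam / (pspam + pnspam)


def is_spam(weights: Iterable[float], cut: float = CUT) -> bool:
    """Reject the email when its likelihood of being spam exceeds cut."""
    return spam_likelihood(weights) > cut
```

With no weights at all, both products are 1 and the likelihood is exactly 0.5, which does not exceed CUT, so such an email would not be rejected by this sketch.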

AFPB: The anti false-positive bias factor. The weights, wn, strictly should be calculated according to the formula

wn = a/( a + b )

where a and b are the frequencies of the word in the spam and non-spam corpora, respectively. The description of the original filter recommended counting the words in the non-spam corpus twice in order to reduce the incidence of false positives. In our implementation this amounts to using the formula

wn = a/( a + b*AFPB )

where AFPB is 2.0. As noted in the table and below, this is not the value we use.
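A sketch of the weight calculation, combining the formula above with the PMIN, PMAX, PUNK and MINO parameters from the table, is given below. The function name `token_weight` and its `occurrences` argument are hypothetical, and assigning PUNK to tokens seen fewer than MINO times is an assumption; the text does not say how such tokens are handled.

```python
PMIN = 0.0001  # minimum probability of a token
PMAX = 0.9999  # maximum probability of a token
PUNK = 0.5     # probability given to a token not seen previously
MINO = 4       # minimum number of occurrences for a token to count
AFPB = 0.5     # anti false-positive bias factor (value from the table)


def token_weight(a: float, b: float, occurrences: int,
                 afpb: float = AFPB) -> float:
    """wn = a / (a + b*AFPB), clamped to [PMIN, PMAX].

    a, b        -- frequency of the token in the spam / non-spam corpus
    occurrences -- total number of times the token appears in both corpora
    """
    denominator = a + b * afpb
    # Assumption: tokens that are too rare are treated like unseen tokens.
    if occurrences < MINO or denominator == 0:
        return PUNK
    return min(PMAX, max(PMIN, a / denominator))
```

For a token that is equally frequent in both corpora, Graham's value of 2.0 gives a weight of 1/3, while a value of 0.5 gives 2/3, which shows how a factor below 1.0 pushes the weights toward spam.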

Our attempt to implement Graham's formulation exactly did not achieve as high a spam rejection rate as he reported. Although initially this may have been due to bugs in our implementation, we now feel that at least some of this might have been due to a change in the type of spam being sent since the time of his original report.

We eventually achieved spam rejection rates as high as those reported by Graham, but only after making a number of changes to our spam filtering. We introduced what we refer to as hierarchical filtering, which combines Bayesian filtering with sender-filtering, using a user-specific whitelist and blacklist, and an "appropriateness" filter.

Before using sender-filtering, we made certain to prevent our own email address from appearing on either the whitelist or blacklist, thus frustrating spammers who send spam that appears to come from the intended recipient.
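This safeguard can be sketched as a guard on every addition to either list. The function name `add_sender` is hypothetical, and comparing addresses case-insensitively after trimming whitespace is an assumption.

```python
from typing import Set


def add_sender(sender_list: Set[str], sender: str,
               own_addresses: Set[str]) -> bool:
    """Add sender to a whitelist or blacklist unless it is one of our own addresses.

    own_addresses is assumed to hold lower-case addresses.
    Returns True if the sender was added.
    """
    normalized = sender.strip().lower()
    if normalized in own_addresses:
        # Spam forged to look as if it came from the recipient must not
        # be able to whitelist (or blacklist) the recipient's own address.
        return False
    sender_list.add(normalized)
    return True
```

With the recipient's address kept off both lists, forged mail of this kind always falls through to the Bayesian filter.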

We also made some changes to our Bayesian filter.

One small change to the Bayesian filtering was to exclude certain words from the weighting procedure. When calculating the likelihood that a given email is spam, Emmesmail ignores the following tokens when they appear in the head of the email: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, and Dec. Instituting this change alleviated a problem in which a spate of spam emails at the beginning of a month was often followed by a false positive the first time a non-spam email arrived.
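A sketch of this exclusion, assuming the head and body have already been split into tokens, is shown below. The function name `tokens_for_weighting` is hypothetical, and exact, case-sensitive matching of the month abbreviations is an assumption.

```python
from typing import List

# Month abbreviations ignored when they appear in the head of the email.
MONTH_TOKENS = frozenset({
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
})


def tokens_for_weighting(head_tokens: List[str],
                         body_tokens: List[str]) -> List[str]:
    """Drop month tokens from the head only; body tokens are left untouched."""
    kept_head = [tok for tok in head_tokens if tok not in MONTH_TOKENS]
    return kept_head + body_tokens
```

A word such as "May" in the body of the email is still weighed; only the date-like tokens in the head are skipped.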

A more significant change involved the anti-false-positive bias factor, which was 2.0 in the original Graham formulation. We tried many changes to the AFPB, even allowing different values for the head and body of the email. None of these seemed to give spam rejection rates as high as Graham's. Finally, in frustration, we set the AFPB for both the head and body to 0.4. Surprisingly, this seemed to work, giving a spam rejection rate greater than 98% without a significant increase in false-positives. It should be noted that an anti-false-positive bias less than 1.0 is somewhat of a misnomer, since it acts more as an anti-spam bias. After more than two months in late 2008 without a single spam email slipping by our filter and a false positive rate of 1%, we set the anti-false-positive bias to 0.5.

Since version 8.4, we have applied an appropriateness filter to those emails that pass the Bayesian filter. The logic behind this is as follows. Standard emails, both spam and non-spam, contain a relatively narrow range of vocabulary, so that once the spam and non-spam corpora are reasonably sized, the majority of the words in these emails are already in the stored corpora. Some spammers choose to put random words in their emails, and some of these emails pass the Bayesian filter. Non-spam senders almost never include large numbers of inappropriate words in their emails. In order to trap the tiny fraction of spam emails with inappropriate words that might otherwise not get caught, the appropriateness filter examines whether or not each email passing the Bayesian filter contains a majority of "appropriate" words.
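The appropriateness filter itself is proprietary, so the following is only a guess at the general idea, not its actual logic. It assumes that an "appropriate" word is one already present in the stored corpora and that a majority means more than half of the tokens; the function name `is_appropriate` and the `threshold` parameter are hypothetical.

```python
from typing import List, Set


def is_appropriate(tokens: List[str], known_vocabulary: Set[str],
                   threshold: float = 0.5) -> bool:
    """Return True when more than `threshold` of the tokens are known words.

    known_vocabulary -- words already present in the spam and non-spam corpora
    """
    if not tokens:
        return True  # assumption: an email with no tokens is not rejected here
    known = sum(1 for tok in tokens if tok in known_vocabulary)
    return known / len(tokens) > threshold
```

An email padded with random words would score a low fraction of known tokens and fail this check even if its Bayesian likelihood was below CUT.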


Emmesmail's Spam Rejection Statistics by year

Year  Spam received  Spam rejected  Rejection rate (%)  Valid received  Valid rejected  False positives (%)
----  -------------  -------------  ------------------  --------------  --------------  -------------------
2003            276            256                92.8             682              28                  4.1
2004           1173           1099                93.7             834              15                  1.8
2005           2749           2624                95.5            1008              10                  1.0
2006          11677          11401                97.6             804              16                  2.0
2007          11622          11433                98.4             642               9                  1.4
2008          12971          12361                95.3             910               9                  1.0

    Use of a whitelist and blacklist to augment the Bayesian filter was introduced in mid-2004. A filter looking at "inappropriate" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection.
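The percentages in the table follow directly from the counts: each rate is the number of emails rejected divided by the number received, expressed as a percentage and rounded to one decimal place. The short sketch below recomputes them; the function name `rate_percent` and the tuple layout are chosen here for illustration only.

```python
def rate_percent(rejected: int, received: int) -> float:
    """Rejected emails as a percentage of received emails, to one decimal place."""
    return round(100.0 * rejected / received, 1)


# (year, spam received, spam rejected, valid received, valid rejected)
STATISTICS = [
    (2003, 276, 256, 682, 28),
    (2004, 1173, 1099, 834, 15),
    (2005, 2749, 2624, 1008, 10),
    (2006, 11677, 11401, 804, 16),
    (2007, 11622, 11433, 642, 9),
    (2008, 12971, 12361, 910, 9),
]

for year, spam_rec, spam_rej, valid_rec, valid_rej in STATISTICS:
    print(year,
          rate_percent(spam_rej, spam_rec),     # spam rejection rate (%)
          rate_percent(valid_rej, valid_rec))   # false positive rate (%)
```

For 2003, for instance, 256 of 276 spam emails were rejected (92.8%) and 28 of 682 valid emails were rejected (4.1%).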






Emmes Technologies
Updated Nov 2, 2008
