Emmesmail Development, 2004 to Present

2004-2014

A whitelist and a blacklist were introduced in mid-2004 to augment the Bayesian filter. A filter that examines "unrecognized" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection. In early 2014, token-paucity filtering was introduced, but it was applied only when the user had specified the use of sender filtering. The logic behind this was that many spammers were reducing the number of words in an email to hinder Bayesian filtering. Whereas a friend might send you an email containing only a few words like "my pics", a legitimate stranger would have to explain with something like "these are pics I am sending to persuade you to go on a date with me". Emmesmail assumes that if someone you don't know sends you an email with just "My pics", they are likely a spammer. Prior to 2011, Emmesmail ran only under Microsoft Windows. Since then, it has run only under Linux.

2015

In early 2015, we refined our email classification scheme to allow better assessment of the efficiency of each filter. Each email was assigned to one of the following classes:

ok-whitelist (sender is in the whitelist)
ok-passed-all (all filters used thought the email was not spam)
spam-blacklist (sender is in the blacklist)
ok-fp-blacklist (sender appeared to be in blacklist, probably as a wildcard entry, but the email was not spam)
spam-bayes (the Bayesian filter thought the email was spam)
ok-fp-bayes (the Bayesian filter thought the email was spam, but it was not)
spam-token-paucity (too few tokens for the number of characters in email)
ok-fp-token-paucity (the token-paucity filter misdiagnosed this valid email)
spam-unrecognized-words (high fraction of words not previously seen in emails)
ok-fp-unrecognized-words (high fraction of words not previously seen in emails, but not spam)
spam-missed (this spam email was missed by all the filters)
spam-user-defined (someone in the whitelist, or someone posing as them, sent this spam email)

This allowed a more detailed description of the filter results.

In the table below, the filters are listed in the order in which they are applied. The number of emails tested by each spam filter decreases from one filter to the next because the filtering is hierarchical: once an email is declared spam by one test, or valid by the whitelist, no further tests are done. The "No. tested" column lists the number of spam emails tested by that filter and is the denominator for the "% Spam rej." entry. The denominator for "% False Pos." is the total number of valid emails received (including the false positives).

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type          No. tested    No. caught    False Pos.   % Spam rej.  % False Pos.
Sender-filtering                2020          1404             0            70             0
Bayesian-filtering               616           520             7            84           1.1
Token-paucity                     96            67             9            70           1.4
Unrecognized-words                29            10             0            34             0
All filters combined            2020          2001            15          99.1           2.5

Total Valid Emails  Whitelisted  Passed all filters  False Pos.  % False Pos.
               611          590                   6          15           2.5
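
The sequential bookkeeping behind the 2015 statistics above can be checked with a short calculation. The following is only an illustrative sketch (the names `cascade`, `CAUGHT_2015` and `TOTAL_SPAM_2015` are ours and are not part of Emmesmail). It assumes that each filter sees only the spam left over by the filters applied before it, and recomputes the "No. tested" and "% Spam rej." columns from the "No. caught" counts.

```python
# 2015 spam counts from the statistics above, in the order the filters are applied.
CAUGHT_2015 = [
    ("Sender-filtering", 1404),
    ("Bayesian-filtering", 520),
    ("Token-paucity", 67),
    ("Unrecognized-words", 10),
]
TOTAL_SPAM_2015 = 2020


def cascade(total_spam, caught_by_filter):
    """Return (rows, missed); each row is (name, tested, caught, pct_rejected)."""
    rows = []
    remaining = total_spam
    for name, caught in caught_by_filter:
        rows.append((name, remaining, caught, 100.0 * caught / remaining))
        remaining -= caught  # spam caught here is never shown to later filters
    return rows, remaining


rows, missed = cascade(TOTAL_SPAM_2015, CAUGHT_2015)
for name, tested, caught, pct in rows:
    print(f"{name:<20} tested={tested:>4} caught={caught:>4} rejected={pct:.0f}%")

# Overall rejection rate: everything not missed, over all spam received.
overall = 100.0 * (TOTAL_SPAM_2015 - missed) / TOTAL_SPAM_2015
print(f"All filters combined: {overall:.1f}% rejected, {missed} missed")
```

Under that assumption, the number of spam emails reaching each filter comes out as 2020, 616, 96 and 29, which matches the "No. tested" column, and 19 spam emails are missed by all the filters.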


2016

In years prior to 2016, Emmesmail used a whitelist, a blacklist, a Bayesian filter, a token-paucity filter, and an "appropriateness" or "unrecognized words" filter. Filtering based upon the whitelist and blacklist was referred to as sender-filtering. In 2016, sender-filtering was separated into filtering against a whitelist and filtering against a blacklist. Black-sender-filtering was eliminated after it was noted that nearly all emails from senders on the blacklist were caught by the Bayesian filter as well. White-sender-filtering was continued in order to reduce false positives when people on the whitelist occasionally sent unusual, but nonetheless valid, emails that failed Bayesian filtering. Token-paucity filtering was also eliminated because it caused too many false positives relative to the number of additional spam emails it caught.

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type          No. tested    No. caught    False Pos.   % Spam rej.  % False Pos.
Bayesian-filtering              2264          2193             7            97           1.1
Unrecognized-words                69            39             6            57           0.9
All filters combined            2264          2234            13          98.7           2.0

Total Valid Emails  Whitelisted  Passed all filters  False Pos.  % False Pos.
               647          579                  55          13           2.0


2017

The success of eliminating the blacklist encouraged us to try to eliminate the "unrecognized" filter as well. In 2016, the "unrecognized" filter increased our overall spam rejection rate but also increased our false positive rate. In 2017, our intention was to experiment with modifying our Bayesian filter so that it could operate without any additional filters (aside from whitelist filtering). For many years prior to 2017, an unrecognized word in the Bayesian filter was assigned a spam-probability of PUNK = 0.5, essentially eliminating it from the calculation of the overall probability of spam. This was done even though the assumption behind our unrecognized filter was that the higher the fraction of unrecognized words, the higher the likelihood that an email is spam. Our plan for 2017, after eliminating the unrecognized filter, was to slowly increase the value of PUNK and examine the effect this had on the spam rejection and false positive rates. At the end of the year, we will evaluate, based upon the following table, the best value of PUNK to use in 2018.

Efficiency of Bayesian Filter with Different Values of PUNK

Period        PUNK  Spam Received  Spam Caught  Valid W.L.  Valid Passed-all  False Pos.  % Spam rej.  % False Pos.
All 2016       0.5           2264         2193         579                55           7         97.0           1.1
Jan-Mar 2017   0.6            576          558         222                20           1         96.9           0.5
Apr-Jun 2017   0.7            523          514         245                25           6         98.3           2.2
Jul-Sep 2017   0.8            503          490         208                59           3         97.4           1.1
Oct-Dec 2017   0.9            231          230          57                27           1         99.6           1.2
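
To illustrate why PUNK = 0.5 removes unrecognized words from the calculation, and why raising PUNK pushes emails with many unrecognized words toward spam, here is a minimal sketch. It assumes the commonly used naive-Bayes product rule for combining per-token probabilities; the formula Emmesmail actually uses is not given in this document, and the function names and per-token probabilities below are hypothetical.

```python
def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities with the naive-Bayes product rule."""
    p_spam = 1.0
    p_valid = 1.0
    for p in token_probs:
        p_spam *= p
        p_valid *= 1.0 - p
    return p_spam / (p_spam + p_valid)


def score(tokens, known_probs, punk):
    """Score an email; tokens missing from known_probs get the probability punk."""
    return combined_spam_probability([known_probs.get(t, punk) for t in tokens])


# Hypothetical per-token probabilities, for illustration only.
known = {"meeting": 0.2, "offer": 0.9}
email = ["meeting", "offer", "zxqv", "wplk"]  # the last two tokens are unrecognized

neutral = score(email, known, punk=0.5)  # unrecognized tokens cancel out
raised = score(email, known, punk=0.9)   # unrecognized tokens push toward spam
print(f"PUNK=0.5: {neutral:.3f}  PUNK=0.9: {raised:.3f}")
```

With PUNK = 0.5, each unrecognized token multiplies both products by the same factor, so the score is identical to that of the same email with the unrecognized tokens removed. With a larger PUNK, each unrecognized token raises the score.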

2018

At the end of 2017, it became apparent that our Bayesian filter had trouble distinguishing between the necessary bank emails containing a single-use security code and the irritating ones announcing that a requested transaction had been processed. We decided to try to resolve this problem by using our director file as an initial filter. The director file has been used since the inception of Emmesmail to decide to which account an incoming email should be directed, based upon the sender, the recipient and the tokens in the subject line. We added a line to that file that should direct emails to the spam account whenever the token "eStatement" appears in the subject.
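
The kind of rule described above might be sketched as follows. The director file's actual syntax is not shown in this document, so the rule format, the function `direct`, the account names other than "spam", and the example addresses are all hypothetical. The sketch also assumes exact, case-sensitive matching of whitespace-separated subject tokens.

```python
# Hypothetical rule format: each rule may constrain the sender, the recipient,
# and/or a subject token; the first matching rule decides the account.
RULES = [
    {"subject_token": "eStatement", "account": "spam"},
    {"recipient": "sales@example.com", "account": "sales"},
]


def direct(sender, recipient, subject, rules, default="inbox"):
    """Return the account for an email, or default if no rule matches."""
    subject_tokens = set(subject.split())
    for rule in rules:
        if "sender" in rule and rule["sender"] != sender:
            continue
        if "recipient" in rule and rule["recipient"] != recipient:
            continue
        if "subject_token" in rule and rule["subject_token"] not in subject_tokens:
            continue
        return rule["account"]
    return default


print(direct("alerts@bank.example", "me@example.com", "Your eStatement is ready", RULES))
print(direct("alerts@bank.example", "me@example.com", "Your security code", RULES))
```

In this sketch the first email is sent to the spam account before any Bayesian test is applied, while the security-code email falls through to the default account and would go on to normal filtering.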

Efficiency of Bayesian Filter with Different Values of PUNK

Period        PUNK  Spam Directed  Spam Bayes-tested  Spam Caught  Valid W.L.  Valid Passed-all  False Pos.  % Spam rej.  % False Pos.
Jan-Mar 2018   0.9
Apr-Jun 2018   0.8
Jul-Sep 2018   0.7
Oct-Dec 2018   0.6


Emmes Technologies
Updated 11 Nov, 2017