Emmesmail Development, 2004 to Present

2004-2014

Use of a whitelist and a blacklist to augment the Bayesian filter was introduced in mid-2004. A filter looking at "unrecognized" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection. In early 2014, token-paucity filtering was introduced, but it was applied only when the user had specified the use of sender filtering. The logic behind this was that many spammers were reducing the number of words in an email to hinder Bayesian filtering. A friend might send you an email containing only a few words, like "my pics", but a legitimate sender you do not know would have to explain more, for example "these are pics I am sending to persuade you to go on a date with me". Emmesmail assumes that someone you don't know who sends you an email containing just "My pics" is likely a spammer. Prior to 2011, Emmesmail ran only under Microsoft Windows. Since then, it has run only under Linux.

2015

In early 2015, we refined our email classification scheme to allow a better assessment of the efficiency of each filter. Each email was assigned one of the following classifications:

ok-whitelist (sender is in the whitelist)
ok-passed-all (all filters used thought the email was not spam)
spam-blacklist (sender is in the blacklist)
ok-fp-blacklist (sender appeared to be in blacklist, probably as a wildcard entry, but the email was not spam)
spam-bayes (the Bayesian filter thought the email was spam)
ok-fp-bayes (the Bayesian filter thought the email was spam, but it was not)
spam-token-paucity (too few tokens for the number of characters in email)
ok-fp-token-paucity (the token-paucity filter misdiagnosed this valid email)
spam-unrecognized-words (high fraction of words not previously seen in emails)
ok-fp-unrecognized-words (high fraction of words not previously seen in emails, but not spam)
spam-missed (this spam email was missed by all the filters)
spam-user-defined (someone in the whitelist, or someone posing as them, sent this spam email)

This allowed a more detailed description of the filter results.
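
The labelling scheme above can be sketched in a few lines of Python. This is only an illustration, not Emmesmail's actual code: the function name classify, its arguments, and the FILTER_ORDER list are hypothetical, and the sketch assumes that the final spam/valid judgement comes from the user reviewing each email.

```python
from typing import Optional

# Filters in the order they are applied (names chosen to match the labels above;
# the list itself is a hypothetical helper, not part of Emmesmail).
FILTER_ORDER = ["blacklist", "bayes", "token-paucity", "unrecognized-words"]


def classify(whitelisted: bool, flagged_by: Optional[str], is_spam: bool) -> str:
    """Return one of the twelve classification labels for a single email.

    whitelisted: the sender is in the whitelist
    flagged_by:  name of the first filter that declared the email spam, or None
    is_spam:     the final judgement of the email (assumed to come from the user)
    """
    if whitelisted:
        # A whitelisted sender bypasses all other tests.
        return "spam-user-defined" if is_spam else "ok-whitelist"
    if flagged_by is None:
        # No filter objected to the email.
        return "spam-missed" if is_spam else "ok-passed-all"
    if flagged_by not in FILTER_ORDER:
        raise ValueError(f"unknown filter: {flagged_by}")
    # A filter objected: either a correct catch or a false positive.
    return f"spam-{flagged_by}" if is_spam else f"ok-fp-{flagged_by}"


if __name__ == "__main__":
    print(classify(False, "bayes", True))           # caught by the Bayesian filter
    print(classify(False, "token-paucity", False))  # valid email wrongly flagged
    print(classify(True, None, True))               # spam from a whitelisted address
```

Keeping the false-positive labels separate for each filter is what makes it possible to count false positives filter by filter.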

In the table below, the filters are listed in the order applied. The number of emails tested by each spam filter decreases from one filter to the next because the filtering is hierarchical: once an email is declared spam by one test, or valid by the whitelist, no further tests are done. The "No. tested" column lists the number of spam emails tested by that filter and is the denominator for the "% Spam rej." entry. The denominator for "% False Pos." is the total number of valid emails received (including the false positives).

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type       No. tested   No. caught   False Pos.   % Spam rej.   % False Pos.
Sender-filtering             2020         1404            0            70              0
Bayesian-filtering            616          520            7            84            1.1
Token-paucity                  96           67            9            70            1.4
Unrecognized-words             29           10            0            34              0
All filters combined         2020         2001           15          99.1            2.5

Total Valid Emails   Whitelisted   Passed all filters   False Pos.   % False Pos.
               611           590                    6           15            2.5
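
The way the "No. tested" column shrinks from filter to filter, and the way the rejection percentages are formed, can be checked with a short Python sketch. The counts are copied from the 2015 table; the function name cascade_rows and the variable names are hypothetical, chosen only for this illustration.

```python
# Spam caught by each filter, in the order the filters are applied (2015 table).
CAUGHT = [
    ("Sender-filtering", 1404),
    ("Bayesian-filtering", 520),
    ("Token-paucity", 67),
    ("Unrecognized-words", 10),
]
TOTAL_SPAM = 2020       # spam emails entering the first filter
TOTAL_VALID = 611       # valid emails received, false positives included
TOTAL_FALSE_POS = 15    # valid emails wrongly rejected by any filter


def cascade_rows(total_spam, caught):
    """Return (name, tested, caught, percent rejected) for each filter in order."""
    rows = []
    remaining = total_spam
    for name, n_caught in caught:
        rows.append((name, remaining, n_caught, 100.0 * n_caught / remaining))
        # Only spam that slipped through reaches the next filter.
        remaining -= n_caught
    return rows


rows = cascade_rows(TOTAL_SPAM, CAUGHT)
for name, tested, n_caught, pct in rows:
    print(f"{name:<20}{tested:>8}{n_caught:>8}{pct:>8.0f}")

total_caught = sum(n for _, n in CAUGHT)
overall_rejection = 100.0 * total_caught / TOTAL_SPAM
overall_false_pos = 100.0 * TOTAL_FALSE_POS / TOTAL_VALID
print(f"All filters combined: {overall_rejection:.1f}% rejected, "
      f"{overall_false_pos:.1f}% false positives")
```

Each filter's rejection rate is measured against the spam that reached it, while the combined rate is measured against all spam received, which is why the combined figure is higher than any single filter's.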


2016

In years prior to 2016, Emmesmail used a whitelist, a blacklist, a Bayesian filter, a token-paucity filter, and an "appropriateness" or "unrecognized words" filter. Filtering based upon the whitelist and blacklist was referred to as sender-filtering. In 2016, sender-filtering was separated into filtering against a whitelist and filtering against a blacklist. Black-sender-filtering was eliminated after it was noted that nearly all emails from blacklisted senders were caught by the Bayesian filter as well. White-sender-filtering was continued in order to reduce false positives when people on the whitelist occasionally sent unusual, but nonetheless valid, emails that failed Bayesian filtering. Token-paucity filtering was also eliminated because it caused too many false positives relative to the number of additional spam emails it caught.

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type       No. tested   No. caught   False Pos.   % Spam rej.   % False Pos.
Bayesian-filtering           2264         2193            7            97            1.1
Unrecognized-words             69           39            6            57            0.9
All filters combined         2264         2234           13          98.7            2.0

Total Valid Emails   Whitelisted   Passed all filters   False Pos.   % False Pos.
               647           579                   55           13            2.0


2017

The success of eliminating the blacklist encouraged us to try eliminating the "unrecognized" filter as well. In 2016, the "unrecognized" filter increased our overall spam rejection rate but also increased our false positive rate. In 2017, our intention was to experiment with modifying our Bayesian filter so that it could operate without any additional filters (aside from whitelist filtering). For many years prior to 2017, an unrecognized word in the Bayesian filter was assigned a spam probability of PUNK = 0.5, essentially eliminating it from the calculation of the overall probability of spam. This was done even though the assumption behind our unrecognized filter was that the higher the fraction of unrecognized words, the higher the likelihood that an email is spam. Our plan for 2017, after eliminating the unrecognized filter, was to slowly increase the value of PUNK and examine the effect this had on the spam rejection and false positive rates. While the results are not conclusive, plots of the spam rejection and false positive rates versus PUNK suggest that higher values of PUNK are well correlated with a higher spam rejection rate and much less well correlated with the false positive rate. We will attempt to confirm this in 2018.

Efficiency of Bayesian Filter with Different Values of PUNK

Period         PUNK   Spam Received   Spam Caught   Valid Whitelisted   Valid Passed-all   False Pos.   % Spam rej.   % False Pos.
All 2016        0.5            2264          2193                 579                 55            7          97.0            1.1
Jan-Mar 2017    0.6             576           558                 222                 20            1          96.9            0.5
Apr-Jun 2017    0.7             523           514                 245                 25            6          98.3            2.2
Jul-Sep 2017    0.8             503           490                 208                 59            3          97.4            1.1
Oct-Dec 2017    0.9             573           569                 138                 42            1          99.3            0.6
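
Why PUNK = 0.5 removes an unrecognized word from the calculation, and why a larger PUNK pushes the result toward spam, can be illustrated with a small Python sketch. It assumes the common naive-Bayes combination P = prod(p) / (prod(p) + prod(1 - p)). The formula Emmesmail actually uses is not stated here, and the function name, the tokens, and the token probabilities below are invented for the example.

```python
import math


def spam_probability(tokens, token_probs, punk=0.5):
    """Combine per-token spam probabilities into one overall spam probability.

    Tokens missing from token_probs are treated as unrecognized and given
    the probability punk. Assumes every probability lies strictly between 0 and 1.
    """
    log_spam = 0.0  # log of prod(p)
    log_ham = 0.0   # log of prod(1 - p)
    for tok in tokens:
        p = token_probs.get(tok, punk)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # prod(p) / (prod(p) + prod(1 - p)), computed in log space to avoid underflow
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Invented probabilities for two known tokens; "xqzt" and "vrrm" are unrecognized.
token_probs = {"meeting": 0.2, "invoice": 0.6}
known_only = spam_probability(["meeting", "invoice"], token_probs)
email = ["meeting", "invoice", "xqzt", "vrrm"]
with_punk_05 = spam_probability(email, token_probs, punk=0.5)
with_punk_09 = spam_probability(email, token_probs, punk=0.9)

print(f"known tokens only: {known_only:.3f}")
print(f"PUNK = 0.5:        {with_punk_05:.3f}")
print(f"PUNK = 0.9:        {with_punk_09:.3f}")
```

With PUNK = 0.5, each unrecognized word multiplies the spam and non-spam products by the same factor, so the result is identical to ignoring the word. With PUNK above 0.5, every unrecognized word raises the overall probability of spam.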

2018

By the end of 2017 it had become apparent that our Bayesian filter had trouble distinguishing between the required bank emails containing a one-use security code and the irritating ones, sent from the same bank email address, announcing that a requested transaction had been processed. Thus, in 2018, we used our director file (which determines, based upon the sender, recipient, and tokens in the subject, to which account incoming emails should be directed) to redirect the security-code emails to a totally separate account. As a result, virtually no messages from our bank end up in our normal email corpus, with almost all of the others being directed by the Bayesian filter to the spam corpus. Another thing learned in 2017 was that our filtering lets in so few spam emails that a test period of just three months does not contain sufficient data to discern small differences in performance. Accordingly, in 2018 we extended the test period for each value of PUNK from three to six months and began calculating an estimate of the uncertainty in our measurement.

Efficiency of Bayesian Filter with Different Values of PUNK

Period         PUNK   Spam Directed   Spam Bayes-tested   Spam Bayes-Caught   Valid Whitelisted   Valid Passed-all   False Pos.   % Spam rej.   % False Pos.
Jan-Jun 2018    0.9               2                 924                 911                 327                 37            2    98.6 ± 0.4            0.5
Jul-Dec 2018    0.8               0                  42                  41                  16                  0            0    97.6 ± 2.4              0
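
How the uncertainties in the table were calculated is not stated. One common estimate that is consistent with the figures shown is the binomial standard error, sqrt(p(1 - p)/n), where p is the fraction of spam caught and n is the number of spam emails tested. The Python sketch below computes it under that assumption; the function name is hypothetical.

```python
import math


def rejection_rate_with_error(caught, tested):
    """Return (percent rejected, binomial standard error in percent).

    Assumes each spam email is an independent trial that is either caught or missed.
    """
    p = caught / tested
    std_err = math.sqrt(p * (1.0 - p) / tested)
    return 100.0 * p, 100.0 * std_err


# Counts taken from the "Spam Bayes-Caught" and "Spam Bayes-tested" columns above.
for period, caught, tested in [("Jan-Jun 2018", 911, 924), ("Jul-Dec 2018", 41, 42)]:
    rate, err = rejection_rate_with_error(caught, tested)
    print(f"{period}: {rate:.1f} ± {err:.1f}")
```

Because the standard error shrinks only as the square root of the number of emails tested, a period with few spam emails gives a much larger uncertainty, which is the motivation for the longer test periods.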


Emmes Technologies
Updated 19 Jul, 2018