Current spam detection techniques
Spam is an ever-growing problem. The volume of spam increases daily: studies show that over 50% of all current email is spam, and the Radicati Group predicts this will reach 70% by 2007. In addition, spammers are becoming more sophisticated and constantly manage to outsmart 'static' methods of fighting spam.
The techniques currently used by most anti-spam software are static, meaning that they are fairly easy to evade by tweaking the message a little. To do this, spammers simply examine the latest anti-spam techniques and find ways to dodge them.
To combat spam effectively, a new, adaptive technique is needed. This method must keep up with spammers' tactics as they change over time. It must also be able to adapt to the particular organization that it is protecting from spam. The answer lies in Bayesian mathematics.
How the Bayesian spam filter works
Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event.
This same technique can be used to classify spam. If some piece of text occurs often in spam but not in legitimate mail, then it is reasonable to assume that an email containing that text is probably spam.
Creating a tailor-made Bayesian word database
Before mail can be filtered using this method, the user needs to generate a database with words and tokens (such as the $ sign, IP addresses and domains, and so on), collected from a sample of spam mail and valid mail (referred to as 'ham').
Creating a word database for the filter
A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam as opposed to legitimate mail (ham). This is done by analyzing the users' outbound mail and by analyzing known spam: all the words and tokens in both pools of mail are analyzed to generate the probability that a particular word points to the mail being spam.
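The counting step behind such a database can be sketched in Python as follows. This is only an illustration under stated assumptions: the tokenizer pattern, the function names (tokenize, build_counts) and the sample messages are hypothetical, and a real filter would use richer tokenizing rules and far larger mail pools.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not taken from any product): keeps
# words, the $ sign, and dotted tokens such as IP addresses and domain names.
TOKEN_RE = re.compile(r"\$|[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")


def tokenize(message):
    """Return the set of distinct lowercase tokens found in one message."""
    return {tok.lower() for tok in TOKEN_RE.findall(message)}


def build_counts(messages):
    """Count, for each token, how many messages in the pool contain it."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


# Tiny made-up pools, for illustration only.
spam_pool = [
    "Lowest mortgage rates, save $ now",
    "Cheap mortgage offer from 10.0.0.1",
]
ham_pool = [
    "Meeting notes attached",
    "Your mortgage statement from example.com",
]

spam_counts = build_counts(spam_pool)
ham_counts = build_counts(ham_pool)
```

Each token is counted once per message, so the counts read as "number of mails containing the token", which is the quantity the probability calculation works with.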
This word probability is calculated as follows: If the word "mortgage" occurs in 400 of 3,000 spam mails and in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000] divided by [5/300 + 400/3000]).
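The calculation above can be reproduced with a few lines of Python. The function name spam_probability and the neutral value of 0.5 for a token seen in neither pool are assumptions made for this sketch; the figures are those of the "mortgage" example.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Probability that a token points to spam, from per-pool frequencies."""
    spam_rate = spam_hits / spam_total  # share of spam mails containing the token
    ham_rate = ham_hits / ham_total     # share of ham mails containing the token
    if spam_rate + ham_rate == 0:
        # Assumption: a token never seen in either pool is treated as neutral.
        return 0.5
    return spam_rate / (spam_rate + ham_rate)


# "mortgage": 400 of 3,000 spam mails, 5 of 300 legitimate mails.
p = spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```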
Creating the ham database (tailored to your company)
It is important to note that the analysis of ham mail is performed on the organization's mail, and is therefore tailored to that particular organization. For example, a financial institution might use the word "mortgage" many times over and would get a lot of false positives if using a general anti-spam rule set. On the other hand, the Bayesian filter, if tailored to your company through an initial training period, takes note of the company's valid outbound mail (and recognizes "mortgage" as being frequently used in legitimate messages), and therefore has a much better spam detection rate and a far lower false positive rate.
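To illustrate why the tailored ham data matters, the sketch below compares the spam probability of "mortgage" under two ham pools. The spam figures and the first ham pool repeat the earlier example; the second ham pool (150 of 300 legitimate mails containing the word) is an invented figure standing in for a financial institution's outbound mail, not a measurement.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Probability that a token points to spam, from per-pool frequencies."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    return spam_rate / (spam_rate + ham_rate)


# Ham pool from the earlier example: 5 of 300 legitimate mails mention "mortgage".
general = spam_probability(400, 3000, 5, 300)

# Hypothetical financial institution: 150 of 300 legitimate mails mention it.
tailored = spam_probability(400, 3000, 150, 300)

print(round(general, 4))   # 0.8889
print(round(tailored, 4))  # 0.2105
```

With the tailored ham pool, the word on its own no longer suggests spam, which is how the false positive rate drops for that organization.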
Note that some anti-spam software with very basic Bayesian capabilities, such as the Outlook spam filter or the Internet Message Filter in Exchange Server, does not create a tailored ham data file for your company, but ships a standard ham data file with the installation. Although this method does not require an initial learning period, it has two major flaws:
1. The ham data file is publicly available, so professional spammers can hack it and thereby bypass the filter. For example, there are hacks available to bypass the Microsoft Outlook 2003 or Exchange Server spam filter. If the ham data file is unique to your company, then hacking the ham data file is useless.
2. Such a ham data file is a general one and is not tailored to your company. It therefore cannot be as effective, and you will suffer from a noticeably higher rate of false positives.
Creating the spam database
Besides ham mail, the Bayesian filter also relies on a spam data file. This spam data file must include a large sample of known spam and must be constantly updated with the latest spam by the anti-spam software.