Dealing with comment spam using Bayesian logic
In the past couple of months, I've begun to notice the occasional posting of 'comment spam' on my website. These have tended to include a short string of nonsense "mad-lib" style text, followed by a large number of offsite links. I currently use the tracker module to at least glance at every comment left on my website, so I eventually find this spam and manually delete it. However, as the rate of this comment spam has increased, I've been looking for a better way to deal with it.
Not wanting to re-invent the wheel, I began by looking at SpamAssassin and other free anti-spam tools. I had hoped to integrate one of these tools into Drupal, letting it do the actual work of deciding whether or not a given comment was spam. With further research, I found that this wasn't very workable, as these anti-spam tools tend to be mail-centric, looking at much more than just the body of an email. Instead, I read up on using Bayesian logic, and ultimately decided it would be best to write a simple Bayesian filter in PHP.
Spam module
Thus, last weekend the spam.module was born. Based on Paul Graham's papers on the subject, it breaks each new comment into words and finds those that are the "most interesting": the words most likely to indicate spam or most likely to indicate non-spam. Using the selected words, it calculates the probability that the new comment is spam. For comments with a high probability of being spam, special actions can be taken (such as preventing the comment from being viewed).
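The "most interesting" selection and the final score follow Paul Graham's naive-Bayes combination. Here is a minimal sketch in Python (the module itself is PHP; the function name and the cutoff of 15 tokens are illustrative, not the module's actual API):

```python
def combined_probability(token_probs, num_interesting=15):
    """Combine per-token spam probabilities into one score for the
    whole comment, using the naive-Bayes combination from Paul
    Graham's "A Plan for Spam". Only the tokens whose probability
    deviates most from a neutral 0.5 -- the "most interesting"
    ones -- are considered."""
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5),
                         reverse=True)[:num_interesting]
    prod = 1.0       # product of p(spam) over the interesting tokens
    inv_prod = 1.0   # product of (1 - p(spam))
    for p in interesting:
        prod *= p
        inv_prod *= 1.0 - p
    return prod / (prod + inv_prod)
```

With mostly spammy tokens the combined score is pushed toward 1, while a few strongly non-spam tokens can pull it back down.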
Bayesian logic, how it works
In the beginning, the module does not know what is spam and what is not. In the default configuration, it will assume that all words it has not seen before have a 40% chance of being spam. Based on this assumption, all comments it sees will initially be considered non-spam. It is up to the site administrators to "teach" the module when a comment was actually spam, by clicking a link at the bottom of the comment that reads "mark as spam". At this point, the occurrence of every word in the spam comment will be counted and stored in a "spam" database table. Should the module mistakenly label a valid comment as spam, the administrator will need to click "mark as not spam", and the words from that comment will be counted and stored in a separate "nonspam" table.
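The "teaching" step amounts to counting token occurrences into the right table. A rough Python sketch, with in-memory counters standing in for the module's database tables (all names here are illustrative):

```python
import re
from collections import Counter

def tokenize(text):
    """Naive word tokenizer; the real module's splitting rules differ."""
    return re.findall(r"[a-z0-9']+", text.lower())

# In-memory stand-ins for the "spam" and "nonspam" database tables.
spam_counts = Counter()
nonspam_counts = Counter()

def mark_as_spam(comment):
    """Admin clicked "mark as spam": count every word occurrence."""
    spam_counts.update(tokenize(comment))

def mark_as_not_spam(comment):
    """Admin corrected a false positive: count the words as non-spam."""
    nonspam_counts.update(tokenize(comment))
```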
When the module sees that either the "spam" or "non-spam" tables have changed, it will recalculate the probability of every word in these tables being spam or non-spam, storing these probabilities in a third "spam-probability" database table. It is against this third table that all new comments are weighed to determine whether or not they are most likely spam.
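The per-word recalculation can be sketched with Graham's formula (again in Python rather than the module's PHP; the doubling of non-spam counts and the 0.01/0.99 clamps come from Graham's paper, and this sketch assumes each corpus contains at least one message):

```python
def word_probability(word, spam_counts, nonspam_counts,
                     total_spam, total_nonspam, unknown=0.4):
    """Recompute one word's spam probability from the occurrence
    counts, per Paul Graham's formula: non-spam counts are doubled
    to bias against false positives, results are clamped to
    [0.01, 0.99], and unseen words fall back to the 40% default."""
    good = 2 * nonspam_counts.get(word, 0)
    bad = spam_counts.get(word, 0)
    if good + bad == 0:
        return unknown
    p = (bad / total_spam) / (good / total_nonspam + bad / total_spam)
    return min(0.99, max(0.01, p))
```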
By default, the module operates in TOE, or Train On Error, mode. That is, if it correctly labels a comment as spam or non-spam, it doesn't learn from the comment. Only if it makes a mistake which an admin then manually corrects will it add the comment's words to the appropriate database table for calculating new probabilities. Alternatively, the module can operate in TEFT, or Train Everything, mode. This is also known as "auto-learn" mode, as it will store away words from every comment it sees to fine-tune its probability tables. While both modes are supported, research seems to favor TOE as ultimately being more reliable.
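The difference between the two modes boils down to a single training decision, roughly like this (illustrative Python; the callback and names are assumptions, not the module's API):

```python
def maybe_train(tokens, predicted_spam, actually_spam, mode, learn):
    """Decide whether to learn from a comment. In TOE
    (Train On Error) mode, learn only when the filter's verdict
    disagreed with the admin's correction; in TEFT (Train
    Everything) mode, learn from every comment seen."""
    if mode == "TEFT" or predicted_spam != actually_spam:
        learn(tokens, actually_spam)
        return True
    return False
```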
As spam comments are still relatively rare, it will probably take a very long time to train the Bayesian logic to properly catch most spam. It is doubtful that you have a large pool of spam comments to draw from, meaning you will simply have to train as you go (another reason that TOE mode is well suited). That said, it is probably only a matter of time before spam comments become a regular nuisance, at which time training the filter won't take nearly as long.
Plans for the module, looking for feedback
At this time, the spam.module is in early development. In particular, while the underlying logic is believed to be fully functional, it has not been optimized for performance; the administrative interface used to control the module is still rough; and the module doesn't yet take any action when it detects spam. That said, none of these are difficult obstacles, so I expect to have a fully functional and useful module in the near future.
The purpose of this article is to generate some feedback. I have a number of ideas, but would be interested in hearing more from others who have also thought about this problem, or have perhaps already begun to tackle it.
For example, what is the best action to take when the module detects spam? It could mail the administrator to let him/her know what has been posted to the site. It could prevent the spam from being posted. It could let the spam be posted, but prevent it from being displayed. It could operate silently, or tell the offender that his/her comment appears to be spam. It could save the poster's IP address and blacklist it, preventing the user from leaving further comments, or even from accessing the website. It could provide all of these options and/or others, allowing the administrator to choose the best action.
As for detecting spam, the method currently used is simple Bayesian logic. It does not try to be clever by looking for certain "tell-tale phrases". Personally, I think going that route is a losing battle, as it's only a matter of time until the "tell-tale phrases" would be changed. However, I do intend to explore using Markovian logic, looking at multiple words together in addition to looking at each word individually.
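One simple first step toward that Markovian approach is to feed the filter adjacent word pairs alongside the single words, sketched here in illustrative Python:

```python
def with_bigrams(words):
    """Augment single-word tokens with adjacent word pairs so the
    filter can weigh short phrases as well as individual words."""
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]
```

A phrase like "cheap pills" can then accumulate its own spam probability even if "cheap" and "pills" individually appear in legitimate comments.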
I've also received spam forum and story postings on my website. Thus, I intend to expand the module to also scan newly submitted nodes. However, it may not be necessary or appropriate to scan all nodes, so I'm thinking of making this configurable as well. For example, you could enable spam filtering on 'comments' and 'forum posts', but not on 'stories' or 'book pages'.
Large Drupal sites may not be focused on only one topic. This could make it more difficult to properly train the Bayesian filter, as what is appropriate in one of the site's forums may be entirely out of place in another. To handle this, I'm considering allowing the administrator to break his/her site into multiple logical sections, each with its own Bayesian database. However, I've not decided if the increased complexity is worth the gain.
The actual interface for marking comments as spam or not-spam also needs a lot of work. The current implementation makes comments feel far too cluttered. Also, the administrative spam overview page should probably be reworked to provide more functionality.
Finally, I'm open to ideas on how to improve the Bayesian logic itself. For example, the method for tokenizing content should probably be smarter, especially when dealing with HTML markup, IP addresses, quoted content, numerical ranges, and monetary units. The function for measuring the spam probability of content currently accesses the database once for every token; it needs to be optimized, probably by reading larger chunks from the database at a time. If TEFT mode is used, a mechanism to prevent learning and relearning the same spam or non-spam over and over will also need to be developed to prevent word prejudice. Additionally, the algorithm that determines the probability of a given word being spam could also be improved.
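As one possible direction for the tokenizer, HTML markup could be stripped while IP addresses and monetary amounts are kept as single tokens rather than split on punctuation. A hedged Python sketch (the regular expressions here are illustrative, not the module's actual rules):

```python
import re

TOKEN_RE = re.compile(r"""
      (?:\d{1,3}\.){3}\d{1,3}    # IP addresses, kept whole
    | \$\d[\d,.]*                # monetary amounts such as $19.99
    | [a-zA-Z][a-zA-Z0-9'-]*     # ordinary words
""", re.VERBOSE)

def smarter_tokenize(text):
    """Strip HTML tags, then emit tokens that keep IP addresses
    and money amounts intact instead of splitting on punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML markup
    return [t.lower() for t in TOKEN_RE.findall(text)]
```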
Using the module
All this said, the module is currently being tested on my website, KernelTrap.org. In the 24 hours it's been installed, I've already fed it two true spam comments. For the first time, I actually look forward to spam, wanting to see how the module will perform. Of course, patience is necessary, as it will probably have to see a couple hundred spam comments before it's able to recognize them on its own. I'd best be careful what I wish for.