Fighting back at spam

Over the past few months, the spam module has evolved from a simple idea into a fully functional collection of tools that can automatically deal with spam comments and other spam content posted to a Drupal-powered website. The module currently provides four methods for detecting spam: a trainable Bayesian filter, support for manually entered custom filters, the ability to count the number of links in content, and detection of content posted from open email relays.

Spam, what and why?

Generally speaking, spam is any unwanted content posted to a website that is unrelated to the subject at hand. A web search for "spam comments" will turn up a large number of discussions on the phenomenon, a growing annoyance on the World Wide Web. Spam comments usually take the form of advertising and links back to the spammer's own website, often posted with the use of automated tools. The spammer's goal is usually to increase their ranking in Google and other search engines.

Bayesian filter

The spam module implements a PHP-native Bayesian filter which performs statistical analysis on spam content. It counts which words appear more often in spam content and which words appear more often in non-spam content, and then uses this information to determine the probability that new content is or is not spam.

Upon initial installation, the spam module naively assumes that all content is not spam. Each new comment and other content posted to the site will be marked as not spam, and it is up to the site administrator to teach it when it makes mistakes. Teaching is as simple as clicking "mark as spam" or "mark as not spam", whichever is appropriate. The module then breaks the posting up into words which are stored in the database for future reference. It operates on the rule that the more a given word shows up in spam content, the higher the probability that future content with that same word is also spam.
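The training and scoring loop described above can be illustrated with a small sketch. This is not the module's PHP code: the class name, the tokenizer, the add-one smoothing, and the clamping of the score are all assumptions made for illustration, and Python is used only for brevity.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-count Bayesian filter (hypothetical, not the module's code)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase the posting and break it up into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, is_spam):
        # Called when the administrator marks a posting as spam or not spam.
        label = "spam" if is_spam else "ham"
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        # Until it has been taught some spam, assume content is not spam.
        if self.doc_counts["spam"] == 0:
            return 0.0
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_spam = sum(self.word_counts["spam"].values())
        total_ham = sum(self.word_counts["ham"].values())
        n_docs = self.doc_counts["spam"] + self.doc_counts["ham"]
        # Smoothed prior odds of spam versus non-spam.
        prior_spam = (self.doc_counts["spam"] + 1) / (n_docs + 2)
        log_odds = math.log(prior_spam / (1 - prior_spam))
        for word in self.tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the result.
            p_spam = (self.word_counts["spam"][word] + 1) / (total_spam + len(vocab))
            p_ham = (self.word_counts["ham"][word] + 1) / (total_ham + len(vocab))
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on very long postings.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap pills buy now", is_spam=True)
spam_filter.train("buy cheap watches online", is_spam=True)
spam_filter.train("great article about kernel scheduling", is_spam=False)
spam_filter.train("thanks for the kernel patch review", is_spam=False)
```

With this training data, a posting such as "buy cheap pills" scores above 0.5 and one such as "kernel patch discussion" scores below it; where to draw the line between spam and non-spam is a tuning decision.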

As most spam comments are trying to increase their search engine ranking, the most revealing piece of information contained is usually a link back to their website. Some spammers actually cut and paste earlier comments from the same webpage, with the only new content being a link back to their website. Because of this, the filter provides special handling for domain names. Any new comment posted that contains the domain name of a known spam site will itself be marked as spam. Spammer domains are automatically learned from previous spam comments. An administrative page is provided for managing automatically learned spammer domains, allowing you to add additional domains or to edit and delete existing ones.
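A minimal sketch of the domain handling, written in Python for illustration only. The regular expression, the stripping of a leading "www.", and the names extract_domains and SpamDomainList are assumptions, not the module's actual implementation.

```python
import re
from urllib.parse import urlparse

# Rough pattern for http(s) links embedded in a posting.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_domains(text):
    """Return the set of host names linked from the text."""
    domains = set()
    for url in URL_PATTERN.findall(text):
        # Drop punctuation that commonly trails a link in running text.
        url = url.rstrip(".,;:!?)")
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if not host:
            continue
        host = host.rstrip(".")
        if host.startswith("www."):
            host = host[4:]
        domains.add(host)
    return domains


class SpamDomainList:
    """Hypothetical store of domains learned from known spam."""

    def __init__(self):
        self.domains = set()

    def learn_from_spam(self, text):
        # Every domain linked from a known spam posting is remembered.
        self.domains |= extract_domains(text)

    def matches(self, text):
        # New content is treated as spam if it links to a known spam domain.
        return bool(extract_domains(text) & self.domains)


known = SpamDomainList()
known.learn_from_spam("Buy now at http://www.cheap-pills.example/offer!")
```

Because only the domain is compared, a comment that copies an earlier legitimate comment word for word is still caught as long as it links to a learned domain.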

In practice, I have been using the plain Bayesian filter on KernelTrap.org, without even the special URL handling, for a couple of months. It took teaching the filter 36 spam comments before it was able to catch its first true spam posting in the wild. Since that time I've seen another 20 spam comments, and it has automatically caught nearly half of them. With further training, I expect this percentage to continue to increase, ideally to 95% accuracy or better. Realistically, however, a Bayesian filter alone is probably not sufficient, hence the additional spam detection tools in the 4.5 version of the module.

Custom filtering

Custom filters provide site administrators with the ability to blacklist, whitelist, or greylist new posts based on the matching of words, phrases, and regular expressions. The module tracks how often each of your custom filters match against new content, allowing you to determine their effectiveness.
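One way such filters could be modelled is sketched below in Python. The field names, the order of precedence (whitelist before blacklist before greylist), and the reading of "greylist" as "flag as suspicious" are all assumptions for illustration; the module may behave differently.

```python
import re
from dataclasses import dataclass


@dataclass
class CustomFilter:
    pattern: str            # word, phrase, or regular expression
    action: str             # "blacklist", "whitelist", or "greylist"
    is_regex: bool = False
    match_count: int = 0    # how often this filter has matched

    def matches(self, text):
        if self.is_regex:
            found = re.search(self.pattern, text, re.IGNORECASE) is not None
        else:
            found = self.pattern.lower() in text.lower()
        if found:
            # Track matches so the filter's effectiveness can be reviewed.
            self.match_count += 1
        return found


def apply_filters(filters, text):
    """Return 'not spam', 'spam', 'suspicious', or None if nothing matched."""
    # Every filter is evaluated so that all match counts stay accurate.
    actions = {f.action for f in filters if f.matches(text)}
    if "whitelist" in actions:
        return "not spam"
    if "blacklist" in actions:
        return "spam"
    if "greylist" in actions:
        return "suspicious"
    return None


filters = [
    CustomFilter("online casino", "blacklist"),
    CustomFilter(r"\bkernel\s+patch\b", "whitelist", is_regex=True),
    CustomFilter("free", "greylist"),
]
```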

URL limiting

Spam content often contains an abnormally large number of links, all in an effort to increase the spammer's search engine rankings. The spam module can be configured to count the number of links in each new posting, and if the count exceeds the limit you specify, the posting can be marked as spam. A threshold can be defined for the total number of links, as well as for how many times the same link shows up.
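The two thresholds can be sketched as follows. This is an illustrative Python example, not the module's code; the function name, the default limits, and the link pattern are assumptions.

```python
import re
from collections import Counter

# Rough pattern for http(s) links embedded in a posting.
LINK_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def exceeds_link_limits(text, max_total=5, max_repeats=2):
    """Return True if the text has too many links or repeats one too often."""
    # Normalize case and trailing punctuation so identical links compare equal.
    links = [u.rstrip(".,;:!?)").lower() for u in LINK_PATTERN.findall(text)]
    if len(links) > max_total:
        return True
    counts = Counter(links)
    return any(n > max_repeats for n in counts.values())
```

For example, a posting that repeats the same link three times would be flagged with the default max_repeats of 2, even though it stays under the total limit.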

Distributed server boycott list

Finally, the spam module can be enabled to look up the poster's IP address in the distributed server boycott list. If the IP is found, it is known to be an open relay or otherwise untrusted email server, and thus the comment will be marked as spam. The theory is that email spammers are probably also comment spammers.
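Lists of this kind are conventionally queried over DNS: the octets of the IP address are reversed, the list's zone is appended, and the resulting name is resolved; an answer means the address is listed. The sketch below shows that convention in Python. The zone name dnsbl.example.org is a placeholder, and the function names are hypothetical; this is not necessarily how the module performs its lookup.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    # Raises ValueError if ip is not a valid IPv4 address.
    addr = ipaddress.IPv4Address(ip)
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the blocklist zone has a record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record (or lookup failure): treat the address as not listed.
        return False
    return True


# 192.0.2.10 in the placeholder zone becomes 10.2.0.192.dnsbl.example.org
query = dnsbl_query_name("192.0.2.10", "dnsbl.example.org")
```

Treating a failed lookup as "not listed" means a DNS outage lets content through rather than blocking legitimate posters, which seems the safer default for a comment form.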

Current development

As several new features were recently added to the module, current development is minimal; the goal now is to see how it performs and to fix any new bugs that might turn up. Of course, effort will also be focused on optimizing the logic and generally cleaning up the code. Finally, there is a need to add watchdog logging to the module.

Spam mailing list

A mailing list has been created primarily to discuss the development of the Drupal spam module. However, anyone interested in discussing the spam problem in general and how it affects Drupal, or even in developing an alternative module for dealing with spam, is fully welcome and encouraged to join the mailing list. Full details can be found here.

Future development

When I originally started working on this module, my main goal was to learn how a Bayesian filter works. In doing my research, I learned of Markovian tokenizing, in which phrases are examined instead of just words. While implementing this functionality could result in a more effective Bayesian filter, the overhead it would introduce doesn't seem worth it. As I begin to see spam comments that are cut-and-paste copies of non-spam comments, I'm more and more convinced that improving the tokenizer to better locate URLs and domain names is a much wiser investment of effort.

I've also considered adding more actions to the module. Currently it can "auto-unpublish", and it can "notify the site administrator", and that's it. Other actions could include blacklisting the IP address (or user, if posted from a user account), preventing the spam from being posted in the first place, or interfacing with comment moderation to push suspicious comments into the moderation queue. None of these ideas are currently being actively pursued.

Finally, it would be wise to review the solutions available for other CMS's, such as Movable Type's MT-Blacklist and WordPress's numerous solutions. The problem is obviously not unique to Drupal, and we can learn a lot from other people's efforts.

Wishlist

The top of my wishlist is to get UI experts involved to help improve the overall usability of the module. For example, the existing "mark as spam" and "mark as not spam" text links add significant clutter to the link section of posts. One thought I've had is to replace them with small icons. Additionally, as more functionality is introduced, more configuration options get introduced, and this can lead to general confusion. Perhaps an effort could be made to pick logical defaults and reduce the number of configurable options. It would be interesting to compare the usability of this module with that of open source solutions for other CMS's.

My second wishlist item would be to merge some spam filtering functionality into the Drupal core. I'm thoroughly convinced that it's just a matter of time until all website owners have to regularly deal with spam, just like all email owners currently have to deal with spam.

Summary

The collection of tools that comprise the spam module should prove quite effective in beginning to battle the rising tide of spammer comments, but it's certain to be an ongoing effort. If you're interested in getting involved, consider subscribing to the mailing list and joining the discussion.
