National Novel Writing Month and Drupal Scalability

Drupal 5 scales really well

This surprised me. Having run into many of the common Drupal scalability problems on smaller sites with shared hosting, I expected much worse. In the process of migrating National Novel Writing Month to Drupal 5, I discovered that there are number of contrib modules, patches, hacks, and techniques that can be applied to allow Drupal 5 to scale to handle a medium-traffic, high authenticated/anonymous ratio, web site like ours, as long as you can live without modules that use the node_access table.

I don't know of any CMS that scales well right out of the box. That fact that someone with my (relatively low) skill set, working alone, was able to resolve major scalability issues in about a month really speaks to the quality of the Drupal codebase and, especially, the community.

Starting points

At the start of October we had 2 webservers running apache and a single MySQL server.
httpd.conf and my.cnf were reasonably well tuned
Drupal's htaccess was moved into an Apache conf file so that we could set AllowOverride none.
Boost was installed and configured so that cached static pages would be served from the filesystem instead of the database.

The first wall - node_access

Every year NaNoWriMo gets a small traffic peak on October 1 when the new site launches, and an enormous traffic peak on November 1. October is an order of magnitude more traffic than we get for the rest of the year, and November 1 is an order of magnitude higher than that (which quickly ramps down to around double the October traffic level for the rest of November)

After the 10/1 uptick, it became clear that at ~300 logged in users the site would slow dramatically. This meant a slow site for most of the daytime hours in the US.

The webservers were underutilized, and the DB server's CPU was pinned. Looking at the slow query log, we saw lots of node access queries.

I was aware that node access isn't known to scale gracefully. We were using 3 nodeaccess modules - forum_access, og, and nodeaccess. Since we also use node profile, and have over 100,000 entries in the user table, that meant a lot of rows in the node_access table (by then end of the event we had ~100K forum nodes, so disabling node profile wouldn't have helped in the long run).

It was easy for us to live without nodeaccess, and without the node_access features of og. However, forum_access was necessary for us. Fortunately, forum_access uses ACL for much of its access control - including its taxonomy query rewrite - only using node_access for access to individual nodes and to rewrite queries for other modules like search.

We removed all of the node_access code from forum_access, replacing it with hook_nodeapi for individual nodes, and patching node_search to repect forum_access's ACL entries when searching. I'm still working on getting the node_access-free version of forum_access to work without patching core, and will submit patches when it's ready.

Once we disabled node_access, things became much, much faster, and we were able to get to ~600 logged-in users without punishing slowness.

The second wall - LOWER

As November 1 approached, traffic slowly grew until the site was slow again for most of the US daylight hours. We found a lot of small changes that helped keep things reasonably accepatable:

Thanks to an unsolicited offer of help from Merlinofchaos and additional support from Robert Douglass, we were able to successfully set up the advcache and memcache modules to cache data for logged in users. This resulted in a noticeable increase in speed.
Indexes were added to the privatemsg author field and the buddylist buddy field which cleared up some slow queries.
Sessions, history, and users tables were switched to INNODB (our users update their profiles constantly) to try and avoid locking issues.
Search and tracker were disabled, as we were seeing many of their queries in the slow log.
Watchdog was disabled by commenting out it's SQL query, as we didn't feel it's benefit was worth the database usage.
We wrote an alternate set of forum_cache routines for advcache (we couldn't use advcache's standard forum_patch due to forum_access)
We patched boost to cache URLs with query strings

Things were faster, but they were still slow. It was a faster slow. Finally a couple of slow queries caught my attention:

<br>SELECT * FROM 'users' WHERE mail = 'user@example.com'<br>SELECT name FROM users WHERE LOWER(name) LIKE LOWER('username')<br>

The first query, issued by the 'Request new password' feature, was resolved by adding an additional index on the users.mail field.

The second query was coming from user.module (when checking the uniqueness of usernames on account signup) and a few other places. Although name is indexed, if you ask for LOWER(name), the index doesn't help. We have a very large user table and lots of sign-ups in October.

I patched user.module to remove the LOWER (it's unnecessary for MySQL and only exists for PostgreSQL compatibility), but later discovered a thread with some better options for fixing this: http://drupal.org/node/181625.

Removing LOWER resulted in a dramatic speed increase. Our Queries-per-second on the database server jumped dramatically and the CPU was never at 100% usage anymore. Finally, as traffic increased and November 1 drew closer, the webservers were the bottle neck instead of the database.. We were able to support ~1500 logged in users without slowness, and as many as 2000 without punishing slowness.

November

NaNoWriMo has a long and impressive history of crashing hard on November 1. It's really impossible for an organization with our meagre resources to afford the server and human resources necessary to cope with a massive traffic spike that happens one day a year. We didn't crash hard this year, but the web servers were swamped enough that some people got timeouts.

We did crash on 11/3, but this was due to a bug. I had written some bad date handling code that led to an infinite loop, and eventually an out of memory error for PHP. When many many apache children ran out of memory at once, apache was unable to recover.

We added a third webserver and put all the webservers behind a squid reverse-proxy. This combined with the decline in traffic after 11/1 put an end to our performance issues.

Another slow query caught our attention in the days after 11/1. It was the sess_gc() query from session.inc. Our sessions table was enormous and the query to delete old sessions was taking a long time. We reduced the gc_maxlifetime setting from the default 55 hours to 2 hours (we use Persistent Login, so anyone who wants to remain logged in after 2 hours of idle can use PL) and noticed another small improvement in response time.

Towards the end of the month I wrote some caching for our custom user search module that allowed us to reenable part of the site's search functionality.

Going forward

Our next steps will be to reenable search and tracker.

For search we're hoping to be able to write some caching that won't violate forum access.
We'll also look at views_fastsearch.
Finally, we'll apply one of the patches from the discussion at http://drupal.org/node/105639 to optimize the tracker module.

MySQL replication is another technique that would have saved us a lot of pain this year. We expect to switch to a replicated set-up for next year's event, and perhaps upgrade to 64-bit servers. I believe that this will allow us to remain responsive even with the November 1 traffic spike.

Notes
Advcache

We made some modifications to Advcache's comment caching: http://drupal.org/node/183256
We wrote an alternative to the Advcache's forum caching that doesn't conflict with forum_access: http://drupal.org/node/188005
Both of these will be cleaned up and submitted in the next month or so

Boost

Boost's .htaccess disables browser caching of images and other static files. It's worth patching this on a high-traffic site. http://drupal.org/node/185075
Boost doesn't cache pages with query strings by default (for example, node/123?page=1). Caching these helps a lot on a high-traffic forums site, where many nodes have many comments. http://drupal.org/node/182687
Boost doesn't play nicely with Persistent Login out of the box. http://drupal.org/node/186716
Boost's cron hook, as with any cron hook that deletes files on the webserver, doesn't work well if there are lots of files cached, and has trouble if you have multiple webservers that don't share a filesystem. It's worth writing your own cron job to delete expired Boost cache files.

Memcache

You should set up a separate bin for each cache_table, otherwise one module's 'cache_clear_all' will affect everything cached.
Memcache handles serialization differently than the core cache.inc. Memcache has a patch to handle changes to core for this, but any contrib module that caches will probably need to be patched to work correctly with Memcache - including Advcache, although there are plans to make Advcache aware of Memcache out of the box.

MySQL

The MySQL query_cache doesn't do a lot for us. The tables that would benefit from query cache are so large and get updated so often that, if we were to increase query_cache_size enough to hold a reasonable amount of data, we actually suffer worse performance. I plan to benchmark with query cache disabled, as I think it might be a liability for an active site with constant updates to node and user tables. http://www.mysqlperformanceblog.com/2007/03/23/beware-large-query_cache-...

Squid Reverse-proxy

We don't use a shared filesystem for our web servers. We replicate files with unison. In this case it's helpful to use the sourcehash option to cache_peer so that each user remains on the same webserver for their whole session - otherwise users who upload files might not see the file on the subsequent request unless they wait for the next sync to occur.
Squid respects the cache headers sent by the upstream web server, so it's worth thinking about which files should be cached for how long and editing your confs appropriately

Original Article:

National Novel Writing Month and Drupal Scalability