Revisiting Drupal Page Cache
Hopefully, almost all Drupal devs and site maintainers are familiar with Drupal's page caching mechanism, but I'll provide a brief high-level overview for the uninitiated and those in need of a refresher:
Page cache overview
Drupal dynamically generates all of its pages, weaving through innumerable hooks, alters, and preprocess functions to finally print out a formatted HTML page. The whole process can be slow and resource intensive, so Drupal offers page caching to mitigate the performance issues.
At the highest level, Drupal's page cache works by mapping a URL string (e.g. http://example.com) to its final HTML output (<html>...</html>). Drupal dynamically generates the page once and caches it, so all subsequent users bypass page generation and instead get static HTML directly from the database. This cache is periodically expired, forcing the page to be dynamically built and cached again.
Because Drupal is able to react to different properties based on the user (permissions, profile values, etc.), this caching functionality is limited to anonymous traffic only.
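To make the URL-to-HTML mapping described above concrete, here is a minimal shell sketch of the same idea. It is not how Drupal implements it: render_page, serve, and the temporary directory standing in for the cache table are all hypothetical names chosen for illustration.

```shell
# Hypothetical stand-in for Drupal's full page build (hooks, alters, theming).
render_page() {
  printf '<html>Rendered %s</html>' "$1"
}

# Directory standing in for the page cache table; one file per cached URL.
CACHE_DIR=$(mktemp -d)

# Serve a URL: return the stored HTML if present, otherwise build and store it.
serve() {
  # Derive a filesystem-safe key from the URL (checksum collisions are ignored here).
  key=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  if [ -f "$CACHE_DIR/$key" ]; then
    echo "cache hit: $1" >&2
    cat "$CACHE_DIR/$key"
  else
    echo "cache miss: $1" >&2
    render_page "$1" | tee "$CACHE_DIR/$key"
  fi
}

serve "http://example.com/" > /dev/null   # first request builds and stores the page
serve "http://example.com/" > /dev/null   # second request reads the stored copy
```

Emptying the directory plays the role of expiring the cache: the next request for any URL is a miss and the page is built again.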
Configuring page cache
Drupal also offers a few ways to configure this behavior.
In Drupal 6, "caching mode" lets you choose between disabled, normal, and aggressive, where "aggressive" just means that normal Drupal tasks performed at the very beginning and very end of a page's load are ignored (developers, that's your hook_boot and hook_exit); this makes the database query to retrieve the cached content the only task performed on any given page load.
Additionally, you can decide whether to gzip compress the HTML markup with the "Page compression" setting; this can provide savings in bandwidth and download times.
In Drupal 7, you can set "expiration of cached pages," which tells external caching mechanisms like Varnish (and most modern browsers) how long to hang on to a local copy of the webpage before requesting a new one.
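One way to see what "expiration of cached pages" is telling browsers and proxies is to look at the response headers for an anonymous request. The sketch below assumes the value is sent as a max-age directive in the Cache-Control header; max_age_from_headers is a hypothetical helper name, and example.com is a placeholder.

```shell
# Print the max-age value (in seconds) found in HTTP response headers read from stdin.
max_age_from_headers() {
  tr -d '\r' \
    | sed -n 's/^[Cc]ache-[Cc]ontrol:.*max-age=\([0-9][0-9]*\).*/\1/p' \
    | head -n 1
}

# Against a live site you might run:
#   curl -sI http://example.com/ | max_age_from_headers
# Here the helper is fed a sample response instead.
printf 'HTTP/1.1 200 OK\r\nCache-Control: public, max-age=10800\r\n\r\n' | max_age_from_headers
```

With the sample headers this prints 10800, i.e. a three-hour expiration. If nothing is printed, the response carried no max-age directive.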
Minimum cache lifetime
One commonality between Drupal 6 and Drupal 7 is the "minimum cache lifetime" setting.
Drupal 7 describes this setting by stating, "Cached pages will not be re-created until at least this much time has elapsed." You would think that this would mean that each page will persist in the database at least as long as the value of this setting. In reality, this is patently false. Minimum cache lifetime doesn't apply to each cache entry; it applies to the page cache bin as a whole.
Suppose you set it to 3 hours (and you have cron properly running); the entire page cache table will be truncated once every 3 hours. If it was last truncated at noon and a particular page was cached at 2:55pm, rather than lasting the 3 hours to 5:55pm, it's wiped along with everything else at 3:00pm.
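The arithmetic in that example can be written out as a small shell sketch. It assumes cron fires exactly when the lifetime elapses; seconds_until_wipe is a hypothetical helper, and times are expressed as seconds since midnight to keep the numbers readable.

```shell
# Seconds a cached page survives before the next bin-wide flush.
# $1 = time of last flush, $2 = minimum cache lifetime, $3 = time the page was cached
seconds_until_wipe() {
  echo $(( ($1 + $2) - $3 ))
}

last_flush=$(( 12 * 3600 ))            # noon
lifetime=$(( 3 * 3600 ))               # 3 hours
cached_at=$(( 14 * 3600 + 55 * 60 ))   # 2:55pm

survives=$(seconds_until_wipe "$last_flush" "$lifetime" "$cached_at")
echo "Cached copy survives $(( survives / 60 )) minutes, not $(( lifetime / 60 ))"
# prints: Cached copy survives 5 minutes, not 180
```

The later a page is cached within the window, the shorter its life, because the flush time depends only on the bin's last flush and not on when the entry was written.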
What's worse, if you have no minimum cache lifetime set, but your cron runs more frequently (say, once every half hour), the whole cache table gets truncated every cron run.
Depending on the settings, for smaller sites with little-to-modest traffic, this could mean that virtually all pages are being served uncached.
What do we do?
For very large websites, the Varnish HTTP accelerator is the de facto Drupal page cache replacement. For websites that fit into the small-to-mid-size range, there are other options to explore.
Avoid page cache
In the contrib space, there are a couple of options, the most prominent being Boost. Rather than caching HTML pages to the database, it writes pages to static HTML and pulls from the files directory. It also includes a crawler. I haven't personally used this module, so I can't comment on its effectiveness, but it's important to note that configuration of this module is non-trivial.
Run a crawler
If you're comfortable working in the command line, you can run a crawler. The idea is to wrap your existing call to Drupal's cron in a script that checks whether the page cache will get flushed. If so, it runs a crawler immediately following execution of cron. I have a future post planned on this topic, but you can get my thoughts on the matter in the bash snippet below:
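# Read the minimum cache lifetime and the time of the last page cache flush.
CACHE_LIFETIME=$(drush vget cache_lifetime | sed 's/[^0-9]//g')
LAST_CACHE_FLUSH=$(drush vget cache_flush_cache_page | sed 's/[^0-9]//g')

# If the lifetime will have elapsed a minute from now, this cron run will flush the page cache.
# (( )) compares numerically; [[ a > b ]] would compare the values as strings.
if (( $(date +%s) + 60 > CACHE_LIFETIME + LAST_CACHE_FLUSH )); then
  WARM_CACHE=1
fi

drush cron

# After a flush, request every URL listed in the sitemap so each page is cached again.
if [[ $WARM_CACHE ]]; then
  wget --quiet http://example.com/sitemap.xml --output-document - \
    | grep -Eo "http://example.com[^<]+" \
    | while read -r line; do
        curl -A 'MyUserAgent/1.0' -s -L "$line" > /dev/null 2>&1
        echo "$line"
      done
fi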
Fix the bug
This Core thread addresses the problem directly; as of this writing, there's a patch against Drupal 7 waiting for review. If you're able to review and test the patch, please do! Sorry Drupal 6 developers, it seems unlikely that this will be backported.
As for Drupal 8, I can't say for sure what page cache will look like. As a result of the WSCCI initiative and all of the changes it's making to page routing, as well as Symfony integration, consensus seems to be that Drupal's page cache will be re-written from the ground up to use Symfony's HTTP Cache, though as of yet, I haven't seen any major work in that area. Eager contributors, this is your Drupal/Symfony HTTP Cache thread.
Further Reading