Race conditions and caching variables
I would like to declare an utter hatred of race conditions. A race condition arises when code is written without fully considering that another thread (e.g. another website hit), or several, may be running concurrently. Consider the following, which has been increasingly frustrating me recently:
Drupal stores variables in the ‘variables’ table. It also caches these in the ‘cache’ table so that, rather than doing multiple SELECT queries for multiple variables, it simply gets all the variables straight out of the cache table in one SELECT and then unserializes them.
cron_semaphore is one of these variables. It is created when cron starts and deleted when cron finishes. If it hasn’t been deleted, that should mean cron hasn’t finished running yet, so the next time cron tries to run it will quit straight away. But due to a certain race condition the deletion doesn’t always stick, as follows (p2 stands for an unrelated process running concurrently, e.g. a visitor to your website):
1. cron starts, cron_semaphore variable inserted (and variables cache is deleted)
2. p2 starts, variables cache is empty so it runs “SELECT * FROM {variables}” then…
3. cron finishes, cron_semaphore variable deleted and the variables cache is cleared
4. … p2 inserts result of “SELECT * FROM {variables}” into cache, but that SELECT was called before cron deleted the variable
5. you now have no mention of cron_semaphore in the variables table, but there it is back in the variables cache!
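The five steps above can be replayed deterministically. This is a minimal sketch in Python rather than Drupal's actual PHP: the two dictionaries are hypothetical stand-ins for the variables table and the cache table, and the helper names are only illustrative. The interleaving is written out by hand so the outcome does not depend on timing.

```python
# Hypothetical in-memory stand-ins for the two tables (not Drupal's schema).
variables_table = {"site_name": "example"}
cache_table = {}


def variable_set(name, value):
    """Write a variable and clear the cached copy of all variables."""
    variables_table[name] = value
    cache_table.pop("variables", None)


def variable_del(name):
    """Delete a variable and clear the cached copy of all variables."""
    variables_table.pop(name, None)
    cache_table.pop("variables", None)


# 1. cron starts: semaphore inserted, variables cache deleted.
variable_set("cron_semaphore", 1234567890)

# 2. p2 starts: the cache is empty, so it reads the whole table...
assert "variables" not in cache_table
p2_snapshot = dict(variables_table)

# 3. cron finishes: semaphore deleted, variables cache cleared.
variable_del("cron_semaphore")

# 4. ...p2 now writes its snapshot, taken before the delete, into the cache.
cache_table["variables"] = p2_snapshot

# 5. The table and the cache disagree about the semaphore.
in_table = "cron_semaphore" in variables_table
in_cache = "cron_semaphore" in cache_table["variables"]
print("in table:", in_table, "| in cache:", in_cache)
```

The root of the problem is that p2's read in step 2 and its cache write in step 4 are not one atomic operation, so cron's delete can land in between.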
Consider many visits to your website concurrently and you soon realise this can become a very common occurrence. In fact, this exact problem afflicts us at least a handful of times every day. As a result cron keeps trying to run but immediately quits when it sees the semaphore variable still there. After an hour cron finally deletes the stale semaphore, but in the meantime crucial stuff doesn’t get done.
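To make the cost concrete, here is a rough Python sketch of the semaphore check as described above. It is not Drupal's code: the function name, the return values and the 15-minute cron interval are assumptions for illustration, as is the detail that the attempt which clears a stale semaphore does not itself run cron. Only the one-hour limit comes from the behaviour described here.

```python
SEMAPHORE_TIMEOUT = 3600  # seconds; the one-hour limit mentioned above


def cron_attempt(variables, now):
    """Decide what a cron attempt does, given the variables it can see.

    `variables` is a hypothetical dict standing in for the cached variables.
    """
    started = variables.get("cron_semaphore")
    if started is None:
        variables["cron_semaphore"] = now  # take the semaphore and run
        return "run"
    if now - started > SEMAPHORE_TIMEOUT:
        # Assumption: the stale semaphore is cleared, but this attempt
        # still does not run cron.
        del variables["cron_semaphore"]
        return "cleared stale semaphore"
    return "quit"


# A semaphore left behind in the cache at t=0, with cron attempted
# every 15 minutes (an assumed schedule).
stale_cache = {"cron_semaphore": 0}
results = [cron_attempt(stale_cache, now) for now in range(900, 6300, 900)]
print(results)
```

Under these assumptions, four attempts quit, the fifth only clears the semaphore, and real work resumes on the sixth, a full 90 minutes after the stale entry appeared.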
Web applications can quickly become riddled with race conditions such as these. I’ve spotted at least two more in Drupal’s core code in the past. When the ‘bugs’ occur as a result they can be tricky to pin down, appearing to be freakish random occurrences. Worse yet, even when found they can be a royal pain to untangle.