Speeding up Complex Drupal Data Loads with Custom Caches
Recently we had the task of loading data from a content type with 350 fields. Each node is a university's enrollment data for one year, broken down by major, gender, minority status, and a number of other categories. CSV exports of this data quickly became problematic: even before we reached 350 fields, the overhead of the Views module meant we hit PHP timeouts when exporting all the nodes. If you're not familiar with Drupal's database structure, each field's data is stored in its own table, named 'field_data_FIELDNAME'. Loading an entire node means JOINing the node table by entity_id with each related field table. When a node only has a handful of fields those JOINs are fine, but at 350 fields the query becomes painfully slow.
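To give a sense of why that is, here is a rough sketch of the kind of query that gets built when many fields are requested at once. The field names below are made up for illustration:

// Illustrative only: each field on the content type pulls in another
// field_data_* table, so a node with 350 fields means hundreds of JOINs.
// The field names here are hypothetical.
$query = db_select('node', 'n')
  ->fields('n', array('nid', 'title'));
$query->leftJoin('field_data_field_enrollment_total', 'total', 'total.entity_id = n.nid');
$query->leftJoin('field_data_field_enrollment_female', 'female', 'female.entity_id = n.nid');
// ... one more JOIN for each remaining field ...
$query->fields('total', array('field_enrollment_total_value'))
  ->fields('female', array('field_enrollment_female_value'));
$result = $query->execute();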
On this site we're also plotting some of the data with highcharts.js. We really hit a wall when trying to generate aggregate data to plot alongside a single university's numbers: that meant loading every node of this content type to calculate the averages, which turned our slow query into a very slow one. We even hit the database's limit on the number of JOINs allowed in a single query.
In retrospect this is a perfect case for a custom entity, but we already had thousands of nodes in the existing content type. Migrating them and implementing a custom entity was no longer a good use of time. Instead, we added a custom table that keeps all the single value fields in a serialized string.
The table is defined with hook_schema() in our module's .install file:
function ncwit_charts_schema() {
  $schema['ncwit_charts_inst_data'] = array(
    'description' => 'Table for serialized institution data.',
    'fields' => array(
      'nid' => array(
        'type' => 'int',
        'default' => 0,
        'not null' => TRUE,
        'description' => 'Node id for this row.',
      ),
      'tid' => array(
        'type' => 'int',
        'default' => 0,
        'not null' => TRUE,
        'description' => 'Institution term id that this data belongs to.',
      ),
      'year' => array(
        'type' => 'int',
        'default' => 0,
        'not null' => TRUE,
        'description' => 'School year for this node.',
      ),
      'data' => array(
        'type' => 'blob',
        'not null' => FALSE,
        'size' => 'big',
        'serialize' => TRUE,
        'description' => 'A serialized array of name value pairs that store the field data for a survey data node.',
      ),
    ),
    'primary key' => array('nid'),
  );

  return $schema;
}
The most important part of the array is 'data'. A plain blob column holds up to 64 KB; setting 'size' => 'big' maps it to a LONGBLOB on MySQL, which leaves plenty of room for the serialized array. Not shown is another array that defines a table for our aggregate data.
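For illustration only, a second schema entry for that aggregate table could look something like the following; the table and column names here are assumptions, not our actual schema:

// Hypothetical sketch of an aggregate table keyed by school year; the real
// table isn't shown in this post.
$schema['ncwit_charts_aggregate_data'] = array(
  'description' => 'Table for serialized aggregate data, keyed by school year.',
  'fields' => array(
    'year' => array(
      'type' => 'int',
      'default' => 0,
      'not null' => TRUE,
      'description' => 'School year this aggregate row covers.',
    ),
    'data' => array(
      'type' => 'blob',
      'not null' => FALSE,
      'size' => 'big',
      'serialize' => TRUE,
      'description' => 'A serialized array of aggregated name value pairs.',
    ),
  ),
  'primary key' => array('year'),
);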
When a new node is saved, hook_node_insert() is invoked; when an existing node is saved again, hook_node_update() fires instead. We implement both so the cached copy stays current.
/**
 * Implements hook_node_insert().
 *
 * Saves serialized field data to the inst_data table for a brand new node.
 * hook_node_update() does not fire for new nodes, so this hook is needed.
 */
function ncwit_charts_node_insert($node) {
  ncwit_charts_serialize_save($node);
}

/**
 * Implements hook_node_update().
 *
 * Saves serialized field data to the inst_data table when an existing node
 * is updated. New nodes are handled by hook_node_insert() above.
 */
function ncwit_charts_node_update($node) {
  if (isset($node->nid)) {
    ncwit_charts_serialize_save($node);
  }
}
Now we actually process the fields, serialize them, and store the result. This section will vary greatly depending on your fields.
function ncwit_charts_serialize_save($node) {
  if ($node->type !== 'data_survey') {
    // Only data_survey nodes are cached in this table.
    return;
  }

  // Save each single-value field as a simple key => value item.
  $data = array();
  foreach ($node as $key => $value) {
    if (is_array($value) && isset($value[LANGUAGE_NONE][0]['value'])) {
      $data[$key] = $value[LANGUAGE_NONE][0]['value'];
    }
  }

  $fields = array();
  $fields['nid'] = $node->nid;
  $fields['tid'] = $node->field_institution_term[LANGUAGE_NONE][0]['tid'];
  $fields['year'] = $node->field_school_year[LANGUAGE_NONE][0]['value'];
  $fields['data'] = serialize($data);

  db_merge('ncwit_charts_inst_data')
    ->key(array('nid' => $node->nid))
    ->fields($fields)
    ->execute();
}
When a node is deleted, we have some clean-up to do.
/**
 * Implements hook_node_delete().
 *
 * Removes the node's row from the inst_data table.
 */
function ncwit_charts_node_delete($node) {
  if ($node->type !== 'data_survey') {
    // We only care about data_survey nodes.
    return;
  }

  $query = db_select('ncwit_charts_inst_data', 'i');
  $query->fields('i')->condition('i.nid', $node->nid);
  $data = $query->execute()->fetchAssoc();

  if ($data) {
    db_delete('ncwit_charts_inst_data')
      ->condition('nid', $node->nid)
      ->execute();
  }
}
When the module is first installed, or whenever the fields change, a batch process re-saves the serialized string for every existing node. Aggregate data is calculated during cron and saved in another table. Rather than loading every node with hundreds of JOINs, the chart data now comes from a simple query against these custom tables.
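As a sketch of the cron half (the aggregate table name and the simple per-year averaging here are assumptions for illustration), something like this reads the custom table, averages each numeric field per year, and merges the result into the aggregate table:

/**
 * Implements hook_cron().
 *
 * Sketch only: rebuild per-year averages from the custom table instead of
 * loading every node. 'ncwit_charts_aggregate_data' is the hypothetical
 * aggregate table sketched earlier.
 */
function ncwit_charts_cron() {
  $sums = array();
  $counts = array();
  $result = db_select('ncwit_charts_inst_data', 'i')
    ->fields('i', array('year', 'data'))
    ->execute();
  foreach ($result as $row) {
    foreach (unserialize($row->data) as $field => $value) {
      if (is_numeric($value)) {
        $sums[$row->year][$field] = (isset($sums[$row->year][$field]) ? $sums[$row->year][$field] : 0) + $value;
        $counts[$row->year][$field] = (isset($counts[$row->year][$field]) ? $counts[$row->year][$field] : 0) + 1;
      }
    }
  }
  foreach ($sums as $year => $fields) {
    $averages = array();
    foreach ($fields as $field => $sum) {
      $averages[$field] = $sum / $counts[$year][$field];
    }
    db_merge('ncwit_charts_aggregate_data')
      ->key(array('year' => $year))
      ->fields(array('data' => serialize($averages)))
      ->execute();
  }
}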
Pulling a row out of the database and calling unserialize() gives us a simple associative array of the data. To pass this data to highcharts.js we define a callback that returns the arrays encoded as JSON. This obviously gets more complicated when dealing with multiple languages or multi-value fields, but in our case almost everything is a simple integer.
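That callback amounts to something like this minimal sketch (the function name is made up and the hook_menu() routing is omitted): it loads one institution's cached rows, unserializes each one, and returns everything keyed by school year.

/**
 * Page callback sketch: return one institution's cached rows as JSON for
 * highcharts.js. The function name is hypothetical.
 */
function ncwit_charts_inst_data_json($tid) {
  $rows = array();
  $result = db_select('ncwit_charts_inst_data', 'i')
    ->fields('i', array('year', 'data'))
    ->condition('i.tid', $tid)
    ->orderBy('i.year')
    ->execute();
  foreach ($result as $record) {
    // Each row unserializes back into a simple field_name => value array.
    $rows[$record->year] = unserialize($record->data);
  }
  drupal_json_output($rows);
}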
This process of caching our nodes as serialized data changed our loading speed from painfully slow to almost instant. If you run into similar challenges, hopefully this approach will help you too.