Why is Module X slow? Or is it really module Y? Debugging performance at scale
Its an exciting time to be working on Drupal.org! Content strategy is being implemented, and thus we've been evaluating and implementing modules, including:
- Organic Groups
- Panels
- Page Manager
- Panelizer
- Flag 3.x (already implemented for spam)
There are numerous reasons why we're looking at using these modules, but one of them is because we're able to see them in action on events.drupal.org. Panelizer was being called out as one of the least performant of these modules. This was due to a slow query that was observed at Drupalcon Barcelona:
SELECT * FROM {panelizer_entity} WHERE entity_type = :entity_type AND entity_id IN (:ids)
For each entity that gets loaded, panelizer needs to grab the associated panelizer_entity ID and load it up for preperation to view. While this isn't a problem for a few nodes, it doesn't scale when you're querying 100s or thousands of them:
2015.09.23.mysql-slow.gz:2015-09-23T09:41:14.147976+00:00 db5 mysql-slow: EXPLAIN SELECT * FROM panelizer_entity WHERE entity_type = 'node' AND entity_id IN ('171', '177', '179', '181', '183', '185', '189', '193', '199', '201', '209', '211', '213', '215', '217', '219', '221', '225', '227', '229', '233', '235', '239', '243', '245', '247', '249', '251', '253', '255', '277', '279', '281', '287', '289', '291', '293', '295', '297', '301', '303', '305', '307', '309', '311', '313', '317', '319', '321', '323', '325', '327', '335', '337', '343', '347', '349', '351', '353', '359', '363', '367', '381', '385', '389', '391', '397', '399', '403', '405', '407', '409', '411', '413', '415', '417', '419', '421', '423', '427', '429', '431', '433', '435', '437', '439', '441', '443', '449', '451', '453', '455', '459', '461', '463', '465', '467', '469', '471', '473', '475', '477', '479', '481', '483', '485', '489', '495', '497', '499', '501', '503', '505', '507', '509', '511', '513', '515', '517', '519', '521', '523', '525', '527', '529', '531', '533', '535', '537', '539', '541', '543', '545', '547', '549', '551', '553', '555', '557', '559', '561', '563', '565', '567', '569', '571', '573', '575', '577', '579', '583', '585', '589', '591', '593', '595', '597', '599', '601', '603', '605', '607', '609', '611', '613', '615', '617', '619', '623', '631', '633', '635', '637', '639', '641', '643', '645', '647', '649', '651', '653', '655', '657', '659', '661', '663', '665', '667', '669', '671', '673', '675', '677', '679', '681', '683', '685', '687', '689', '691', '693', '695', '697', '699', '701', '705', '707', '709', '711', '713', '715', '717', '719', '721', '723', '725', '727', '729', '731', '733', '735', '737', '739', '741', '743', '745', '747', '749', '751', '753', '755', '757', '759', '761', '763', '765', '767', '769', '771', '773', '775', '777', '779', '781', '783', '785', '787', '789', '791', '793', '795', '797', '799', '801', '803', '805', '807', '809', '811', '813', '815', '817', '819', '821', '823', '825', '827', '829', '831', '833', '835', '837', '839',
Additionally, there is this issue: https://www.drupal.org/node/2023703 -- this discusses the potential issues around loading the panelizer settings and some solutions, including adding a lazy loader to get the entity but not load up all the settings until say, view time.
It is apparent that we have an issue on our hand. But was it panelizer?
First -- I would suggest watching Ezra Gildesgame's talk on XHprof. It will give you some ideas about how to troubleshoot performance issues.
Secondly, I'd suggest getting a monitoring tool to average your performance over time. This is also possible to do with XHprof, but thats for another post.
At the Association, we use New Relic for server monitoring. While the full version can give you some really detailed results, our version gives us an overview. Here is a snapshot of what was happening when panelizer was being suspected as slow:
When we take a look at the modules that are being executed, it gives us a little more information about what modules are going slow:
If you don't have New Relic, you can still get this data by setting up XHprof and going to the affected page. It should give you similar results. One nice feature of New Relic is that it gives you XHprof (ish) results for past performance problems. Unless you're profiling every page (which is not recommended for production), you won't have a snapshot result of what happened when the site was going slow.
Looking at the New Relic logs and the query, we know three things:
- This issue is happening on node_add
- This issue is affecting nodes that are related to sessions only. (If you pop in those node ids in the query to events.drupal.org/node/XXX)
- Entityreference module load time correlates to the database slow query time.
So what things could cause multiple session nodes to occur?
- View of many nodes, like a schedule
- field or panel on a node/add page that is making a query of entities
- Session counts (although that would be quite unperformant, loading a node just to get its count.. but I've seen worse things happen)
- an entity reference field
By looking at the New Relic logs, there isn't any reference to the view module causing a problem. This is also during node_add, which doesn't call any views. While cod_sponsors or products might be an (separate) issue, they their peak loads are not occuring at the same time as the db query. That leaves us with entity reference. This hits as the top module at the same time as the panelizer query is being executed.
What node/add pages are using entity reference for sessions?
Only one. Session Evaluations.
Here we are using entityreference prepopulate, which shows the session thats being evaluated. But, there is a problem with this...
So there is the issue. Why didn't this occur in LA? because there were half as many sessions. Its also why we never saw it in any previous drupalcon, because the sites were singular so performance issues like this wouldn't ever have enough content to show its face.
Conclusion / Take aways
- While there might be some architecture tweaks that could be done to Panelizer, it was not the key issue.
- The issue is the combination of modules, exasperated by the amount of content. Entityreference + Entityreference Prepopulate (Disabled select field) + Panelizer + Two cons worth of content = Problem.
- Don't assume a module is slow or unperformant from a single query.
- When an issue is found, don't guess, investigate.
- XHProf and New Relic are essential for large drupal sites.
A simple tweak to the entity reference field has fixed the problem for now. Before the next con, we'll implement some better methods of showing the title so this doesn't occur again. In fact there is already an issue for it.
Of course, this is just one potential issue - but this method demonstrates that a monitoring tool like New Relic plus performance profiling tools like XHProf can allow you to find unexpected interactions between modules that may result in performance problems at scale. Luckily, our fixes translate into fixes for the whole community. This means not only a better drupal.org, but a better Drupal product.