Diagnosing Slow Drupal Operations Using Linux strace
The other day, we were helping a long time client with setting up a new development server configured with Ubuntu Server LTS 16.04, which comes with PHP 7.x. Benchmarks of PHP 7.x show that it is faster than any PHP 5.x version by a measurable margin, hence the client's attempt to move to the newer version of Ubuntu and PHP.
But when we tried benchmarking the new server against the existing server, which has Ubuntu Server LTS 14.04, showed that the new server is extremely slow compared to the existing one.
We set up investigating where the slow down is coming from, and running a quick version of our Drupal Performance Assessment, which is one of our many Drupal services.
We started by checking out the logs, which on this site is set to syslog, so as not to bog down the database.
We saw the following messages:
Nov 19 16:36:31 localhost drupal: http://dev5.example.com|1478727391|Apache Solr|1.1.1.1|http://dev5.example.com/some-page||0||HTTP Status: 0; Message: Request failed: Connection timed out. TCP Connect Timeout.; Response: ; Request: GET /solr/wso/select?start=0&rows=8&fq=bundle%3A%28blog%20OR%20faq%20OR%20...&wt=json&json.nl=map HTTP/1.0#015#012User-Agent: Drupal (+http://drupal.org/)#015#012Connection: close#015#012Host: solr2.example.com:8983#015#012#015#012; Caller: module_invoke() (line 926 of /.../www/includes/module.inc)
Nov 19 16:36:31 localhost drupal: http://dev5.example.com|1478727391|Apache Solr|1.1.1.1|http://dev5.example.com/some-page||0||HTTP 0; Request failed: Connection timed out. TCP Connect Timeout.
As you can see, the server is timing out trying to connect to the Apache Solr service. The existing server does not have this message.
But is this just an error, or something that impacts performance?
There are a variety of tools that can help profile PHP's page load, and see where time is being spend. The most popular tool is xhprof, but setting it up is time consuming, and we needed simpler tools.
So, we settled on a well tried and tested Linux tool, strace. This tool allows one to see what system calls a process is issuing, and measure the time each takes.
Over the years we have used strace, with tried and tested options to give us what we need quickly, without having to install extensions in PHP, and the like.
The command line we use discovers the process IDs of PHP on its own, and then traces all of them:
Caution: NEVER USE THIS ON A LIVE SERVER! You will slow it down considerably, both the CPU and disk access!
strace -f -tt -s 1024 -o /tmp/trace -p `pidof 'php-fpm: pool www' |
sed -e 's/ /,/g'`
Let us explain the options for a bit:
- -f means follow children. This means that if a process forks and creates a new process, the new process will be traces as well.
- -tt outputs a time stamp accurate to microsecond time. This is extremely useful for knowing where time is spent.
- -s means that for most system calls, such as read, a longer output is displayed.
- -o gives the file name to output the trace to.
- -p is a list of process IDs to trace. In this case, we use some shell magic to get the process IDs of the PHP daemon (since we are running PHP-FPM), and then replace the spaces with commas. That way, we don't have to find out the process IDs manually then enter them on the command line. A great time saver!
We run this trace and issue a single request, so the output is not huge. We terminal the strace process by the usual Ctrl-C.
We inspect the output and we find this:
9359 17:02:37.104412 connect(6, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
From the port number, we know this is a connection to the memcached daemon, which is running on the same server. No issues there.
Then we see this:
9359 17:02:37.176523 connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/mysqld/mysqld.sock"}, 29) = 0
That is the connection to MySQL's socket. No issues either.
Now, we see this (IP address obfuscated):
9359 17:02:38.178758 connect(9, {sa_family=AF_INET, sin_port=htons(8983), sin_addr=inet_addr("1.1.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
From the port number, we know that this is Apache Solr.
Then we find the following, repeating more than a hundred times:
9359 17:02:38.179042 select(10, [9], [9], [], {1, 25000}) = 0 (Timeout)
9359 17:02:39.205272 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:39.210606 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:39.235936 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:39.241262 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:39.266552 nanosleep({0, 5000000}, NULL) = 0
Until we see this:
9359 17:02:43.134691 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:43.160097 nanosleep({0, 5000000}, NULL) = 0
9359 17:02:43.165415 select(10, [9], [9], [], {0, 25000}) = 0 (Timeout)
9359 17:02:43.190683 nanosleep({0, 5000000}, NULL) = 0
Did you catch that, or did you miss it? Look closely at the time stamps!
The PHP process spend a full 5 seconds trying to contact the Apache Solr server, but timing out!
No wonder page loading is so slow then.
In this case, there was a block of 'related content' populated from an Apache Solr query. This block was not cached, and for every logged in user, the PHP process waits for the Apache Solr server round trip time, plus query execution time. But worse is if the server times out, because of network issues.