Drupal Performance and Scalability With Excessive I/O Load Or Memory Exhaustion
On many occasions, we see web site performance suffering due to misconfiguration or oversight of system resources. Here is an example where a shortage of RAM and excessive disk I/O severely impacted web site performance, and how we fixed both.
A recent project for a client with poor site performance uncovered issues within the application itself, i.e. how the Drupal site was put together. However, overcoming those issues was not enough to achieve the required scalability with several hundred logged-in users on the site at the same time.
First, regarding memory, the server was configured with too many PHP-FPM processes, and that left no room in memory for the filesystem buffers and cache, which help a lot with disk I/O load.
Here is a partial display from when we were monitoring the server before we fixed it:
As you can see, the buffers + cache + free memory all amount to less than 1 GB of total RAM, while the used RAM is over 7GB.
used     buffers   cache    free
7112M    8892k     746M     119M
7087M    9204k     738M     151M
7081M    9256k     770M     125M
7076M    4436k     768M     136M
7087M    4556k     760M     133M
We calculated how much RAM is really needed by watching the main components on the server. In this case the calculation was:
Memcache + MySQL + (Apache2 × number of instances) + (PHP-FPM × number of instances)
We then adjusted the number of PHP-FPM processes down to a reasonable number, so that total application RAM is no more than 70% of the total RAM.
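The calculation above can be sketched in a few lines of Python. This is only an illustration: the function names and the per-process sizes in the example are hypothetical, and real figures have to come from measuring the processes on your own server.

```python
def application_ram_mb(memcache_mb, mysql_mb, apache_proc_mb, apache_procs,
                       php_fpm_proc_mb, php_fpm_procs):
    """Memcache + MySQL + (Apache2 x instances) + (PHP-FPM x instances)."""
    return (memcache_mb + mysql_mb
            + apache_proc_mb * apache_procs
            + php_fpm_proc_mb * php_fpm_procs)


def max_php_fpm_workers(total_ram_mb, memcache_mb, mysql_mb,
                        apache_proc_mb, apache_procs, php_fpm_proc_mb,
                        budget_fraction=0.70):
    """Largest PHP-FPM process count that keeps application RAM within budget."""
    budget_mb = total_ram_mb * budget_fraction
    # Everything except PHP-FPM is treated as a fixed cost.
    fixed_mb = memcache_mb + mysql_mb + apache_proc_mb * apache_procs
    remaining_mb = budget_mb - fixed_mb
    if remaining_mb <= 0:
        return 0
    return int(remaining_mb // php_fpm_proc_mb)


if __name__ == "__main__":
    # Hypothetical sizes, NOT measurements from the server in this article.
    workers = max_php_fpm_workers(
        total_ram_mb=8192, memcache_mb=512, mysql_mb=1024,
        apache_proc_mb=20, apache_procs=50, php_fpm_proc_mb=100)
    print(workers)  # 31 with the numbers above
```

The memory left outside the 70% budget is what the kernel can use for buffers and cache.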
The result is as follows. As you can see, used memory is now 1.8GB instead of 7GB. Free memory will slowly be used by cache and buffers making I/O operations much faster.
used     buffers   cache    free
1858M    50.9M     1793M    4283M
1880M    51.2M     1795M    4258M
1840M    52.1M     1815M    4278M
1813M    52.4M     1815M    4304M
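To compare the two memory displays, it helps to add up buffers + cache + free for a row from each. The sketch below does that for the first row before and after the fix. It assumes the `k` and `M` suffixes mean KiB and MiB, which the monitoring output does not state, and the helper names are made up for this example.

```python
# Conversion factors to MiB; the suffix meanings are an assumption.
UNIT_TO_MB = {"k": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mb(value):
    """Convert a string such as '8892k' or '746M' to a number of MiB."""
    return float(value[:-1]) * UNIT_TO_MB[value[-1]]


def headroom_mb(buffers, cache, free):
    """Memory not claimed by applications: buffers + cache + free."""
    return to_mb(buffers) + to_mb(cache) + to_mb(free)


if __name__ == "__main__":
    before = headroom_mb("8892k", "746M", "119M")    # first row before the fix
    after = headroom_mb("50.9M", "1793M", "4283M")   # first row after the fix
    print(f"before: {before:.0f} MB, after: {after:.0f} MB")
```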
Another issue with the server, partially caused by the above lack of cache and buffers but also by forgotten settings, was a severe bottleneck in disk I/O performance. The disk was so tied up that everything had to wait. I/O wait was 30%, as seen in top and htop. This is very high; it should usually be no more than 1 or 2%.
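If you want to record I/O wait over time rather than read it off top, the same figure can be derived on Linux from the aggregate `cpu` line of `/proc/stat`, where iowait is the fifth counter. The following is a minimal sketch with hypothetical function names, not the tool we used for the figures in this article.

```python
import time


def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The guest counters that follow are skipped because they are
    already included in user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    return [int(v) for v in fields[1:9]]


def iowait_percent(before, after):
    """Share of CPU time spent waiting on I/O between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return iowait_percent(first, second)
```

Calling `sample_iowait(5.0)` every few minutes and logging the result gives a simple history to compare before and after a change.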
We also observed excessive disk reads and writes, as follows:
disk read   disk write   i/o read   i/o write
5199k       1269k        196        59.9
1731k       1045k        80         50.7
7013k       1106k        286        55.2
23M         1168k        607        58.4
9121k       1369k        358        59.7
Upon investigating, we found that the rules_debug_log setting was on. The site had 130 enabled rules and the syslog module was enabled. We found a file under /var/log/ that was growing by more than a GB per day. Writing rules debugging output for every page load tied up the disk when a few hundred users were on the site.
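A runaway log like this is easy to spot by listing the largest files under the log directory. Here is a small sketch of that check; the function name is hypothetical, and you may need elevated privileges to read everything under /var/log/.

```python
import os


def largest_files(root, limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Unreadable file or broken symlink: skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    for size, path in largest_files("/var/log"):
        print(f"{size / (1024 * 1024):10.1f} MB  {path}")
```

Running it on two consecutive days and comparing the sizes shows which file is growing and how fast.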
After disabling the rules debug log settings, wait for I/O went down to 1.3%! A significant improvement.
Here are the disk I/O figures after the fix.
disk read   disk write   i/o read   i/o write
192k        429k         10.1       27.7
292k        334k         16.0       26.3
2336k       429k         83.6       30.7
85k         742k         4.53       30.8
Now the site has response times of 1 second or less, instead of 3-4 seconds.