Drupal CDN & Static File Server - The Amazon S3 Way
Category: Internet / Web · Tags: Drupal, S3, File Server, CDN
If you build quality sites that attract a large number of visitors and a lot of interaction, there will eventually come a point when you have to start looking for ways to offload your files and bring down your server overhead. I have been looking into the CDN issue off and on for the past 6 months, and recently I decided it was time to get something dialed in and move forward. I wanted something that required the fewest hacks and was easy and scalable. This post isn't meant to be the end-all of Drupal and CDNs, but rather some insight into the way I have tackled this issue for the time being.
There are a number of options to choose from and a lot of different ways to go about it. You could get a new server locally and load balance your stuff, get another local server and use it as a static file server, team up with a big-time CDN like Akamai or Limelight and go that route, or go the less expensive CDN route with something like Amazon S3.
Note: This article doesn't attempt to explain every little detail of what's going on; it's meant as a guide for a developer to work from.
My Goals
Being somebody who is cost conscious and always trying to get the most bang for my buck, a CDN like Limelight didn't quite seem like an appropriate fit for me. I wanted something a little bit less serious that I could ease into. I don't care about having servers all across the world and getting content to people a few milliseconds faster than other solutions. I just want to take a huge load off my local network and put that load somewhere else. If my site is up and the users are happy, I'm happy.
I also don't care about getting 100% of my site's files offloaded to the CDN. If I can take care of 95% of my load and leave 5% on my main box, it really doesn't bother me. My main goal is to get rid of the majority of the overhead and keep everything scalable and dynamic with a relatively small code footprint.
Which files to offload
With these things in mind, I basically decided that a good route for me is to take care of my two biggest sources of files: images and videos. If I really wanted to get hardcore I could also move my CSS and JavaScript, but I don't see that as being as important as taking care of the major problems first.
Take my site www.gunslot.com for example. If you hit the homepage you will see a lot of thumbnails and image requests. Each page probably has at least 20 image requests, some as many as 50. So I would rather take care of these requests first; then I can worry about the 5 or so CSS and JS requests later if I have to.
And then I have the video content which can be large in filesize and taxing on the server. This also had to go.
Local file system
Another thing to keep in mind is that I don't want to make my CDN act as my local file system. I don't want people to upload straight to the CDN and have it be my only file system; I just see too many errors and bugs going that route. I would rather have everything work normally and smoothly through the default local structure and just copy stuff over to the CDN, which brings up an important point.
Synching your content
Just how should you copy your content over to the CDN? Big-time CDNs like Limelight allow HTTP synchronization (which Ted Serbinski talks about in his article), which basically copies your files over automatically. S3 does not offer this type of functionality, so you will need to go another route.
You could simply copy them all over at once programmatically and call it a day. You could also set up a cron job and copy any new files every few minutes. But how do you know which files are new? How do you know which files get updated and which don't? You could run some PHP scripts using cURL or file_get_contents to check each file's last modified time, but that carries some big overhead (I tried). So when is the best time to copy a file over, and how should you do it? You most likely want to get the file over to the CDN as quickly as possible, but at the same time you want to make sure that if the user updates the file, or something else changes, your CDN reflects that.
My S3 Solution
I chose to go with a hybrid type of approach. Basically, I send a file over to S3 every time it is requested if it is not already there, or if the local copy is newer than the one currently on S3.
I have one main routing function that, when called, will run through the flowchart below and figure out which path to return for any given file, either:
Local: http://www.domain.com/files/full/path/myfile.jpg
or
S3: http://s3domain.com/files/full/path/myfile.jpg
So basically my flowchart goes something like this: check whether the filetype is one I offload, check whether an up-to-date copy of the file is already on S3, and then either return the S3 path, or queue the file for upload and return the local path in the meantime.
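To make that concrete, here's a stripped-down sketch of what a routing function like that could look like in Drupal 6. All the cdn_ function names, tables and variables here are placeholders for the sake of illustration, not the actual code; the helpers are sketched out in the sections that follow.

```php
<?php
/**
 * Decide which URL to hand out for a file (path relative to the files dir).
 * Rough sketch only -- the helper functions are sketched further below.
 */
function cdn_file_url($relative_path) {
  $local_url = base_path() . file_directory_path() . '/' . $relative_path;

  // 1. Only bother with the whitelisted filetypes (images, video, audio...).
  if (!cdn_filetype_allowed($relative_path)) {
    return $local_url;
  }

  $local_file = file_directory_path() . '/' . $relative_path;

  // 2. Is the file on S3 already, and is that copy at least as new as the
  //    local one? (The timestamp comes from my own lookup table, not S3.)
  $s3_changed = cdn_s3_changed_time($relative_path);
  if ($s3_changed && (!file_exists($local_file) || $s3_changed >= filemtime($local_file))) {
    return variable_get('cdn_s3_base_url', 'http://s3domain.com') . '/' . $relative_path;
  }

  // 3. Not on S3 yet (or the local copy is newer): queue it for upload and
  //    serve the local file in the meantime.
  if (file_exists($local_file)) {
    cdn_queue_file($relative_path);
  }
  return $local_url;
}
```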
The Technical
S3 Interfacing
So if any of you have messed with S3, you know that you are going to need a PHP class to put your stuff there. I found a decent class on their forums and went with that (it requires PEAR, unfortunately). At some point I'll probably change this to something better if I find one, but it gets the job done for now.
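If you're curious what one of those classes boils down to, here's a bare-bones, illustrative sketch of a single PUT against S3's REST API (the old signature v2 style) using nothing but cURL and hash_hmac(). This is not the class I'm using; it's just to show what's going on. I'll refer to it as cdn_s3_put() in the hook_exit sketch later, and the bucket and keys come from hypothetical Drupal variables.

```php
<?php
/**
 * Bare-bones PUT to S3 over the (old) signature v2 REST API.
 * Illustrative only -- a proper S3 class does this (and more) for you.
 */
function cdn_s3_put($local_path, $remote_path) {
  $bucket = variable_get('cdn_s3_bucket', '');
  $key    = variable_get('cdn_s3_access_key', '');
  $secret = variable_get('cdn_s3_secret_key', '');

  // You'd want the real mime type here; octet-stream keeps the sketch short.
  $type = 'application/octet-stream';
  $date = gmdate('D, d M Y H:i:s \G\M\T');

  // Verb, Content-MD5 (blank), Content-Type, Date, amz headers, resource.
  $string_to_sign = "PUT\n\n$type\n$date\nx-amz-acl:public-read\n/$bucket/$remote_path";
  $signature = base64_encode(hash_hmac('sha1', $string_to_sign, $secret, TRUE));

  $fp = fopen($local_path, 'rb');
  $ch = curl_init("https://s3.amazonaws.com/$bucket/$remote_path");
  curl_setopt_array($ch, array(
    CURLOPT_PUT => TRUE,
    CURLOPT_INFILE => $fp,
    CURLOPT_INFILESIZE => filesize($local_path),
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_HTTPHEADER => array(
      "Date: $date",
      "Content-Type: $type",
      "x-amz-acl: public-read",
      "Authorization: AWS $key:$signature",
    ),
  ));
  curl_exec($ch);
  $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200;
  curl_close($ch);
  fclose($fp);
  return $ok;
}
```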
Filetype check
This is just a quick little function that basically checks if the filetype is listed in an array of allowed filetypes that I have chosen, like: jpg, jpeg, gif, png, flv, mp3, etc.
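As a rough sketch (the function name is just a placeholder, and the extension list is whatever you want it to be):

```php
<?php
/**
 * TRUE if this is a filetype I'm willing to offload to S3.
 */
function cdn_filetype_allowed($path) {
  $allowed = array('jpg', 'jpeg', 'gif', 'png', 'flv', 'mp3');
  $info = pathinfo($path);
  $ext = isset($info['extension']) ? strtolower($info['extension']) : '';
  return in_array($ext, $allowed);
}
```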
If file exists
So this is one of the most important steps that I took to make this possible. Rather than actually checking whether the file exists on S3 with one of the S3 class functions (slow), or via something like cURL or file_get_contents (slow), I went with a database table on my own server that keeps track of what is on the S3 server (fast). This table records every filepath along with when it was created and changed on S3. I am able to check the changed timestamp from my table against the local file's timestamp (filemtime) to know when the file needs to be updated on S3.
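In Drupal 6 terms that lookup is a one-liner against the table (the table and column names here are placeholders, not the real schema):

```php
<?php
/**
 * When (according to my own table) was this file last pushed to S3?
 * Returns a timestamp, or FALSE if it has never been pushed.
 */
function cdn_s3_changed_time($relative_path) {
  // cdn_s3_file: filepath | created | changed (one row per file on S3).
  return db_result(db_query(
    "SELECT changed FROM {cdn_s3_file} WHERE filepath = '%s'", $relative_path));
}
```

The routing function then just compares that value against filemtime() on the local copy, as in the sketch above.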
Queuing the file for S3
Originally I didn't queue the file at all and just tried to put the file to S3 every time it was requested - mistake. It obviously took waaaay too long to render my pages with this overhead, so I opted for the queue method. I created another database table that keeps track of every file that needs to be put to S3. Basically, every time a file doesn't exist on S3 I return the local path for the time being and add the file to the queue.
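The queue can be as dumb as another table holding a filepath and a timestamp, so queuing is a single insert, de-duplicated on the path so a busy page doesn't queue the same thumbnail fifty times (hypothetical names again):

```php
<?php
/**
 * Remember that this file still needs to be pushed (or re-pushed) to S3.
 */
function cdn_queue_file($relative_path) {
  // Don't queue the same path twice.
  $already = db_result(db_query(
    "SELECT 1 FROM {cdn_s3_queue} WHERE filepath = '%s'", $relative_path));
  if (!$already) {
    db_query("INSERT INTO {cdn_s3_queue} (filepath, queued) VALUES ('%s', %d)",
      $relative_path, time());
  }
}
```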
Putting to S3
As for putting the file to S3, there are a lot of ways to do this and it probably depends a bit on your situation. You could run a cron job or some type of rsync if you really wanted to dial it in, but for the time being I am going with a simpler method: I run one put operation at the end of each page request. This seems to work really well right now and gets the files uploaded pretty much within seconds of when they are queued. At any time I have fewer than 20 or so files in the queue, and obviously once most of my files are on S3 this doesn't need to run anymore. I run my put function using hook_exit and pick one file each time to get it over to S3.
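With a hypothetical module named cdn, the Drupal 6 version of that ends up looking something like the sketch below; cdn_s3_put() stands in for whatever upload call your S3 class provides, like the bare-bones one sketched earlier.

```php
<?php
/**
 * Implementation of hook_exit().
 * After the page has gone out, push at most one queued file to S3.
 */
function cdn_exit() {
  $item = db_fetch_object(db_query_range(
    "SELECT filepath FROM {cdn_s3_queue} ORDER BY queued ASC", 0, 1));
  if (!$item) {
    return;
  }

  $local_file = file_directory_path() . '/' . $item->filepath;
  if (file_exists($local_file) && cdn_s3_put($local_file, $item->filepath)) {
    // Record (or refresh) the file in the lookup table...
    db_query("DELETE FROM {cdn_s3_file} WHERE filepath = '%s'", $item->filepath);
    db_query("INSERT INTO {cdn_s3_file} (filepath, created, changed) VALUES ('%s', %d, %d)",
      $item->filepath, time(), time());
    // ...and drop it from the queue.
    db_query("DELETE FROM {cdn_s3_queue} WHERE filepath = '%s'", $item->filepath);
  }
}
```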
Routing the files with Drupal
So once you have all this ready to go, how do you actually get Drupal to replace the current local paths with the S3 paths? Well, there are a number of ways to do this. If you want to replace all your images, CSS, logos and all that stuff you can patch common.inc, file.inc and theme.inc (maybe more) to run through your routing function. Since I don't care too much about this stuff I skipped that part and decided just to replace imagecache and my videos (for now).
Imagecache
Imagecache comes with a sweet theme_imagecache function that allows you to simply override it in your template.php. You basically just need to tweak the $imagecache_url to run through your routing function and decide which path to return.
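Roughly like this, assuming the Drupal 6 / ImageCache 2 signature; copy the body from your own version of theme_imagecache() and swap MYTHEME for your theme's name (or use phptemplate_imagecache). cdn_file_url() is the hypothetical routing function from earlier.

```php
<?php
/**
 * Override of theme_imagecache() in template.php (Drupal 6 / ImageCache 2
 * signature -- copy the body from your own version and change the URL line).
 */
function MYTHEME_imagecache($presetname, $path, $alt = '', $title = '', $attributes = NULL, $getsize = TRUE) {
  if (is_null($attributes)) {
    $attributes = array('class' => 'imagecache imagecache-' . $presetname);
  }
  if ($getsize && ($image = image_get_info(imagecache_create_path($presetname, $path)))) {
    $attributes['width'] = $image['width'];
    $attributes['height'] = $image['height'];
  }
  $attributes = drupal_attributes($attributes);

  // The only real change: run the derivative's path through the routing
  // function instead of always using the local imagecache URL.
  $imagecache_url = cdn_file_url('imagecache/' . $presetname . '/' . $path);

  return '<img src="' . $imagecache_url . '" alt="' . check_plain($alt)
    . '" title="' . check_plain($title) . '" ' . $attributes . ' />';
}
```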
Notes
If you are using Imagecache for profile pics or nodes, you may need to flush the imagecache when these are updated so that your new file will run through the queue and be uploaded to S3. This can be done via some hook_form_alter or hook_nodeapi calls that run the imagecache_image_flush function.
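A minimal version of the nodeapi route could look like the following, assuming Drupal 6, the hypothetical cdn module and a CCK filefield called field_image; adjust for your own fields (profile pictures would get similar treatment via hook_form_alter).

```php
<?php
/**
 * Implementation of hook_nodeapi().
 * When a node's image changes, flush its imagecache derivatives so the new
 * ones get regenerated, run back through the routing function and re-queued.
 */
function cdn_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if ($op == 'update' && !empty($node->field_image)) {
    foreach ($node->field_image as $file) {
      if (!empty($file['filepath'])) {
        imagecache_image_flush($file['filepath']);
      }
    }
  }
}
```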
Videos
My videos are done through a custom module so I can't really offer any help here. I just plugged my routing function into it.
So that's it!
As you can see, every file request that goes through the routing function always checks to make sure the newest version of the file is on S3. If it's not, it just pulls the local file for the time being. As soon as the S3 copy is ready, it starts pulling that one instead. If you do an update, it will pull the new local file until the new S3 copy is ready again.
That's basically how I'm doing it right now and it's working really well. My server is thanking me and even with fairly high traffic the S3 costs are very reasonable. If anyone has any feedback or tips on how I could make this better lemme know!