#AberdeenCloud - what happened?
The only cloud with no silver lining.
Thu, 2016-06-30 09:54 | By greg
If you were an #AberdeenCloud customer, you’ll be only too aware their platform went bang on the evening of 28th June 2016. Spectacularly.
As it happens, we had spotted the lack of support response, and while there had not (and still has not, I might add - they still have a Sign Up page!) been any communication from #AberdeenCloud that anything was wrong, we were starting to get a little nervous. It took time to ask our customers what they wanted to do, collate the responses, vet a new supplier (our ISO 27001 certification requires we procure carefully), negotiate contracts, and so on, but we had got there.
Having signed a contract with Platform.sh just the day before, we were about to start migrating #AberdeenCloud customers over. Unfortunately, as it turned out, we were just a few days too late.
So what happened? Our timeline of events went something like this:
- On 28-06-2016 at approximately 1900 UTC we got alerts from Pingdom for one of our customers on #AberdeenCloud.
- It had happened earlier in the day, but restarting the container had cleared the issue. We figured it was something triggered by a Drupal cron event, but had not yet managed to investigate. So we did the same again and restarted the container. It did not come back. This was the first sign all was not well.
- Then another site went sideways, so we tried to restart that container. The same thing happened.
- At this point we realised trying to restart containers was making things worse. To test, we restarted the container on a development site we didn't care about, and the same thing happened again.
Right about now we realise things are not at all well with the #AberdeenCloud platform. OK, time for an emergency migration then! No one sleeps tonight! So we:
- Raised an emergency ticket with #AberdeenCloud support (still not responded to, of course).
- Tried to pull backups from the #AberdeenCloud backup manager (which was still available to us), but it failed for a site with no container running.
- Tried to pull a stage backup instead, but that failed too.
- Tried to pull a backup from a seemingly still healthy site (it wasn't healthy, it was just entirely cached by Varnish, as it happened) and that also failed.
- Tried to use `drush` to get databases, but found the Drupal sites no longer had any configuration files and could not connect to their databases.
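For reference, this is roughly what grabbing a database with `drush` would normally look like; a minimal sketch, assuming a working settings.php with valid credentials (exactly what was missing here) and a hypothetical docroot path:

```bash
# From the site directory, drush reads the database credentials out of
# settings.php and can dump the database directly.
cd /path/to/docroot/sites/default   # hypothetical path inside the container

# Dump the site database to a compressed SQL file.
drush sql-dump --gzip --result-file=/tmp/site-backup.sql

# With settings.php gone, drush had no credentials to read, so this is
# exactly the step that failed for us.
```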
And this is when something horrifying became apparent. All those daily backups #AberdeenCloud had been taking for us - and they did work, we had cause to use them just the week before - were, for some unknown reason, taken out by the same platform failure! No backups!
At this point it was time to see if we could pull anything off the running services. We noticed pretty quickly that:
- phpMyAdmin was still running for all sites, even the ones with dead containers, so we used its “Export” feature to grab all live databases.
- Version control was still running, so we quickly updated all codebase copies to make sure we had the latest code.
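Updating the codebases was at least straightforward; a minimal sketch of the kind of loop we ran, assuming the repositories were already cloned locally under a hypothetical ~/aberdeencloud-sites directory:

```bash
# For each local clone, fetch everything from the remote and fast-forward
# the checked-out branch so our copies match the latest pushed code.
for repo in ~/aberdeencloud-sites/*/; do
  echo "Updating ${repo}"
  git -C "${repo}" fetch --all --tags
  git -C "${repo}" pull --ff-only
done
```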
So now we have code and databases, which is good, but still no files. The containers we’d tried to restart were gone. Nothing we could do, it’s dead, Jim. So we tried a few things:
- We started trying to contact people. We got hold of people who used to work for #AberdeenCloud to see if they could help (they couldn't), and we sent Sampo, the CEO, a message via LinkedIn pleading for help (he still hasn't replied, and probably never will). But we ran out of road; we couldn't find anyone able or willing to step in.
- While the communication effort was going on, we tried every trick we could think of to get files off the “good” containers: via SSH, via the `aberdeen` command line client, via SFTP, and by copying files to another location to pull down later. Nothing worked.
- We also tried spidering the sites with `wget` to pull as many assets as possible out of the remaining Varnish caches (a rough sketch follows below), but this had very limited success.
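For what it's worth, the spidering attempt looked something like this; a minimal sketch, assuming a hypothetical example.com site still being served out of Varnish:

```bash
# Mirror whatever Varnish will still serve: follow links, grab page
# requisites (images, CSS, JS) and rewrite links for local browsing.
wget --mirror \
     --page-requisites \
     --adjust-extension \
     --convert-links \
     --no-parent \
     --wait=1 \
     --directory-prefix=/tmp/example.com-cache \
     https://www.example.com/

# Anything that was never cached, or had already expired, simply 404s,
# which is why this only recovered a fraction of the uploaded files.
```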
At this point we started looking at the root cause, and it became pretty clear the mounted directories containing client-uploaded files and Drupal configuration files were no longer there. You could still see the directories when you logged into a container, but that was just a local cache; if you checked the disk space, the mount didn't even show up. That storage was simply gone.
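This sort of thing is easy to check on any Linux box; a minimal sketch of the checks involved, assuming a hypothetical /mnt/files mount point for the shared storage:

```bash
# A directory can still exist (and contain cached copies of files) while
# the network storage behind it is no longer mounted at all.
ls /mnt/files            # may still list stale, locally cached entries

# A healthy mount would appear here with its own filesystem and usage.
df -h | grep /mnt/files  # no output means the mount simply is not there

# findmnt reports nothing for a path that is not an active mount point.
findmnt /mnt/files
```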
But far worse, it seems backup storage depended on the exact same service! This is quite astounding really, but it appears there was no separation between the storage used for backups and the storage used for files and configuration, so if you lose one, you lose the other. You would expect backups to live somewhere more resilient and, frankly, simpler to access. We'd never had cause to question this before: our backups had always worked, and the platform was closed source, so there's no way we could have known this was the case, but there you go. Cloud files gone, sites gone (because no Drupal settings) and backups gone, all in one fell swoop!
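As an illustration of the kind of separation we would have expected (not a description of how #AberdeenCloud actually worked, which we can't know), here is a minimal sketch of a nightly backup that deliberately leaves the platform; the hostnames, paths and database name are all hypothetical:

```bash
#!/usr/bin/env bash
# Dump the database and the uploaded files, then ship both to storage
# that shares no infrastructure with the hosting platform itself.
set -euo pipefail

STAMP=$(date +%F)
BACKUP_DIR=/var/backups/site
mkdir -p "${BACKUP_DIR}"

# Database dump (credentials supplied via ~/.my.cnf, not inline).
mysqldump --single-transaction sitedb | gzip > "${BACKUP_DIR}/db-${STAMP}.sql.gz"

# Uploaded files.
tar -czf "${BACKUP_DIR}/files-${STAMP}.tar.gz" /var/www/html/sites/default/files

# Copy everything to a completely separate provider over SSH.
rsync -az "${BACKUP_DIR}/" backup@offsite.example.net:/srv/backups/site/
```

Lose the platform and you still have yesterday's data somewhere it can't touch.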
We were left with no choice but to proceed with what we had, so we:
- Restored Drupal 6 sites to a virtual machine we had spare (I don't think Platform.sh supports PHP 5.3, though I may be wrong).
- Restored Drupal 7 sites to equivalent Platform.sh accounts.
- Pulled in as many files as we could.
- Are continuing to work with customers to help them recover their files and websites.
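The per-site restore itself was fairly mechanical once we had the phpMyAdmin exports and the code; a minimal sketch, with hypothetical paths and database names, and with settings.php recreated by hand because the originals were lost with the shared storage:

```bash
# Deploy the code we pulled from version control, then point the site at
# a freshly created database on the new host.
cd /var/www/example-site

# Import the database export taken via phpMyAdmin.
mysql example_site_db < ~/exports/example-site.sql

# Copy in whatever uploaded files we managed to recover.
rsync -av ~/recovered-files/example-site/ sites/default/files/

# Clear caches so Drupal rebuilds everything against the restored database.
drush cache-clear all
```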
Anyway, it’s a real shame, because (unknowable backup storage flaw aside) they provided a good service for several years and the platform had proven to be very solid. It’s quite beyond me how someone can allow a business like this to fly into the ground without so much as giving customers a shutdown date with reasonable notice. It is irresponsible beyond belief, but there you go, it happened, and now we have to live with the consequences.
We, at Code Enigma, are obviously very sorry this has impacted some of our customers. We are doing our best to help people recover their sites: automating the import of files, pulling assets from archive.org, checking our support developers' local copies for missing data, and continuing to chase Sampo, offering payment for missing files, though I have no confidence he's ever going to reply.
I will post an FAQ later on other aspects of fallout from this, to help people understand the situation more clearly.