community.aegirproject.org
How this site crashed when it was last verified - lessons learned from migrating a "standard site"
We just had a pretty nasty outage on this site, about an hour ago (thanks to smthomas on IRC for the heads up!). All modules were disabled and the frontpage was unthemed and saying only "Page not found".
Ouch.
While distributions can make hosting and porting sites between providers a breeze (I didn't have to guess which Drupal modules I needed to install to import the site), they can have some tricky implications that most people are not used to when doing certain operations. I have taken some time to explain here how I fixed this and how to avoid the problem in the future, so that others can profit from that nasty experience...
After freaking out (Koumbit is the one taking care of hosting this site, so those things often fall on me), I started looking at the problem, and found out that all modules, or more precisely, all OpenAtrium-related modules were disabled. After a "what the heck" headbang, I realized I had the exact same issue back when we deployed the site during the original migration. Back then, the problem was that I didn't pass the --profile=openatrium option to provision-deploy when installing the new site. That, in turn, made the update script disable all OpenAtrium modules, because the Drupal bootstrap can't find them, because they are in the profiles/openatrium ...
(Now you're supposed to have that "aaaaaaah... i seeeeee!!" moment.)
I fixed our procedure (in French, a bit chaotic) by adding the --profile option to the provision-deploy call, which fixed it during that original deployment. But then, the site got imported, and it picked up the profile not from the settings.php (which was correctly configured by provision-deploy) but from the alias, which was created earlier with provision-save, and which then defaulted to the default install profile. On import, the site node was created in the frontend with the default profile.
So when the site was verified after a migrate, the profile was reset back to default again and all modules were disabled when the cache was cleared.
The proper way of doing that deployment was to set the profile right in the alias (through provision-save) in the first place - that way all would have been right.
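For the record, here is roughly what that deployment could look like with the profile set in both places. This is only a sketch: the --profile option on provision-deploy is the one mentioned above, but the platform and database server aliases are hypothetical placeholders, and the other provision-save options are given from memory, so check them against drush help provision-save on your own Aegir install. The deploy_site and run helpers are made up for this example and only print the commands unless DRY_RUN=0 is set.

```shell
# Sketch only: values marked "hypothetical" are placeholders, not the real setup.
site="community.aegirproject.org"
platform="@platform_openatrium"   # hypothetical platform alias
db_server="@server_localhost"     # hypothetical database server alias
backup="backup.tgz"

# Print the command by default; set DRY_RUN=0 to actually run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

deploy_site() {
  # 1. Record the install profile in the alias, so a later verify keeps it.
  run drush provision-save "@$site" --context_type=site --uri="$site" \
    --platform="$platform" --db_server="$db_server" --profile=openatrium

  # 2. Deploy the backup, passing the same profile so settings.php agrees.
  run drush "@$site" provision-deploy "$backup" --profile=openatrium
}

deploy_site
```

The point of keeping both calls in one script is that the profile name only has to be right once, instead of being remembered at two different steps of the procedure.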
If you ever end up in a similar situation, you have a few options:
- restore from a backup - not possible, we had live data in there and changes since the last backup
- enable all missing modules manually - yuck: how do I know? I can look at "Disabled" (as opposed to "Not installed") modules in drush pm-list, but that's not really reliable
- partially restore from backup - what I ended up doing
I took the original backup that was used to deploy the site in the first place and extracted the database.sql:
tar zxf backup.tgz ./database.sql
Then I edited the file (in vi!!) to remove everything but the system table instructions (including the DROP TABLE). And I loaded the dump in the site:
drush @community.aegirproject.org sqlc < database.sql
That way, the system table, and only the system table, was restored from backups and all modules were back in their original enabled/disabled state. A little cache clear (drush @community.aegirproject.org cc all) and the site was fully functional again.
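The hand edit in vi could also be scripted. The following is a minimal sketch, assuming database.sql is a plain mysqldump file that introduces each table with the usual "-- Table structure for table" comment line. The extract_table helper and the system-only.sql file name are made up for illustration. Note that it also drops the header at the top of the dump (the SET statements), so look at the result before loading it.

```shell
# Hypothetical helper: print only the parts of a mysqldump file that belong
# to one table, from its "Table structure" comment up to the next table's.
extract_table() {
  table="$1"
  dump="$2"
  awk -v header="-- Table structure for table \`$table\`" '
    # Every table section starts with this comment; keep printing only
    # while we are inside the section of the requested table.
    /^-- Table structure for table `/ { keep = ($0 == header) }
    keep
  ' "$dump"
}

# Usage, with the file extracted from the backup above:
#   extract_table system database.sql > system-only.sql
#   drush @community.aegirproject.org sqlc < system-only.sql
```

Since the table's section includes its DROP TABLE and CREATE TABLE statements as well as its data, loading the result should replace the system table and leave every other table alone, same as the manual edit.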
Bottom line: be careful with distributions when you move them around. Setting the profile in settings.php and in provision-save is essential for the site to work.