Elixi.re Move Downtime 2020-07-22
This is an incident report on our blog page. The page isn't refreshed automatically, so please hit F5 yourself. We also post live updates on the elixire Discord guild.
Result
This move was successful.
Total upload downtime was ~27 minutes, and total access downtime was ~14 minutes. Some domains are seeing more downtime than that due to Cloudflare.
Announcement:
Hello everynyan!
The big day has come, I'm finally feeling like moving elixire from edgebleed to laserjet serber. The downtime will start in 2.75 hours, at 5PM UTC. Expected downtime is 30 minutes, but it may be shorter or longer than that. Images will still load, but no new API activity will be allowed during this time (new uploads, registrations, deletions, etc.).
What's so important about this move, I hear you asking? Here, let me explain it with a dead meme:
As I should be a responsible adult, I'll note that it doesn't have an entire /24 or /32 to itself.
Checklist
Done:
- Do initial setup of elixire on laserjet-elixire
- Do initial setup of firewall on laserjet-elixire
- Announce downtime
- Do initial rsync of images from edgebleed-elixire to laserjet-elixire
- Do nginx setup, port it for new setup on laserjet-elixire
- Prepare readonly mode on edgebleed-elixire
- Install pm2 on laserjet-elixire
- Set up backups on laserjet-elixire
- Do rsync roughly every hour until the downtime
- Go on read-only mode on edgebleed-elixire (rm /etc/nginx/sites-enabled/elixire.conf; ln -s /etc/nginx/sites-available/elixire-ro.conf /etc/nginx/sites-enabled/; systemctl reload nginx)
- Verify that elixire is on RO mode
- Disable backups task on edgebleed-elixire (systemctl disable --now elixirebackupdb.timer; systemctl disable --now elixirebackup.timer)
- One last rsync
- Dump db on edgebleed-elixire (pg_dumpall?)
- Copy db dump from edgebleed-elixire to laserjet-elixire (wormhole?)
- Import db (psql -f dumpfile postgres)
- Start elixire manually on laserjet-elixire
- Move reverse records on dabbox, reload nginx
- Verify everything works
- Move elixire instance to pm2
- Announce uptime
- Start backups task on laserjet-elixire
- Whoops: Fixed an instance where uploads failed for non-admin users due to missing clamav, clamdscan and clamav-daemon packages (plus systemctl enable --now clamav-daemon). Caused 5 minutes of upload downtime.
- Low priority: Configure nginx on laserjet-elixire and do more DNS tweaking so that traffic from CF doesn't go through dabbox
- Verify that DB backups work
- Fix DB backups (install jq, import keys of devs for encryption, change script so that ave's latest key is used (whoops!))
- Whoops: Fixed an instance where elixire was inaccessible due to nginx having issues, and also due to me forgetting to paste in some lines. It's all good now. Caused upwards of 12 minutes of downtime. This is still ongoing on some domains, but there it is caused by Cloudflare caching: requests to these domains time out.
- Verify that image backups work (I had to move to latest duplicity for reasons I explained here)
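
For anyone who wants to replay the edgebleed-side part of the checklist above, here is a rough sketch of those steps as a script. It is not the script that was actually run during the move. The image directory, the dump path, the SSH host alias, the DRY_RUN switch and the function names run and cutover_old_host are all assumptions made up for illustration. The nginx and systemctl commands are the ones quoted in the checklist. The rsync flags and the use of pg_dumpall with -f are guesses, since the checklist only names the tools. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the old-host (edgebleed-elixire) cutover steps from the checklist.
# IMAGE_DIR, NEW_HOST, DUMP_FILE, DRY_RUN, run and cutover_old_host are
# hypothetical names for illustration, not taken from the real setup.
set -eu

IMAGE_DIR="${IMAGE_DIR:-/srv/elixire/images/}"   # assumed image location
NEW_HOST="${NEW_HOST:-laserjet-elixire}"         # assumed SSH host alias
DUMP_FILE="${DUMP_FILE:-/tmp/elixire-dump.sql}"  # assumed dump path
DRY_RUN="${DRY_RUN:-1}"                          # 1 = only print commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

cutover_old_host() {
    # 1. Swap the nginx site for the read-only config and reload.
    run rm /etc/nginx/sites-enabled/elixire.conf
    run ln -s /etc/nginx/sites-available/elixire-ro.conf /etc/nginx/sites-enabled/
    run systemctl reload nginx

    # 2. Stop the backup timers so nothing runs mid-move.
    run systemctl disable --now elixirebackupdb.timer
    run systemctl disable --now elixirebackup.timer

    # 3. One last image sync to the new host (archive mode is an assumption).
    run rsync -a "$IMAGE_DIR" "$NEW_HOST:$IMAGE_DIR"

    # 4. Dump the whole database cluster to a single SQL file.
    run pg_dumpall -f "$DUMP_FILE"
}

cutover_old_host
```

Running it with DRY_RUN=0 would execute the commands for real, which needs root for the nginx and systemctl parts and a PostgreSQL superuser for the dump. Copying the dump over is left out on purpose, since the checklist only floats wormhole as an option. On laserjet-elixire the import would then be the psql -f dumpfile postgres step from the checklist.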
Running:
- We're (well, Luna is) investigating an issue with clamav.
To Do:
(none)