<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LavaTech Incidents</title>
    <link>https://wf.lavatech.top/lavatech-incidents/</link>
    <description>Live updated incident logs from LavaTech</description>
    <pubDate>Tue, 21 Apr 2026 15:38:14 +0000</pubDate>
    <item>
      <title>June 15, 2021 Outage Postmortem</title>
      <link>https://wf.lavatech.top/lavatech-incidents/june-15-2021-outage-postmortem</link>
      <description>On June 15, we had roughly 64 minutes of downtime (between 01:05 UTC and 02:09 UTC) on our US servers, which took down a majority of our services. The root cause was a power outage on one leg of the rack's power feed, the leg the router (which has no redundant PSU) was connected to.</description>
      <content:encoded><![CDATA[<p>On June 15, we had roughly 64 minutes of downtime (between 01:05 UTC and 02:09 UTC) on our US servers, which took down a majority of our services.</p>

<p>This was caused by a power outage on one of the power legs (“A-side”) of the rack said services were located on. Others at the same location have also reported the same issue:</p>

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Our Fremont <a href="https://twitter.com/henet?ref_src=twsrc%5Etfw" rel="nofollow">@henet</a> POP is experiencing a power outage – suites 1300 &amp; 1400 impacted, no ETA</p>— SF Internet Exchange (@sfmix) <a href="https://twitter.com/sfmix/status/1404613224058683393?ref_src=twsrc%5Etfw" rel="nofollow">June 15, 2021</a></blockquote>

<p>The power provided to the rack comes in two legs (“A-side” and “B-side”), and most of the equipment on the rack is connected to both legs for redundancy. The router is among the few pieces of equipment on the rack without a redundant PSU (and definitely the most critical one), and at the time of the power cut it was connected to the leg that went down.</p>

<p>Indeed, when the power cut happened, most of the servers actually stayed up, but without a working network connection. This was also visible afterwards on the energy use graphs of our B-side PDU:</p>

<p><img src="https://elixi.re/t/mm0nv8cbv.png" alt="B-side PDU graphs showing a significant spike from ~9A to ~16.5A for roughly an hour"></p>

<p>As this issue affected our services solely because the router lacks a redundant PSU, we are researching options for replacing it as we speak. More updates will be provided once it is replaced.</p>
]]></content:encoded>
      <guid>https://wf.lavatech.top/lavatech-incidents/june-15-2021-outage-postmortem</guid>
      <pubDate>Wed, 16 Jun 2021 08:50:39 +0000</pubDate>
    </item>
    <item>
      <title>Elixi.re Move Downtime 2020-07-22</title>
      <link>https://wf.lavatech.top/lavatech-incidents/elixi-re-move-downtime-2020-07-22</link>
      <description>Incident report for moving elixi.re from edgebleed to laserjet on 2020-07-22. The move was successful: total upload downtime was ~27 minutes and total access downtime was ~14 minutes, with some domains seeing additional downtime due to Cloudflare caching.</description>
      <content:encoded><![CDATA[<p><em>This is an incident report on our blog page. The page isn&#39;t automatically refreshed; please hit F5 yourself. We post live updates on the elixire discord guild.</em></p>

<h2 id="result" id="result">Result</h2>

<p><strong>This move was successful.</strong></p>

<p>Total upload downtime was ~27 minutes, and total access downtime was ~14 minutes. Some domains are facing additional downtime due to Cloudflare caching.</p>



<h2 id="announcement" id="announcement">Announcement:</h2>

<p>Hello everynyan!</p>

<p>The big day has come: I&#39;m finally feeling like moving elixire from edgebleed to the laserjet serber. The downtime will happen in 2.75 hours, at 5PM UTC. Expected downtime is 30 minutes, but it may be shorter or longer than that. Images will still load, but no new API activity will be allowed during this time (new uploads, registrations, deletions, etc.).</p>

<p>What&#39;s so important about this move, I hear you ask? Here, let me explain it with a dead meme:</p>

<p><img src="https://cdn.discordapp.com/attachments/423303447085973543/735500602020462602/virginvschad.png" alt=""></p>

<p><del>As I should be a responsible adult, I&#39;ll note that it doesn&#39;t have an entire /24 or /32 to itself</del></p>

<h2 id="checklist" id="checklist">Checklist</h2>

<p><strong>Done:</strong></p>
<ul><li>Do initial setup of elixire on laserjet-elixire</li>
<li>Do initial setup of firewall on laserjet-elixire</li>
<li>Announce downtime</li>
<li>Do initial rsync of images from edgebleed-elixire to laserjet-elixire</li>
<li>Do nginx setup, port it for new setup on laserjet-elixire</li>
<li>Prepare readonly mode on edgebleed-elixire</li>
<li>Install pm2 on laserjet-elixire</li>
<li>Set up backups on laserjet-elixire</li>
<li>Do rsync roughly every hour until the downtime</li>
<li>Go on read-only mode on edgebleed-elixire (<code>rm /etc/nginx/sites-enabled/elixire.conf; ln -s /etc/nginx/sites-available/elixire-ro.conf /etc/nginx/sites-enabled/; systemctl reload nginx</code>)</li>
<li>Verify that elixire is on RO mode</li>
<li>Disable backups task on edgebleed-elixire (<code>systemctl disable --now elixirebackupdb.timer; systemctl disable --now elixirebackup.timer</code>)</li>
<li>One last rsync</li>
<li>Dump db on edgebleed-elixire (<a href="https://www.postgresql.org/docs/11/backup-dump.html#BACKUP-DUMP-RESTORE" rel="nofollow">pg_dumpall?</a>)</li>
<li>Copy db dump from edgebleed-elixire to laserjet-elixire (wormhole?)</li>
<li>Import db (<code>psql -f dumpfile postgres</code>; see the sketch after this list)</li>
<li>Start elixire manually on laserjet-elixire</li>
<li>Move reverse records on dabbox, reload nginx</li>
<li>Verify everything works</li>
<li>Move elixire instance to pm2</li>
<li>Announce uptime</li>
<li>Start backups task on laserjet-elixire</li>
<li><strong>Whoops:</strong> Fixed an instance where uploads failed for non-admin users due to missing clamav, clamdscan and clamav-daemon packages (plus <code>systemctl enable --now clamav-daemon</code>). Caused 5 minutes of upload downtime.</li>
<li>Low priority: Configure nginx on laserjet-elixire and do more DNS tweaking so that traffic from CF doesn&#39;t go through dabbox</li>
<li>Verify that DB backups work</li>
<li>Fix DB backups (install jq, import keys of devs for encryption, change script so that ave&#39;s latest key is used (whoops!))</li>
<li><strong>Whoops:</strong> Fixed an instance where elixire was inaccessible due to nginx having issues, and also due to me forgetting to paste in some lines. It&#39;s all good now. This caused a downtime of upwards of 12 minutes. It is still ongoing on some domains, but that is caused by Cloudflare caching; requests to these domains time out.</li>
<li>Verify that image backups work (<a href="https://askubuntu.com/a/1261211/511534" rel="nofollow">I had to move to latest duplicity for reasons I explained here</a>)</li></ul>
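
<p>For reference, the cutover steps above (the final rsync, the DB dump, the wormhole transfer, and the import) amount to something like the following. This is a minimal sketch: the image directory path and the dump filename are assumptions, not the exact commands used.</p>

<pre><code># on edgebleed-elixire: one last sync of the image store (path assumed)
rsync -aHX --delete /srv/elixire/images/ laserjet-elixire:/srv/elixire/images/

# on edgebleed-elixire: dump all databases and roles while elixire is read-only
sudo -u postgres pg_dumpall > /tmp/elixire-dump.sql

# transfer the dump; magic-wormhole prints a code on the sending side...
wormhole send /tmp/elixire-dump.sql
# ...which is then entered on laserjet-elixire
wormhole receive

# on laserjet-elixire: import the dump through the postgres maintenance DB
sudo -u postgres psql -f elixire-dump.sql postgres
</code></pre>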

<p><strong>Running:</strong></p>
<ul><li>We&#39;re (Well, Luna is) investigating an issue with clamav.</li></ul>

<p><strong>To Do:</strong></p>

<p>(none)</p>
]]></content:encoded>
      <guid>https://wf.lavatech.top/lavatech-incidents/elixi-re-move-downtime-2020-07-22</guid>
      <pubDate>Wed, 22 Jul 2020 15:31:22 +0000</pubDate>
    </item>
    <item>
      <title>Done: a3.pm XMPP Services Maintenance Downtime 2020-04-19</title>
      <link>https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19</link>
      <description>Maintenance downtime for a3.pm XMPP services on 2020-04-19: ejabberd was updated from 20.01 to 20.03, PostgreSQL was upgraded from 10 to 12, and the database was cleaned up, cutting it from ~13GB to 1.5GB. The maintenance was successful; total downtime was 2 hours and 12 minutes.</description>
      <content:encoded><![CDATA[<p><em>This is an incident report on our blog page. The page isn&#39;t automatically refreshed; please hit F5 yourself. We post live updates on our discord guild: <a href="https://discord.gg/urgYG9S" rel="nofollow">https://discord.gg/urgYG9S</a></em></p>

<h2 id="result" id="result">Result</h2>

<p><strong>This maintenance was successful. Downtime end time: 8:02PM GMT.</strong> As of 8:30PM GMT, extended backups are also enabled, and as of 8:38PM GMT, automated MAM clearing was deployed; these required no downtime.</p>

<p>Total downtime was 2 hours and 12 minutes. Total maintenance period (excluding time between announcement and downtime) was 2 hours and 48 minutes.</p>



<h2 id="announcement" id="announcement">Announcement:</h2>

<pre><code>Hello everyone,

We&#39;ll be having a maintenance downtime, starting in an hour (5:50PM GMT).

This will be a rather significant maintenance:

- We&#39;ll update from ejabberd 20.01 to 20.03

Changelogs:
https://www.process-one.net/blog/ejabberd-20-02/
https://www.process-one.net/blog/ejabberd-20-03/

- We&#39;ll update from PostgreSQL 10 to 12
- We&#39;ll run a database cleanup (VACUUM right after this announcement, may cause performance issues, and VACUUM FULL after 12 update). We expect this to improve the performance and save a lot of storage space.
- We&#39;ll extend our backups.

We don&#39;t have an ETA on when it will be back, we expect several hours at least.

- You can follow this on our blog at https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19
- Ironically enough, you can also follow the progress live on our discord guild: https://discord.gg/urgYG9S

Thank you for your interest,
Ave
</code></pre>

<h2 id="checklist" id="checklist">Checklist</h2>

<p><strong>Done:</strong></p>
<ul><li>Announce downtime</li>
<li>PostgreSQL VACUUM</li></ul>

<p>The VACUUM wasn&#39;t as efficient as expected, and there isn&#39;t enough space for a dump. We&#39;ll set up a remote disk for a raw backup of the PostgreSQL folder.</p>
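
<p>For context, the plain and full vacuums mentioned here boil down to something like the following; the <code>ejabberd</code> database name is an assumption.</p>

<pre><code># plain VACUUM: marks dead rows reusable, takes no exclusive lock, safe while ejabberd runs
sudo -u postgres psql -d ejabberd -c 'VACUUM (VERBOSE, ANALYZE);'

# VACUUM FULL: rewrites tables and returns space to the OS, but takes exclusive locks,
# hence it was saved for the downtime window
sudo -u postgres psql -d ejabberd -c 'VACUUM FULL;'
</code></pre>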
<ul><li>Set up remote disk for raw backup</li>
<li>Wait until announced downtime</li>
<li>Take down ejabberd</li>
<li>Take a raw database backup to remote disk</li></ul>

<p>This took much longer than expected.</p>
<ul><li>Run VACUUM FULL on PostgreSQL 10</li></ul>

<p>This also wasn&#39;t as efficient as expected. We&#39;ll be wiping old MAMs once the server is up.</p>

<p>Results from <a href="https://makandracards.com/makandra/52141-postgresql-how-to-show-table-sizes" rel="nofollow">this table-size query</a> indicate that MAM cleanup is the way to go:</p>

<pre><code>         relation          | total_size
---------------------------+------------
 public.archive            | 11 GB
 public.caps_features      | 20 MB
 public.pubsub_item        | 13 MB
 public.vcard              | 6072 kB
 public.pubsub_node_option | 4592 kB
(5 rows)
</code></pre>
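
<p>The linked query is along these lines (a generic reconstruction of the usual table-size query, with the <code>ejabberd</code> database name assumed; not necessarily the exact one used):</p>

<pre><code>sudo -u postgres psql -d ejabberd -c "
  SELECT nspname || '.' || relname AS relation,
         pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND c.relkind = 'r'
  ORDER BY pg_total_relation_size(c.oid) DESC
  LIMIT 5;"
</code></pre>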
<ul><li><del>Dump database on postgresql 10 (pg_dump)</del></li>
<li>Install PostgreSQL 12</li>
<li><del>Import the database dump to PostgreSQL 12</del></li>
<li>Use pg_upgrade to upgrade directly from PostgreSQL 10 to 12.</li></ul>

<p><a href="https://www.kostolansky.sk/posts/upgrading-to-postgresql-12/" rel="nofollow">This article helped.</a></p>

<p>Of note: our new PostgreSQL 12 instance used port 5433 after the migration. We had to change this back to 5432 in postgresql.conf and restart the systemd service so that nothing else required config changes.</p>
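
<p>A minimal sketch of the upgrade step, assuming Debian-style package paths and cluster locations (the actual paths on the box may differ):</p>

<pre><code># stop both clusters before upgrading
systemctl stop postgresql

# run pg_upgrade as the postgres user from a writable directory
cd /tmp
sudo -u postgres /usr/lib/postgresql/12/bin/pg_upgrade \
  --old-bindir=/usr/lib/postgresql/10/bin \
  --new-bindir=/usr/lib/postgresql/12/bin \
  --old-datadir=/var/lib/postgresql/10/main \
  --new-datadir=/var/lib/postgresql/12/main \
  --old-options '-c config_file=/etc/postgresql/10/main/postgresql.conf' \
  --new-options '-c config_file=/etc/postgresql/12/main/postgresql.conf'

# the packaged 12 cluster defaults to port 5433; switch it back to 5432
sed -i 's/^port = 5433/port = 5432/' /etc/postgresql/12/main/postgresql.conf
systemctl start postgresql@12-main
</code></pre>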
<ul><li>Delete PSQL 10&#39;s raw files</li>
<li>Take ejabberd 20.01 up</li>
<li>Test that ejabberd 20.01 works with our PostgreSQL 12 setup</li>
<li>Clean MAM on ejabberd</li></ul>

<p>We deleted MAM archives and old messages older than 14 days. This didn&#39;t affect the size yet, but it will after the VACUUM.</p>
<ul><li>Take down ejabberd 20.01</li>
<li>Run <code>VACUUM FULL</code> again</li></ul>

<p>Done! We cut the DB from ~13GB to 1.5GB.</p>
<ul><li>Update ejabberd to 20.03</li>
<li>Take ejabberd up</li>
<li>Do final ejabberd tests</li>
<li>Announce success</li>
<li>Deploy the extended backup mechanisms</li>
<li>Write and deploy a daily task to remove MAM messages older than 14 days (sketched after this list).</li></ul>
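
<p>The daily cleanup task could look roughly like the following systemd timer/service pair. This is a sketch under assumptions: the unit names and the <code>/usr/local/bin/clean-mam.sh</code> script are hypothetical, and the actual deletion command (SQL against the archive table or an ejabberd API call) isn&#39;t reproduced here.</p>

<pre><code># /etc/systemd/system/clean-mam.service (hypothetical)
[Unit]
Description=Remove MAM messages older than 14 days

[Service]
Type=oneshot
ExecStart=/usr/local/bin/clean-mam.sh

# /etc/systemd/system/clean-mam.timer (hypothetical)
[Unit]
Description=Daily MAM cleanup

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# enable with:
#   systemctl enable --now clean-mam.timer
</code></pre>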

<p><strong>Running:</strong></p>

<p>(none)</p>

<p><strong>To Do:</strong></p>

<p>(none)</p>
]]></content:encoded>
      <guid>https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19</guid>
      <pubDate>Sun, 19 Apr 2020 16:49:37 +0000</pubDate>
    </item>
  </channel>
</rss>