Done: a3.pm XMPP Services Maintenance Downtime 2020-04-19

This is an incident report on our blog. The page doesn't refresh automatically, so please hit F5 yourself. We also post live updates on our Discord guild: https://discord.gg/urgYG9S

Result

This maintenance was successful. Downtime ended at 8:02PM GMT. As of 8:30PM GMT, extended backups are also enabled, and as of 8:38PM GMT, automated MAM clearing is deployed. Neither of these required downtime.

Total downtime was 2 hours and 12 minutes. Total maintenance period (excluding time between announcement and downtime) was 2 hours and 48 minutes.
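The two durations above can be re-derived from the timestamps in this report. This is only a sanity-check sketch: it assumes the maintenance period is counted from the 5:50PM GMT downtime start to the 8:38PM GMT MAM clearing deployment, and the variable names are made up for illustration.

```python
from datetime import datetime

# Times quoted in this report (all GMT, 2020-04-19).
downtime_start = datetime(2020, 4, 19, 17, 50)  # announced downtime start
downtime_end = datetime(2020, 4, 19, 20, 2)     # services back up
last_change = datetime(2020, 4, 19, 20, 38)     # assumed end of the maintenance period


def as_hours_minutes(delta):
    """Convert a timedelta into an (hours, minutes) tuple."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


downtime = as_hours_minutes(downtime_end - downtime_start)
maintenance = as_hours_minutes(last_change - downtime_start)

print("downtime: %dh %dm" % downtime)        # downtime: 2h 12m
print("maintenance: %dh %dm" % maintenance)  # maintenance: 2h 48m
```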

Announcement:

Hello everyone,

We'll be having a maintenance downtime, starting in an hour (5:50PM GMT).

This will be a rather significant maintenance:

- We'll update from ejabberd 20.01 to 20.03

Changelogs:
https://www.process-one.net/blog/ejabberd-20-02/
https://www.process-one.net/blog/ejabberd-20-03/

- We'll update from PostgreSQL 10 to 12
- We'll run a database cleanup: a VACUUM right after this announcement (which may cause performance issues), and a VACUUM FULL after the PostgreSQL 12 update. We expect this to improve performance and save a lot of storage space.
- We'll extend our backups.

We don't have an ETA on when it will be back, but we expect several hours at least.

- You can follow this on our blog at https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19
- Ironically enough, you can also follow the progress live on our discord guild: https://discord.gg/urgYG9S

Thank you for your interest,
Ave

Checklist

Done:

The VACUUM wasn't as efficient as expected, and there isn't enough disk space for a dump. We'll set up a remote for a raw backup of the PostgreSQL folder instead.

This took much longer than expected.

This also wasn't as efficient as expected. We'll be wiping old MAMs once the server is up.

Results from this indicate that MAM cleanup is the way to go:

         relation          | total_size
---------------------------+------------
 public.archive            | 11 GB
 public.caps_features      | 20 MB
 public.pubsub_item        | 13 MB
 public.vcard              | 6072 kB
 public.pubsub_node_option | 4592 kB
(5 rows)
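To put a number on how dominant the archive table is, here is a rough sketch that converts the sizes above to bytes. It assumes the strings use binary units (1 kB = 1024 bytes), as PostgreSQL's `pg_size_pretty` does, and it only covers the five rows shown, so the result is an approximation ("11 GB" is already rounded).

```python
# Size strings copied from the table above.
SIZES = {
    "public.archive": "11 GB",
    "public.caps_features": "20 MB",
    "public.pubsub_item": "13 MB",
    "public.vcard": "6072 kB",
    "public.pubsub_node_option": "4592 kB",
}

# Assumption: binary units, matching pg_size_pretty's convention.
UNIT_BYTES = {
    "bytes": 1,
    "kB": 1024,
    "MB": 1024 ** 2,
    "GB": 1024 ** 3,
    "TB": 1024 ** 4,
}


def to_bytes(pretty):
    """Turn a string like '6072 kB' into a byte count."""
    value, unit = pretty.split()
    return int(value) * UNIT_BYTES[unit]


total = sum(to_bytes(size) for size in SIZES.values())
archive_share = to_bytes(SIZES["public.archive"]) / total

print(f"archive share of listed relations: {archive_share:.1%}")
```

With these figures the archive table (where MAM messages live) accounts for more than 99% of the listed space, which is why cleaning up MAM matters far more than anything else here.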

This article helped.

To note, our new PostgreSQL 12 instance used port 5433 after the migration. We changed this back to 5432 in the PostgreSQL config and restarted the systemd service, so that nothing else required config changes.

We deleted MAM and old messages older than 14 days. This hasn't reduced the database size yet, but it will after a VACUUM.
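For illustration, a 14-day cutoff like the one above could be computed as follows. This is not the exact command we ran. It is a sketch that assumes the MAM table is named `archive` and that its `timestamp` column stores microseconds since the Unix epoch (as in ejabberd's stock SQL schema, to our understanding). The function name and the query string are hypothetical, and nothing here touches a database.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 14  # retention window used in this maintenance
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumed table/column names; check them against your own schema first.
DELETE_OLD_MAM_SQL = "DELETE FROM archive WHERE timestamp < %(cutoff)s"


def mam_cutoff_microseconds(now, days=RETENTION_DAYS):
    """Return the cutoff as integer microseconds since the Unix epoch."""
    cutoff = now - timedelta(days=days)
    # Integer division of timedeltas avoids float rounding.
    return (cutoff - EPOCH) // timedelta(microseconds=1)


if __name__ == "__main__":
    cutoff = mam_cutoff_microseconds(datetime.now(timezone.utc))
    print(DELETE_OLD_MAM_SQL, {"cutoff": cutoff})
```

Rows deleted this way only become reusable space inside the table. The on-disk size shrinks once a VACUUM FULL rewrites it.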

Done! We cut the DB from ~13GB to 1.5GB.

Running:

(none)

To Do:

(none)