Done: a3.pm XMPP Services Maintenance Downtime 2020-04-19
This is an incident report on our blog page. The page isn't automatically refreshed, so please hit F5 yourself. We also post live updates on our discord guild: https://discord.gg/urgYG9S
Result
This maintenance was successful. Downtime end time: 8:02PM GMT. As of 8:30PM GMT, extended backups are also enabled, and as of 8:38PM GMT, automated MAM clearing is deployed. Neither of these required downtime.
Total downtime was 2 hours and 12 minutes. Total maintenance period (excluding time between announcement and downtime) was 2 hours and 48 minutes.
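As a quick check of the figures above, here is a minimal sketch that recomputes both durations. It assumes the downtime began exactly at the announced 5:50PM GMT and that the maintenance period ended with the 8:38PM GMT deployment; the helper name `at` is only for illustration.

```python
from datetime import datetime

def at(clock: str) -> datetime:
    """Parse a 12-hour clock time on the maintenance day (2020-04-19, GMT)."""
    return datetime.strptime(f"2020-04-19 {clock}", "%Y-%m-%d %I:%M%p")

downtime_start = at("5:50PM")   # assumed: downtime began as announced
downtime_end = at("8:02PM")     # ejabberd back up
last_deploy = at("8:38PM")      # automated MAM clearing deployed

downtime = downtime_end - downtime_start
maintenance = last_deploy - downtime_start

print(downtime)     # 2:12:00
print(maintenance)  # 2:48:00
```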
Announcement:
Hello everyone,
We'll be having a maintenance downtime, starting in an hour (5:50PM GMT).
This will be a rather significant maintenance:
- We'll update from ejabberd 20.01 to 20.03
Changelogs:
https://www.process-one.net/blog/ejabberd-20-02/
https://www.process-one.net/blog/ejabberd-20-03/
- We'll update from PostgreSQL 10 to 12
- We'll run a database cleanup (VACUUM right after this announcement, which may cause performance issues, and VACUUM FULL after the PostgreSQL 12 update). We expect this to improve performance and save a lot of storage space.
- We'll extend our backups.
We don't have an ETA on when it will be back, but we expect it to take several hours at least.
- You can follow this on our blog at https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19
- Ironically enough, you can also follow the progress live on our discord guild: https://discord.gg/urgYG9S
Thank you for your interest,
Ave
Checklist
Done:
- Announce downtime
- PostgreSQL VACUUM
The VACUUM wasn't as effective as expected, and there isn't enough space for a dump. We'll set up a remote disk for a raw backup of the PostgreSQL folder.
- Set up remote disk for raw backup
- Wait until announced downtime
- Take down ejabberd
- Take a raw database backup to remote disk
This took much longer than expected.
- Run VACUUM FULL on PostgreSQL 10
This also wasn't as efficient as expected. We'll be wiping old MAMs once the server is up.
Results from this indicate that MAM cleanup is the way to go:
         relation          | total_size
---------------------------+------------
 public.archive            | 11 GB
 public.caps_features      | 20 MB
 public.pubsub_item        | 13 MB
 public.vcard              | 6072 kB
 public.pubsub_node_option | 4592 kB
(5 rows)
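To put a number on why MAM cleanup looked like the right move, here is a rough sketch that converts the sizes above to bytes and works out how much of the top five relations is taken by public.archive (the table ejabberd uses for MAM). It assumes the figures are PostgreSQL's rounded, binary-unit pretty sizes, so the result is approximate; `UNITS` and `to_bytes` are illustrative names.

```python
# Binary multipliers, assuming the sizes were printed in PostgreSQL's
# pretty-size format (kB = 1024 bytes, and so on).
UNITS = {"bytes": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def to_bytes(pretty: str) -> int:
    """Convert a string like '6072 kB' to a byte count."""
    value, unit = pretty.split()
    return int(value) * UNITS[unit]

# Figures copied from the table above.
sizes = {
    "public.archive": "11 GB",
    "public.caps_features": "20 MB",
    "public.pubsub_item": "13 MB",
    "public.vcard": "6072 kB",
    "public.pubsub_node_option": "4592 kB",
}

total = sum(to_bytes(s) for s in sizes.values())
archive_share = to_bytes(sizes["public.archive"]) / total

print(f"archive share of the top five relations: {archive_share:.1%}")
# archive share of the top five relations: 99.6%
```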
- Dump database on PostgreSQL 10 (pg_dump) (skipped: not enough disk space for a dump)
- Install PostgreSQL 12
- Import the database dump to PostgreSQL 12 (skipped, since no dump was taken)
- Use pg_upgrade to upgrade directly from PostgreSQL 10 to 12.
Note: our new PostgreSQL 12 instance used port 5433 after migration. We changed this back to 5432 in the PostgreSQL config and restarted the systemd service, so that nothing else needed config changes.
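For illustration, the port change can be sketched as a small text rewrite of the `port` setting. This is not the exact procedure we ran: the sample config text is made up, `set_port` is a hypothetical helper, and it assumes the setting is written in the usual `port = NNNN` form. The service still has to be restarted afterwards for the change to take effect.

```python
import re

# Matches an active (uncommented) "port = NNNN" line, keeping any trailing comment.
PORT_LINE = re.compile(r"^([ \t]*port[ \t]*=[ \t]*)(\d+)(.*)$", re.MULTILINE)

def set_port(conf_text: str, new_port: int) -> str:
    """Return conf_text with every active port setting changed to new_port."""
    return PORT_LINE.sub(
        lambda m: f"{m.group(1)}{new_port}{m.group(3)}", conf_text
    )

# Made-up sample of what the new instance's config might contain.
sample = (
    "# connection settings\n"
    "port = 5433\t\t# (change requires restart)\n"
    "max_connections = 100\n"
)

updated = set_port(sample, 5432)
print(updated)
```

Commented-out lines such as `#port = 5432` are left alone, so only the active setting changes.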
- Delete PostgreSQL 10's raw files
- Take ejabberd 20.01 up
- Test that ejabberd 20.01 works with our PostgreSQL 12 setup
- Clean MAM on ejabberd
We deleted MAM messages and old messages more than 14 days old. This hasn't reduced the database size yet, but it will after the next VACUUM FULL.
- Take down ejabberd 20.01
- Run VACUUM FULL again
Done! We cut the DB from ~13GB to 1.5GB.
- Update ejabberd to 20.03
- Take ejabberd up
- Do final ejabberd tests
- Announce success
- Deploy the extended backup mechanisms
- Write and deploy a daily task to remove MAM messages older than 14 days.
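As a rough idea of what such a daily task can look like, here is a minimal sketch built around ejabberd's `delete_old_mam_messages` admin command. It is not necessarily the task we deployed: the function names, the dry-run default, and running it from a daily scheduler such as cron or a systemd timer are all assumptions.

```python
import subprocess
from typing import List

RETENTION_DAYS = 14  # keep two weeks of MAM history, as in the checklist

def mam_cleanup_command(days: int = RETENTION_DAYS, kind: str = "all") -> List[str]:
    """Build the ejabberdctl call that deletes MAM messages older than `days`.

    `kind` is the message type accepted by the command: "chat", "groupchat" or "all".
    """
    return ["ejabberdctl", "delete_old_mam_messages", kind, str(days)]

def run_cleanup(dry_run: bool = True) -> List[str]:
    """Run the cleanup, or only return the command when dry_run is set."""
    cmd = mam_cleanup_command()
    if not dry_run:
        # Raises CalledProcessError if ejabberdctl exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    # Dry run by default so the sketch is safe to try on any machine.
    print(" ".join(run_cleanup(dry_run=True)))
```

Calling `run_cleanup(dry_run=False)` once a day from the scheduler would perform the actual deletion.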
Running:
(none)
To Do:
(none)