<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LavaTech Incidents</title>
    <link>https://wf.lavatech.top/lavatech-incidents/</link>
    <description>Live updated incident logs from LavaTech</description>
    <pubDate>Tue, 21 Apr 2026 15:38:14 +0000</pubDate>
    <item>
      <title>June 15, 2021 Outage Postmortem</title>
      <link>https://wf.lavatech.top/lavatech-incidents/june-15-2021-outage-postmortem</link>
      <description>On June 15, we had roughly 64 minutes of downtime (between 01:05 UTC and 02:09 UTC) on our US servers, which took down a majority of our services. The root cause was a power outage on one leg of the rack's power feed, the leg the router (which has no redundant PSU) was connected to.</description>
      <content:encoded><![CDATA[<p>On June 15, we had roughly 64 minutes of downtime (between 01:05 UTC and 02:09 UTC) on our US servers, which took down a majority of our services.</p>

<p>This was caused by a power outage on one of the power legs (“A-side”) of the rack said services were located on. Others at the same location have also reported the same issue:</p>

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Our Fremont <a href="https://twitter.com/henet?ref_src=twsrc%5Etfw" rel="nofollow">@henet</a> POP is experiencing a power outage – suites 1300 &amp; 1400 impacted, no ETA</p>— SF Internet Exchange (@sfmix) <a href="https://twitter.com/sfmix/status/1404613224058683393?ref_src=twsrc%5Etfw" rel="nofollow">June 15, 2021</a></blockquote>

<p>The power provided to the rack comes in two legs (“A-side” and “B-side”), and most of the equipment on the rack is connected to both legs for redundancy. The router is among the few pieces of equipment on the rack without a redundant PSU (and definitely the most critical one), and at the time of the power cut it was connected to the leg that went down.</p>

<p>Indeed, when the power cut happened, most of the servers actually stayed up, but without a working network connection. This was also visible afterwards on the energy use graphs of our B-side PDU:</p>

<p><img src="https://elixi.re/t/mm0nv8cbv.png" alt="B-side PDU graphs showing a significant spike from ~9A to ~16.5A for roughly an hour"></p>

<p>As this issue affected our services solely because the router lacks a redundant PSU, we are researching options for replacing it as we speak. More updates will be provided once it is replaced.</p>
]]></content:encoded>
      <guid>https://wf.lavatech.top/lavatech-incidents/june-15-2021-outage-postmortem</guid>
      <pubDate>Wed, 16 Jun 2021 08:50:39 +0000</pubDate>
    </item>
    <item>
      <title>Elixi.re Move Downtime 2020-07-22</title>
      <link>https://wf.lavatech.top/lavatech-incidents/elixi-re-move-downtime-2020-07-22</link>
      <description>Incident report for moving elixi.re from edgebleed to laserjet on 2020-07-22. The move was successful: total upload downtime was ~27 minutes and total access downtime was ~14 minutes, with some domains seeing additional downtime due to Cloudflare caching.</description>
      <content:encoded><![CDATA[<p><em>This is an incident report on our blog page. The page isn&#39;t automatically refreshed; please hit F5 yourself. We post live updates on the elixire discord guild.</em></p>

<h2 id="result" id="result">Result</h2>

<p><strong>This move was successful.</strong></p>

<p>Total upload downtime was ~27 minutes, and total access downtime was ~14 minutes. Some domains are facing additional downtime due to Cloudflare caching.</p>



<h2 id="announcement" id="announcement">Announcement:</h2>

<p>Hello everynyan!</p>

<p>The big day has come: I&#39;m finally feeling like moving elixire from edgebleed to the laserjet serber. The downtime will happen in 2.75 hours, at 5PM UTC. Expected downtime is 30 minutes, but it may be shorter or longer than that. Images will still load, but no new API activity will be allowed during this time (new uploads, registrations, deletions, etc.).</p>

<p>What&#39;s so important about this move, I hear you ask? Here, let me explain it with a dead meme:</p>

<p><img src="https://cdn.discordapp.com/attachments/423303447085973543/735500602020462602/virginvschad.png" alt=""></p>

<p><del>As I should be a responsible adult, I&#39;ll note that it doesn&#39;t have an entire /24 or /32 to itself</del></p>

<h2 id="checklist" id="checklist">Checklist</h2>

<p><strong>Done:</strong></p>
<ul><li>Do initial setup of elixire on laserjet-elixire</li>
<li>Do initial setup of firewall on laserjet-elixire</li>
<li>Announce downtime</li>
<li>Do initial rsync of images from edgebleed-elixire to laserjet-elixire</li>
<li>Do nginx setup, port it for new setup on laserjet-elixire</li>
<li>Prepare readonly mode on edgebleed-elixire</li>
<li>Install pm2 on laserjet-elixire</li>
<li>Set up backups on laserjet-elixire</li>
<li>Do rsync roughly every hour until the downtime</li>
<li>Go on read-only mode on edgebleed-elixire (<code>rm /etc/nginx/sites-enabled/elixire.conf; ln -s /etc/nginx/sites-available/elixire-ro.conf /etc/nginx/sites-enabled/; systemctl reload nginx</code>)</li>
<li>Verify that elixire is on RO mode</li>
<li>Disable backups task on edgebleed-elixire (<code>systemctl disable --now elixirebackupdb.timer; systemctl disable --now elixirebackup.timer</code>)</li>
<li>One last rsync</li>
<li>Dump db on edgebleed-elixire (<a href="https://www.postgresql.org/docs/11/backup-dump.html#BACKUP-DUMP-RESTORE" rel="nofollow">pg_dumpall?</a>)</li>
<li>Copy db dump from edgebleed-elixire to laserjet-elixire (wormhole?)</li>
<li>Import db (<code>psql -f dumpfile postgres</code>; see the sketch after this list)</li>
<li>Start elixire manually on laserjet-elixire</li>
<li>Move reverse records on dabbox, reload nginx</li>
<li>Verify everything works</li>
<li>Move elixire instance to pm2</li>
<li>Announce uptime</li>
<li>Start backups task on laserjet-elixire</li>
<li><strong>Whoops:</strong> Fixed an instance where uploads failed for non-admin users due to missing clamav, clamdscan and clamav-daemon packages (plus <code>systemctl enable --now clamav-daemon</code>). Caused 5 minutes of upload downtime.</li>
<li>Low priority: Configure nginx on laserjet-elixire and do more DNS tweaking so that traffic from CF doesn&#39;t go through dabbox</li>
<li>Verify that DB backups work</li>
<li>Fix DB backups (install jq, import keys of devs for encryption, change script so that ave&#39;s latest key is used (whoops!))</li>
<li><strong>Whoops:</strong> Fixed an instance where elixire was inaccessible due to nginx having issues, and also due to me forgetting to paste in some lines. It&#39;s all good now. This caused a downtime of upwards of 12 minutes. It is still ongoing on some domains, but that is caused by Cloudflare caching; requests to these domains time out.</li>
<li>Verify that image backups work (<a href="https://askubuntu.com/a/1261211/511534" rel="nofollow">I had to move to latest duplicity for reasons I explained here</a>)</li></ul>
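
<p>For reference, the cutover steps above (the final rsync, the DB dump, the wormhole transfer, and the import) amount to something like the following. This is a minimal sketch: the image directory path and the dump filename are assumptions, not the exact commands used.</p>

<pre><code># on edgebleed-elixire: one last sync of the image store (path assumed)
rsync -aHX --delete /srv/elixire/images/ laserjet-elixire:/srv/elixire/images/

# on edgebleed-elixire: dump all databases and roles while elixire is read-only
sudo -u postgres pg_dumpall > /tmp/elixire-dump.sql

# transfer the dump; magic-wormhole prints a code on the sending side...
wormhole send /tmp/elixire-dump.sql
# ...which is then entered on laserjet-elixire
wormhole receive

# on laserjet-elixire: import the dump through the postgres maintenance DB
sudo -u postgres psql -f elixire-dump.sql postgres
</code></pre>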

<p><strong>Running:</strong></p>
<ul><li>We&#39;re (Well, Luna is) investigating an issue with clamav.</li></ul>

<p><strong>To Do:</strong></p>

<p>(none)</p>
]]></content:encoded>
      <guid>https://wf.lavatech.top/lavatech-incidents/elixi-re-move-downtime-2020-07-22</guid>
      <pubDate>Wed, 22 Jul 2020 15:31:22 +0000</pubDate>
    </item>
    <item>
      <title>Done: a3.pm XMPP Services Maintenance Downtime 2020-04-19</title>
      <link>https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19</link>
      <description>Maintenance downtime for a3.pm XMPP services on 2020-04-19: ejabberd was updated from 20.01 to 20.03, PostgreSQL was upgraded from 10 to 12, and the database was cleaned up, cutting it from ~13GB to 1.5GB. The maintenance was successful; total downtime was 2 hours and 12 minutes.</description>
      <content:encoded><![CDATA[<p><em>This is an incident report on our blog page. The page isn&#39;t automatically refreshed; please hit F5 yourself. We post live updates on our discord guild: <a href="https://discord.gg/urgYG9S" rel="nofollow">https://discord.gg/urgYG9S</a></em></p>

<h2 id="result" id="result">Result</h2>

<p><strong>This maintenance was successful. Downtime end time: 8:02PM GMT.</strong> As of 8:30PM GMT, extended backups are also enabled, and as of 8:38PM GMT, automated MAM clearing was deployed; these required no downtime.</p>

<p>Total downtime was 2 hours and 12 minutes. Total maintenance period (excluding time between announcement and downtime) was 2 hours and 48 minutes.</p>



<h2 id="announcement" id="announcement">Announcement:</h2>

<pre><code>Hello everyone,

We&#39;ll be having a maintenance downtime, starting in an hour (5:50PM GMT).

This will be a rather significant maintenance:

- We&#39;ll update from ejabberd 20.01 to 20.03

Changelogs:
https://www.process-one.net/blog/ejabberd-20-02/
https://www.process-one.net/blog/ejabberd-20-03/

- We&#39;ll update from PostgreSQL 10 to 12
- We&#39;ll run a database cleanup (VACUUM right after this announcement, may cause performance issues, and VACUUM FULL after 12 update). We expect this to improve the performance and save a lot of storage space.
- We&#39;ll extend our backups.

We don&#39;t have an ETA on when it will be back, we expect several hours at least.

- You can follow this on our blog at https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19
- Ironically enough, you can also follow the progress live on our discord guild: https://discord.gg/urgYG9S

Thank you for your interest,
Ave
</code></pre>

<h2 id="checklist" id="checklist">Checklist</h2>

<p><strong>Done:</strong></p>
<ul><li>Announce downtime</li>
<li>PostgreSQL VACUUM</li></ul>

<p>The VACUUM wasn&#39;t as efficient as expected, and there isn&#39;t enough space for a dump. We&#39;ll set up a remote disk for a raw backup of the PostgreSQL folder.</p>
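
<p>For context, the plain and full vacuums mentioned here boil down to something like the following; the <code>ejabberd</code> database name is an assumption.</p>

<pre><code># plain VACUUM: marks dead rows reusable, takes no exclusive lock, safe while ejabberd runs
sudo -u postgres psql -d ejabberd -c 'VACUUM (VERBOSE, ANALYZE);'

# VACUUM FULL: rewrites tables and returns space to the OS, but takes exclusive locks,
# hence it was saved for the downtime window
sudo -u postgres psql -d ejabberd -c 'VACUUM FULL;'
</code></pre>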
<ul><li>Set up remote disk for raw backup</li>
<li>Wait until announced downtime</li>
<li>Take down ejabberd</li>
<li>Take a raw database backup to remote disk</li></ul>

<p>This took much longer than expected.</p>
<ul><li>Run VACUUM FULL on PostgreSQL 10</li></ul>

<p>This also wasn&#39;t as efficient as expected. We&#39;ll be wiping old MAMs once the server is up.</p>

<p>Results from <a href="https://makandracards.com/makandra/52141-postgresql-how-to-show-table-sizes" rel="nofollow">this table-size query</a> indicate that MAM cleanup is the way to go:</p>

<pre><code>         relation          | total_size
---------------------------+------------
 public.archive            | 11 GB
 public.caps_features      | 20 MB
 public.pubsub_item        | 13 MB
 public.vcard              | 6072 kB
 public.pubsub_node_option | 4592 kB
(5 rows)
</code></pre>
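
<p>The linked query is along these lines (a generic reconstruction of the usual table-size query, with the <code>ejabberd</code> database name assumed; not necessarily the exact one used):</p>

<pre><code>sudo -u postgres psql -d ejabberd -c "
  SELECT nspname || '.' || relname AS relation,
         pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND c.relkind = 'r'
  ORDER BY pg_total_relation_size(c.oid) DESC
  LIMIT 5;"
</code></pre>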
<ul><li><del>Dump database on postgresql 10 (pg_dump)</del></li>
<li>Install PostgreSQL 12</li>
<li><del>Import the database dump to PostgreSQL 12</del></li>
<li>Use pg_upgrade to upgrade directly from PostgreSQL 10 to 12.</li></ul>

<p><a href="https://www.kostolansky.sk/posts/upgrading-to-postgresql-12/" rel="nofollow">This article helped.</a></p>

<p>Of note: our new PostgreSQL 12 instance used port 5433 after the migration. We had to change this back to 5432 in postgresql.conf and restart the systemd service so that nothing else required config changes.</p>
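
<p>A minimal sketch of the upgrade step, assuming Debian-style package paths and cluster locations (the actual paths on the box may differ):</p>

<pre><code># stop both clusters before upgrading
systemctl stop postgresql

# run pg_upgrade as the postgres user from a writable directory
cd /tmp
sudo -u postgres /usr/lib/postgresql/12/bin/pg_upgrade \
  --old-bindir=/usr/lib/postgresql/10/bin \
  --new-bindir=/usr/lib/postgresql/12/bin \
  --old-datadir=/var/lib/postgresql/10/main \
  --new-datadir=/var/lib/postgresql/12/main \
  --old-options '-c config_file=/etc/postgresql/10/main/postgresql.conf' \
  --new-options '-c config_file=/etc/postgresql/12/main/postgresql.conf'

# the packaged 12 cluster defaults to port 5433; switch it back to 5432
sed -i 's/^port = 5433/port = 5432/' /etc/postgresql/12/main/postgresql.conf
systemctl start postgresql@12-main
</code></pre>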
<ul><li>Delete PSQL 10&#39;s raw files</li>
<li>Take ejabberd 20.01 up</li>
<li>Test that ejabberd 20.01 works with our PostgreSQL 12 setup</li>
<li>Clean MAM on ejabberd</li></ul>

<p>We deleted MAM archives and old messages older than 14 days. This didn&#39;t affect the size yet, but it will after the VACUUM.</p>
<ul><li>Take down ejabberd 20.01</li>
<li>Run <code>VACUUM FULL</code> again</li></ul>

<p>Done! We cut the DB from ~13GB to 1.5GB.</p>
<ul><li>Update ejabberd to 20.03</li>
<li>Take ejabberd up</li>
<li>Do final ejabberd tests</li>
<li>Announce success</li>
<li>Deploy the extended backup mechanisms</li>
<li>Write and deploy a daily task to remove MAM messages older than 14 days (sketched after this list).</li></ul>
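
<p>The daily cleanup task could look roughly like the following systemd timer/service pair. This is a sketch under assumptions: the unit names and the <code>/usr/local/bin/clean-mam.sh</code> script are hypothetical, and the actual deletion command (SQL against the archive table or an ejabberd API call) isn&#39;t reproduced here.</p>

<pre><code># /etc/systemd/system/clean-mam.service (hypothetical)
[Unit]
Description=Remove MAM messages older than 14 days

[Service]
Type=oneshot
ExecStart=/usr/local/bin/clean-mam.sh

# /etc/systemd/system/clean-mam.timer (hypothetical)
[Unit]
Description=Daily MAM cleanup

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# enable with:
#   systemctl enable --now clean-mam.timer
</code></pre>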

<p><strong>Running:</strong></p>

<p>(none)</p>

<p><strong>To Do:</strong></p>

<p>(none)</p>
]]></content:encoded>
      <guid>https://wf.lavatech.top/lavatech-incidents/a3-pm-xmpp-services-maintenance-downtime-2020-04-19</guid>
      <pubDate>Sun, 19 Apr 2020 16:49:37 +0000</pubDate>
    </item>
  </channel>
</rss>