Tech:Incidents/2015-12-28-SiteOutage

Comment (February 2016): while HHVM *did* crash, it crashed because of another (unknown) problem. Since December we have had a lot of issues with HHVM, and 98% of the time we are running PHP-FPM. Update (March 2016): the cause of most of the crashes seems to be the Linux OOM killer. We've implemented a cron job that restarts HHVM every two hours, and we'll soon look into whether we can assign more memory to our servers.

Something on mw1 (possibly an HHVM overload) brought HHVM down. Combined with a lack of notifications during the downtime, this resulted in 22 minutes of downtime (502 errors).

Timeline

 * 21:17 HHVM went down, first signs of trouble in the NGINX error log
 * 21:37 Southparkfan: noticed Miraheze was down
 * 21:39 Southparkfan: restarted HHVM. Couldn't initially find out why it crashed, until I found that a lot of traffic was going to mw1 at the moment HHVM crashed.

Conclusions

 * A bot (MJ12Bot) was massively requesting dynamic content (special pages) from mw1. At some point HHVM may no longer have been able to keep up with the load and crashed, but we don't know for certain that this was the cause.
 * Icinga failed to notify us when HHVM crashed. It does not have any checks for the HHVM process itself (it only has checks for HTTP/HTTPS). Because nginx was still running, it mistakenly 'thought' Miraheze was still up.
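
To illustrate the missing check, here is a minimal sketch of a Nagios/Icinga-style plugin that looks at the backend process itself rather than at the web server in front of it. It is not the check that was actually deployed. The function names (find_pids, evaluate, main) and the process name "hhvm" are assumptions made for the example, and it assumes the pgrep utility is available on the host.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def find_pids(process_name):
    """Return the PIDs of processes whose name matches exactly (via pgrep)."""
    result = subprocess.run(
        ["pgrep", "-x", process_name], capture_output=True, text=True
    )
    # pgrep exits 0 on a match, 1 when nothing matched, and >1 on errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip() or "pgrep failed")
    return [int(pid) for pid in result.stdout.split()]


def evaluate(process_name, pids):
    """Map the list of PIDs to a plugin exit code and status line."""
    if pids:
        return OK, f"OK - {process_name} running ({len(pids)} process(es))"
    return CRITICAL, f"CRITICAL - {process_name} is not running"


def main(process_name="hhvm"):  # "hhvm" is an assumed process name
    """Print the status line and return the exit code for the monitoring system."""
    try:
        code, message = evaluate(process_name, find_pids(process_name))
    except (OSError, RuntimeError) as exc:
        code, message = UNKNOWN, f"UNKNOWN - {exc}"
    print(message)
    return code
```

A wrapper script would end with sys.exit(main()), so that the monitoring system sees the exit code. Unlike an HTTP check that is satisfied as long as nginx accepts connections, this one goes critical as soon as the backend process disappears.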

Actionables

 * Deny (or slow down) robots via robots.txt - ✅ in https://github.com/miraheze/puppet/commit/83fcafec7193d3124863fb670b4f28897713abcf
 * Make sure Icinga notifies us if HHVM isn't running - ✅ in https://github.com/miraheze/puppet/commit/b8d6c682666ab8b569f5060a461aa9dadf913063
 * Deploy additional MediaWiki servers that can serve the traffic until HHVM is restarted by someone - ✅ mw2 has been deployed.
 * Add Terms of Use so that people know how often they can send requests without breaking things.
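
As a rough illustration of the robots.txt actionable, the sketch below uses Python's standard urllib.robotparser to show how a well-behaved crawler would interpret rules that slow down a bot and keep it away from special pages. The rules in SAMPLE_ROBOTS_TXT, the example.org URLs and the helper name load_rules are hypothetical and are not the contents of the linked commit. It also assumes the bot honours robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-delay: 10
Disallow: /wiki/Special
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Seconds the named bot is asked to wait between requests.
delay = rules.crawl_delay("MJ12bot")

# Dynamic special pages are off limits for the bot, normal pages are not.
special_allowed = rules.can_fetch(
    "MJ12bot", "https://example.org/wiki/Special:RecentChanges"
)
article_allowed = rules.can_fetch("MJ12bot", "https://example.org/wiki/Main_Page")

print(delay, special_allowed, article_allowed)
```

Note that robots.txt is advisory: it only helps against crawlers that choose to respect it, which is one reason the process check and the extra MediaWiki server remain useful.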

Meta

 * Online during downtime: Southparkfan
 * Affected services: MediaWiki
 * Signature: Southparkfan (talk) 22:26, 28 December 2015 (UTC)