Tech:Incidents/2016-05-23-mw1

A crash of php5-fpm on mw1 caused a partial site outage lasting 67 minutes (00:00 to 01:07).

Timeline

 * 00:00 mw1 php5-fpm crashes due to OOM (see below for dmesg)
 * 01:07 revi restarts HHVM, mw1 recovers

Conclusions
 May 23 00:00:19 mw1 kernel: [37297076.453605] Out of memory in UB 41075: OOM killed process 25870 (php5-fpm) score 83 vm:767020kB, rss:64936kB, swap:0kB
 (...)
 May 23 00:01:02 mw1 kernel: [37297119.671165] Out of memory in UB 41075: OOM killed process 27488 (php5-fpm) score 8 vm:676492kB, rss:6328kB, swap:0kB
 May 23 00:01:02 mw1 kernel: [37297119.689533] OOM killer in rage, 1 tasks killed
 May 23 00:01:02 mw1 kernel: [37297119.690122] Out of memory in UB 41075: OOM killed process 27489 (php5-fpm) score 9 vm:676496kB, rss:6748kB, swap:0kB
 * mw1 ran out of memory and the kernel OOM killer killed the php5-fpm processes, which took php5-fpm down.
 * John had earlier removed the Varnish health checks because they were causing issues, and they were never re-enabled. As a result, Varnish did not attempt to depool mw1 when php5-fpm went down.
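
For future incidents of this kind, the OOM entries can be summarised with a small script. The following is a minimal sketch that assumes the log format is exactly as quoted above; the function name `parse_oom_kills` and the field names are hypothetical and not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the OOM kill entries in the form quoted in this report.
OOM_RE = re.compile(
    r"Out of memory in UB (?P<ub>\d+): OOM killed process (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\) score (?P<score>\d+) "
    r"vm:(?P<vm_kb>\d+)kB, rss:(?P<rss_kb>\d+)kB, swap:(?P<swap_kb>\d+)kB"
)

def parse_oom_kills(log_text):
    """Return one dict per OOM kill entry found in log_text."""
    kills = []
    for match in OOM_RE.finditer(log_text):
        record = match.groupdict()
        # Convert every numeric field from str to int; 'name' stays a string.
        for key in ("ub", "pid", "score", "vm_kb", "rss_kb", "swap_kb"):
            record[key] = int(record[key])
        kills.append(record)
    return kills

# The three kill entries quoted in this report, plus the "in rage" line,
# which is not a kill entry and is therefore skipped by the pattern.
SAMPLE = (
    "May 23 00:00:19 mw1 kernel: [37297076.453605] Out of memory in UB 41075: "
    "OOM killed process 25870 (php5-fpm) score 83 vm:767020kB, rss:64936kB, swap:0kB\n"
    "May 23 00:01:02 mw1 kernel: [37297119.671165] Out of memory in UB 41075: "
    "OOM killed process 27488 (php5-fpm) score 8 vm:676492kB, rss:6328kB, swap:0kB\n"
    "May 23 00:01:02 mw1 kernel: [37297119.689533] OOM killer in rage, 1 tasks killed\n"
    "May 23 00:01:02 mw1 kernel: [37297119.690122] Out of memory in UB 41075: "
    "OOM killed process 27489 (php5-fpm) score 9 vm:676496kB, rss:6748kB, swap:0kB\n"
)

kills = parse_oom_kills(SAMPLE)
print(Counter(k["name"] for k in kills))              # Counter({'php5-fpm': 3})
print(max(kills, key=lambda k: k["rss_kb"])["pid"])   # 25870
```

Because the pattern is applied with `finditer` over the whole text, it also works when several entries end up on a single line.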

Actionables

 * Re-add the Varnish health checks so that Varnish depools an appserver that stops responding
 * Investigate how to lower memory usage on the appservers
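
To illustrate what the first actionable is meant to achieve, the sketch below models a sliding-window health decision of the kind a Varnish backend probe makes: a backend counts as healthy only while at least `threshold` of the last `window` probes succeeded. This is a simplified model, not Varnish code and not the configuration used here; the class name `BackendHealth` and the values 5 and 3 are hypothetical.

```python
from collections import deque

class BackendHealth:
    """Simplified model of a window/threshold health check (hypothetical)."""

    def __init__(self, window=5, threshold=3):
        if not 0 < threshold <= window:
            raise ValueError("threshold must be between 1 and window")
        self.threshold = threshold
        # Only the most recent `window` probe results are kept.
        self.results = deque(maxlen=window)

    def record(self, ok):
        """Store the outcome of one probe (True = success)."""
        self.results.append(bool(ok))

    @property
    def healthy(self):
        """Healthy while enough of the recent probes succeeded."""
        return sum(self.results) >= self.threshold

# Hypothetical probe history for mw1: five good probes, then php5-fpm dies.
mw1 = BackendHealth(window=5, threshold=3)
for _ in range(5):
    mw1.record(True)
print(mw1.healthy)   # True

mw1.record(False)
mw1.record(False)
print(mw1.healthy)   # True  (3 of the last 5 probes still succeeded)

mw1.record(False)
print(mw1.healthy)   # False (only 2 of the last 5 succeeded: depool)
```

With a check like this in place, a crashed appserver would be taken out of rotation after a few failed probes rather than continuing to receive traffic until someone restarts it manually.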

Meta

 * Incident handled by: revi
 * Affected services: site (~50%, everything that was not cached in Varnish)
 * Signature: Southparkfan (talk) 16:28, 23 May 2016 (UTC)