Tech:Incidents/2015-11-14-SiteOutage

A network outage on NLSVZS1 took down the servers misc1, db1, parsoid1 and mw1, causing a site outage of around 10 minutes during which Varnish returned 503 errors.

Timeline

 * 20:20 Southparkfan: Icinga began marking many services on cp1, cp2 and ns1 as CRITICAL
 * 20:24 Southparkfan: noticed Miraheze was down and began investigating; the affected services were all on NLSVZS1
 * 20:25 Southparkfan: unable to log in to SolusVM due to slowness
 * 20:27 Southparkfan: managed to log in to SolusVM
 * 20:28 Southparkfan: Icinga marked all services as OK; site was back up
 * 20:37 Southparkfan: emailed RamNode (ticket #421602) because this was not the first time NLSVZS1 had experienced issues
 * 20:42 RamNode: replied that they had been dealing with an outbound DoS attack

Conclusions
An outbound DoS attack was in progress on NLSVZS1. Although RamNode noticed it very quickly, their staff were unable to prevent an outage.

Actionables
While we could not have prevented the outage of NLSVZS1, we should place failover services on other nodes.
 * https://github.com/miraheze/puppet/issues/35 - "Purchase failover servers on an SVZ/CVZ node"

Meta

 * Online during downtime: Southparkfan
 * Affected services: MediaWiki, MariaDB, monitoring (Icinga/Ganglia), Parsoid
 * Signature: Southparkfan (talk) 20:57, 14 November 2015 (UTC)