Tech:Incidents/2018-09-11-all-wikis-down

Summary
Provide a summary of the incident:
 * What services were affected?
 * MediaWiki.


 * How long was there a visible outage?
 * ~5 minutes


 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Human error: I (paladox) should not have run puppet while John was deploying his change.


 * Was the incident aggravated by human contact, users or investigating?
 * No.

Timeline
Provide a timeline of everything that happened from the first reports to the resolution of the incident. If the time of the very first incident is known (previous incident, the time the service failed, time a patch was applied), include it as well. Times should be in 24-hour format in the UTC timezone.

This only affected metawiki and a few other selected sites for beta testing of MWP.


 * Problem started after:


 * [21:04:32] 	[miraheze/mediawiki] JohnFLewis pushed 1 commit to REL1_31 [+0/-0/±1] https://git.io/fArFQ
 * [21:04:33] 	[miraheze/mediawiki] JohnFLewis 82c7b81 - pull MW changes


 * When paladox started running puppet (John disabled puppet just after this, but I was unaware of that until puppet had run fully).


 * The first user report of a 503: [21:09:42] 	paladox: 503


 * We then found the error:


 * Error from line 84 of /srv/mediawiki/w/extensions/ManageWiki/includes/ManageWikiHooks.php: Call to a member function get on null


 * paladox then reverted the MW change on mw3 to at least try to restore service to users while John was working on a fix.


 * John deployed the fix:


 * [21:15:13] 	[miraheze/ManageWiki] JohnFLewis pushed 1 commit to master [+1/-0/±1] https://git.io/fArb5
 * [21:15:14] 	[miraheze/ManageWiki] JohnFLewis 74ca2dc - fix cach->cache
 * [21:16:23] 	[miraheze/mediawiki] JohnFLewis pushed 1 commit to REL1_31 [+0/-0/±1] https://git.io/fArbh
 * [21:16:24] 	[miraheze/mediawiki] JohnFLewis d0d2108 - pull MW


 * mw1 and mw2 then recovered soon after:
 * [21:17:04] 	RECOVERY - cp5 HTTP 4xx/5xx ERROR Rate on cp5 is OK: OK - NGINX Error Rate is 34%

Quick facts
Provide any quick facts that may be relevant:


 * Are there any known issues with the service in production?
 * Nope.


 * Was the cause preventable by us?
 * Yes, I (paladox) should not have been running puppet on the mw servers while John was deploying a change that could be breaking.


 * Have there been any similar incidents?
 * Not that I'm aware of.

Conclusions
Provide conclusions that have been drawn from this incident only:
 * Was the incident preventable? If so, how?
 * Yes: don't run or deploy another ops member's change unless they say to.


 * Is the issue rooted in our infrastructure design?
 * Nope.


 * State any weaknesses and how they can be addressed.
 * No weaknesses.


 * State any strengths and how they prevented or assisted in investigating the incident.
 * None.
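
The preventability point above (do not run puppet while another ops member is deploying) could be backed by a small pre-flight check. The following is only a sketch under stated assumptions, not something used in this incident: the function names are hypothetical, the lock-file path is assumed to come from `puppet config print agent_disabled_lockfile`, and the file is assumed to contain JSON with a `disabled_message` key when the agent has been disabled with a reason. It would not catch the case in this incident where puppet was disabled just after a run had already started.

```python
import json
import os
import tempfile


def puppet_disabled_reason(lockfile):
    """Return the disable message if the agent lock file exists, else None.

    `lockfile` is assumed to be the path reported by
    `puppet config print agent_disabled_lockfile` on the host.
    """
    if not os.path.exists(lockfile):
        return None
    try:
        with open(lockfile) as fh:
            data = json.load(fh)
    except (OSError, ValueError):
        # The file exists but could not be read or parsed; still treat
        # the agent as disabled, because someone created the lock.
        return "unknown reason (lock file unreadable)"
    if isinstance(data, dict):
        # Assumption: the reason is stored under "disabled_message".
        return data.get("disabled_message", "no reason given")
    return "no reason given"


def safe_to_run_puppet(lockfile):
    """True only when no disable lock is present."""
    return puppet_disabled_reason(lockfile) is None


if __name__ == "__main__":
    # Demonstration against a temporary directory instead of a real host.
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "agent_disabled.lock")
        print(safe_to_run_puppet(lock))
        with open(lock, "w") as fh:
            json.dump({"disabled_message": "deploy in progress"}, fh)
        print(safe_to_run_puppet(lock), puppet_disabled_reason(lock))
```

A wrapper like this only helps if the person deploying disables puppet before pushing the change, so it complements, rather than replaces, checking with the other ops member first.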

Actionables
List all things we can do immediately (or in our current state) to prevent this occurring again. Include links to Phabricator issues, which should go into more detail; these should only be one-line notes! e.g. ": Monitor service responses with GDNSD and pool/depool servers based on these."


 * Change varnish monitoring from meta to login to try and reduce depools. T3587

Meta

 * Who responded to this incident?
 * Paladox, John.


 * What services were affected?
 * MediaWiki


 * Who, therefore, needs to review this report?
 * Ops (John)


 * Timestamp.
 * 11-09-2018