Tech:Incidents/2018-09-05-metawiki-down

Summary
Provide a summary of the incident:
 * What services were affected?
 * metawiki
 * How long was there a visible outage?
 * Approximately 5 minutes.
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Human error: ManageWikiPermissions was switched on for metawiki before populateGroupPermissions.php had been run.
 * Was the incident aggravated by human contact, users or investigating?
 * No

Timeline
Provide a timeline of everything that happened from the first reports to the resolution of the incident. If the time of the very first incident is known (previous incident, the time the service failed, time a patch was applied), include it as well. Times should be in 24-hour format, in UTC.


 * [19:05:18] <+Not-d1e4> [miraheze/mw-config] paladox pushed 1 commit to master [+0/-0/±1] https://git.io/fA0So (paladox switches on wgManageWikiPermissionsManagement for metawiki)
 * First reports of metawiki having permission errors for logged-out users and non-sysops:
 * [19:06:46]  Uh.. meta just went down
 * [19:07:26]  https://i.imgur.com/iMvNOag.jpg
 * [19:07:32]  ^ Meta is down
 * John asked whether I had run the maintenance script ([19:08:14] <@JohnLewis> did you run the maintenance script?)
 * I (paladox) replied ([19:08:26] <@paladox> what maint script?)
 * John then told me which script ([19:08:37] <@JohnLewis> populaeGroupPermissions)
 * [19:10:38] <@paladox>  JohnLewis i ran "/srv/mediawiki/w/extensions/ManageWiki/maintenance/populateGroupPermissions.php" (but failed)
 * John then told me that I should have run it before enabling ManageWikiPermissions on metawiki ([19:10:51] <@JohnLewis> paladox: "that should have been ran BEFORE enabling it")
 * I fixed the permissions that were preventing anonymous users from viewing metawiki ([19:11:38] <@paladox>  Fixed JohnLewis ), but the fix was not yet complete.
 * John informed me that I had not totally fixed it yet:
 * [19:12:18] <@JohnLewis>  uh not quite
 * [19:12:34] <@JohnLewis>  you haven't move any of the meta wgGroupPermissions config over as needed
 * [19:13:04] <@JohnLewis>  stewards have no local access besides userrights
 * [19:16:36] <@paladox>  JohnLewis so i copy
 * [19:16:41] <@paladox>  https://github.com/miraheze/mw-config/blob/master/LocalSettings.php#L5033
 * [19:16:44] <@paladox>  to wgManageWikiPermissionsAdditionalRights
 * [19:17:21] <@paladox>  ?
 * [19:18:05] <@Voidwalker> pretty sure just the section for steward is missing
 * [19:18:17] <@JohnLewis>  for any blacklist listed groups and rights yes
 * [19:19:26] <+Not-d1e4> [mw-config] paladox created branch paladox-patch-1 - https://git.io/vbvb3
 * [19:19:28] <+Not-d1e4> [miraheze/mw-config] paladox pushed 1 commit to paladox-patch-1 [+0/-0/±1] https://git.io/fA0HT
 * [19:19:29] <+Not-d1e4> [miraheze/mw-config] paladox 3a0e98c - Add metawiki to wgManageWikiPermissionsAdditionalRights
 * [19:19:31] <+Not-d1e4> [mw-config] paladox opened pull request #2428: Add metawiki to wgManageWikiPermissionsAdditionalRights - https://git.io/fA0Hk
 * [19:19:34] <@paladox>  JohnLewis Voidwalker ^^
 * [19:19:49] <@JohnLewis>  completely ignored
 * [19:19:55] <@JohnLewis>  I said blacklisted groups and rights oNLY
 * [19:20:03] <@JohnLewis>  otherwise that makes ManageWikiPermissions useless
 * [19:28:54] <+Not-d1e4> [mw-config] paladox synchronize pull request #2428: Add metawiki to wgManageWikiPermissionsAdditionalRights - https://git.io/fA0Hk
 * [19:28:55] <+Not-d1e4> [miraheze/mw-config] paladox pushed 1 commit to paladox-patch-1 [+0/-0/±1] https://git.io/fA0HS
 * [19:28:57] <+Not-d1e4> [miraheze/mw-config] paladox 8dd1a23 - Update LocalSettings.php
 * [19:28:57] <@paladox>  merging that
 * [19:29:14] JohnLewis sighs
 * [19:29:20] <@JohnLewis>  you merge, I'll fix
 * [19:34:27] <+Not-d1e4> [miraheze/mw-config] JohnFLewis pushed 1 commit to master [+0/-0/±1] https://git.io/fA0Qs
 * [19:34:28] <+Not-d1e4> [miraheze/mw-config] JohnFLewis 8702ef5 - fix additionalrights array
 * [19:34:38] <@JohnLewis>  paladox: ^
 * [19:36:57] <+Not-d1e4> [miraheze/mw-config] paladox pushed 1 commit to master [+0/-0/±1] https://git.io/fA0Qg
 * [19:36:58] <+Not-d1e4> [miraheze/mw-config] paladox ea8679c - Remove metawiki from wgGroupPermissions has been migrated to ManageWikiPermissions
 * [19:39:56] <@Voidwalker> hmm, steward lost access to managewiki and noratelimit with that change
 * [19:41:53] <@JohnLewis>  Voidwalker: make a PR and readd them then ig
 * [19:44:52] <+Not-d1e4> [mw-config] The-Voidwalker opened pull request #2429: add managewiki and noratelimit back to steward - https://git.io/fA07n
 * [19:45:24] <+Not-d1e4> [mw-config] JohnFLewis closed pull request #2429: add managewiki and noratelimit back to steward - https://git.io/fA07n
 * [19:45:25] <+Not-d1e4> [miraheze/mw-config] JohnFLewis pushed 2 commits to master [+0/-0/±2] https://git.io/fA078
 * [19:45:26] <+Not-d1e4> [miraheze/mw-config] The-Voidwalker 855c54d - add managewiki and noratelimit back to steward
 * [19:45:28] <+Not-d1e4> [miraheze/mw-config] JohnFLewis 28b43f8 - Merge pull request #2429 from The-Voidwalker/patch-5 add managewiki and noratelimit back to steward
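
The outage estimate in the summary can be cross-checked against the timestamps above. The following is a minimal sketch, assuming the visible outage ended with the anonymous-access fix at 19:11:38; the function and variable names are illustrative and not part of any Miraheze tooling.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM:SS timestamps (UTC)."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Timestamps taken from the timeline; treating 19:11:38 as the end of the
# visible outage is an assumption based on the "Fixed" message.
config_push = "19:05:18"   # wgManageWikiPermissionsManagement switched on
first_report = "19:06:46"  # first "meta just went down" report
anon_fix = "19:11:38"      # anonymous users could view metawiki again

since_push = elapsed(config_push, anon_fix)
since_report = elapsed(first_report, anon_fix)

print("from config push:", since_push)
print("from first report:", since_report)
```

Measured from the first report, the window is a little under five minutes, which is consistent with the estimate in the summary; measured from the config push it is somewhat longer.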

Quick facts
Provide any relevant quick facts that may be relevant:
 * Are there any known issues with the service in production?
 * Nope
 * Was the cause preventable by us?
 * Yes. Had I (paladox) run the script and moved the settings to wgManageWikiPermissionsAdditionalRights, this would have been prevented.
 * Have there been any similar incidents?
 * Nope.

Conclusions
Provide conclusions that have been drawn from this incident only:
 * Was the incident preventable? If so, how?
 * Yes, by running the maintenance script before enabling ManageWikiPermissions.
 * Is the issue rooted in our infrastructure design?
 * Nope.
 * State any weaknesses and how they can be addressed.
 * None.
 * State any strengths and how they prevented or assisted in investigating the incident.

Actionables
List all things we can do immediately (or in our current state) to prevent this occurring again. Include links to Phabricator issues which should go into more detail, these should only be one line notes! e.g. ": Monitor service responses with GDNSD and pool/depool servers based on these."
 * As this was human error, nothing can prevent it from occurring in the future except remembering to run the maintenance script and to move permissions to wgManageWikiPermissionsAdditionalRights.
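
One way to make that reminder harder to skip is a small pre-deploy checklist that refuses to proceed until both steps are confirmed. The following is a minimal sketch only: `DeployState`, `missing_prerequisites` and `can_enable_manage_wiki_permissions` are hypothetical names, the two flags would have to be set by whoever performs the deploy, and nothing here reflects existing Miraheze tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeployState:
    """Hypothetical record of the manual steps done for one wiki."""
    # populateGroupPermissions.php has been run for the wiki
    permissions_populated: bool
    # the needed groups and rights have been moved to
    # wgManageWikiPermissionsAdditionalRights
    additional_rights_moved: bool


def missing_prerequisites(state: DeployState) -> List[str]:
    """List the steps still outstanding before ManageWikiPermissions is enabled."""
    missing = []
    if not state.permissions_populated:
        missing.append("run populateGroupPermissions.php")
    if not state.additional_rights_moved:
        missing.append("move rights to wgManageWikiPermissionsAdditionalRights")
    return missing


def can_enable_manage_wiki_permissions(state: DeployState) -> bool:
    """True only when every prerequisite step has been confirmed."""
    return not missing_prerequisites(state)


# Example: the state at the time of this incident (neither step done).
incident_state = DeployState(permissions_populated=False,
                             additional_rights_moved=False)
for step in missing_prerequisites(incident_state):
    print("outstanding:", step)
```

The check is deliberately order-only: it does not verify that the script succeeded or that the moved rights are correct, so it supplements review rather than replacing it.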

Meta

 * Who responded to this incident?
 * JohnLewis and Paladox
 * What services were affected?
 * MediaWiki (only metawiki)
 * Who, therefore, needs to review this report?
 * Ops (John)
 * Timestamp.
 * 19:05:18