Tech:Incidents/2016-02-02-MWExceptions

For an unknown reason mw1 got an invalid localisation cache. Since this server is used in production, wikis depending on various extensions with magic words went down if the requests weren't served by mw2 (which seemed fine).

Description
John and I have been working the evening before on making MediaWiki redundant. This process consisted of load balancing traffic between mw1 and mw2, and ensuring we had an NFS mount for mediawiki-static. Due to some small (but unrelated) NFS problems John re-installed mw2, and it was up to me to ensure MediaWiki would work (after installing a MediaWiki server we still manually need to build a localisation cache) on that. Running rebuildLocalisationCache.php on a server that doesn't contain a cache doesn't work (it ends with an error that the localisation cache is missing, ironic), so in the past I always used this small piece of code to have a temporary workaround. After a localisation cache exists I always removed that line afterwards, and this method always worked. However, since it could become annoying in the future to do that, I thought 453830f6961a42ca055a2b7054050932713e4c20 would work nice. In the past I had a lot of trouble with a similar piece of code, but I thought this worked (and in the beginning it seemed that it *did* work).

A few hours later, at 00:00:20, the first MWExceptions joined Miraheze: 2016-02-02 00:00:20 mw1 allthetropeswiki: [da053abd] /wiki/File:4-tails_2841.png  MWException from line 337 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'css' The following wikis were affected (along with exceptions count): root@mw1:/var/log/mediawiki/debuglogs# grep -a 'Error: invalid magic word' exception.log | awk '{print $4}' | cut -d":" -f1 | sort | uniq -c | sort -nr 5910 allthetropeswiki 878 spiralwiki 512 boulderwikiwiki 453 metawiki 381 poserdazfreebieswiki 274 aktposwiki 83 antiguabarbudacalypsowiki 69 testwiki 61 specialeducationwiki 46 tyrolmountainswiki 29 cssandjsschoolboardwiki 22 partupwiki 10 szkwiki 4 reviwiki 2 rawdatawiki 2 quantixwiki 2 developmentwiki 1 worldofkirbycraftwiki 1 webflowwiki 1 stoutofreachwiki 1 spiraltestwiki 1 poserdazfreebiestestwiki 1 micropediawiki
 * 1) 0 /srv/mediawiki/w/includes/MagicWord.php(262): MagicWord->load(string)
 * 2) 1 /srv/mediawiki/w/includes/parser/Parser.php(5212): MagicWord::get(string)
 * 3) 2 /srv/mediawiki/w/extensions/CSS/CSS.php(124): Parser->setFunctionHook(string, string)
 * 4) 3 [internal function]: wfCSSParserFirstCallInit(Parser)
 * 5) 4 /srv/mediawiki/w/includes/Hooks.php(201): call_user_func_array(string, array)
 * 6) 5 /srv/mediawiki/w/includes/parser/Parser.php(328): Hooks::run(string, array)
 * 7) 6 /srv/mediawiki/w/includes/cache/MessageCache.php(1035): Parser->firstCallInit
 * 8) 7 /srv/mediawiki/w/includes/cache/MessageCache.php(1064): MessageCache->getParser
 * 9) 8 /srv/mediawiki/w/includes/Message.php(1110): MessageCache->parse(string, Title, boolean, boolean, Language)
 * 10) 9 /srv/mediawiki/w/includes/Message.php(792): Message->parseText(string)
 * 11) 10 /srv/mediawiki/w/includes/Message.php(848): Message->toString
 * 12) 11 /srv/mediawiki/w/includes/page/ImagePage.php(403): Message->parse
 * 13) 12 /srv/mediawiki/w/includes/page/ImagePage.php(154): ImagePage->openShowImage
 * 14) 13 /srv/mediawiki/w/includes/actions/ViewAction.php(44): ImagePage->view
 * 15) 14 /srv/mediawiki/w/includes/MediaWiki.php(490): ViewAction->show
 * 16) 15 /srv/mediawiki/w/includes/MediaWiki.php(287): MediaWiki->performAction(ImagePage, Title)
 * 17) 16 /srv/mediawiki/w/includes/MediaWiki.php(714): MediaWiki->performRequest
 * 18) 17 /srv/mediawiki/w/includes/MediaWiki.php(508): MediaWiki->main
 * 19) 18 /srv/mediawiki/w/index.php(41): MediaWiki->run
 * 20) 19 {main}

At 06:48:13 the MWExceptions stopped because I rebuilt the localisation cache. A few hours later I depooled mw1 for troubleshooting.

None of us know what exactly happened, and why it happened. I think that this commit could have been the culprit here.

The sad part about this downtime is that it took 7 hours before someone actually noticed it and fixed the issue. System administrators should get notified if something happens, but since we have no monitoring (neither do we have a Varnish setup that automatically depooled mw1), no one took action unfortunately.

Timeline

 * 1 February
 * 19:32 Southparkfan: merged potential cause of incident
 * 2 February
 * 00:00 Southparkfan: first MWExceptions by mw1
 * 06:48 Southparkfan: rebuilt LC cache on mw1 for English language, problems fixed for now
 * 06:52 Southparkfan: reverted potential cause of issues
 * 14:56 Southparkfan: depooled mw1 for troubleshooting

Conclusions

 * MediaWiki failed to properly handle localisation cache issues
 * Miraheze visitors failed to notify Miraheze staff during issues
 * Miraheze staff failed to spot the issue in a reasonable time
 * Icinga failed to notify us, because we lack good monitoring
 * Varnish failed to mark mw1 as sick and depool it, because we have no (good) health checks

Actionables

 * #351: design a better localisation cache system
 * #77: add health checks to Varnish and ensure they work
 * #78: better monitoring for Icinga
 * Make Miraheze's visitors, volunteers and wiki founders aware how to contact the Miraheze System Administrators during trouble

Meta

 * Incident handled by: Southparkfan
 * Affected services: all web requests for the wikis listed above that were routed to mw1
 * Signature: Southparkfan (talk) 17:35, 3 February 2016 (UTC)