Tech:Incidents/2019-01-21-misc2

Summary
Provide a summary of the incident:
 * What services were affected?
 * misc2
 * How long was there a visible outage?
 * 47 minutes
 * Was it caused by human error, supply/demand issues, or something currently unknown?
 * Excessive bandwidth usage
 * Was the incident aggravated by human intervention, users, or the investigation?
 * No

Timeline
Provide a timeline of everything that happened, from the first reports to the resolution of the incident. If the time of the very first event is known (a previous incident, the time the service failed, the time a patch was applied), include it as well. Times should be in 24-hour format, UTC.

 * 14:24 information: Icinga reports timeouts for misc2 services.
 * 14:52 John: Confirms RamNode has suspended misc2.
 * 15:11 John: Deploys a temporary change switching the cache backend from Redis to the database, restoring service.
 * 15:32 John: Reverts the above change once misc2 is back in service.
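
The 15:11 mitigation above can be sketched as a MediaWiki LocalSettings.php change. This is a minimal illustration, not the production configuration; the Redis server address and the exact cache settings in use on misc2 are assumptions.

```php
// Normal configuration (illustrative): MediaWiki's main object cache
// backed by Redis on misc2. The server address below is hypothetical.
// $wgObjectCaches['redis'] = [
//     'class'   => 'RedisBagOStuff',
//     'servers' => [ 'misc2.example.org:6379' ],
// ];
// $wgMainCacheType = 'redis';

// Temporary mitigation: fall back to the database-backed cache so
// MediaWiki keeps serving requests while the Redis host is suspended.
$wgMainCacheType = CACHE_DB;
```

Reverting the mitigation, as done at 15:32, amounts to restoring the Redis-backed setting once the host is back in service.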

Quick facts
Provide any quick facts that may be relevant:
 * Are there any known issues with the service in production?
 * Memory usage issues and intermittent instability, but nothing that has caused long outages.
 * Was the cause preventable by us?
 * With more foresight, potentially; directly, not really.
 * Have there been any similar incidents?
 * None that have caused outages. mw1 and mw3 have been through the same bandwidth issues.

Conclusions
Provide conclusions that have been drawn from this incident only:
 * Was the incident preventable? If so, how?
 * Potentially, with more proactive bandwidth monitoring and prediction.
 * Is the issue rooted in our infrastructure design?
 * No.
 * State any weaknesses and how they can be addressed.
 * None.
 * State any strengths and how they prevented or assisted in investigating the incident.
 * None.

Meta

 * Who responded to this incident?
 * John
 * What services were affected?
 * Redis and, consequently, MediaWiki.
 * Who, therefore, needs to review this report?
 * Southparkfan
 * Timestamp: 17:00, 21 January 2019 (UTC)