Tech:Incidents/2019-01-11-redis-down

DRAFT: A configuration change to the redis systemd unit that should have been applied long ago was missing, and an unknown factor finally caused redis to fail. Redis and the services that depend on it were broken for about 35 minutes.

Summary
Provide a summary of the incident:
 * What services were affected?
 * All services dependent on redis. (MediaWiki sessions/login + JobRunner)
 * How long was there a visible outage?
 * From 14:17 to 14:52, so about 35 minutes.
 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Initially human error: a change to the systemd unit should have been introduced earlier. However, redis ran fine for months without that change, so something unknown finally triggered the error.
 * Was the incident aggravated by human contact, users or investigating?
 * No.

Timeline

 * 14:15: paladox reboots misc2 to free up RAM, which was full
 * 14:17: redis cannot save its database to disk (read-only file system) and therefore refuses to save keys
 * 14:36: Southparkfan notices login is broken
 * 14:43: Southparkfan assumes redis is full and introduces patches to MediaWiki core to decrease memory usage
 * 14:46: Southparkfan notices the actual issue is redis not being able to write to the database
 * 14:52: Southparkfan disables puppet, introduces a patch to the systemd file and restarts redis; redis is back online with the database more or less intact

Conclusions

 * A missing configuration change (ReadWriteDirectories in the systemd unit file) should have been applied by June 2018 at the latest, when paladox enabled syncing the database to disk every 60 seconds
 * For an unknown reason, there were no issues until more than 6 months later, when paladox rebooted misc2 and redis finally refused to save keys to its cache
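
The failure mode described above (redis unable to write its database to disk and then refusing to save keys) is visible in the output of the redis INFO command, where the persistence section reports the field rdb_last_bgsave_status as "ok" or "err". The following is only a minimal sketch of how such a check could look, not something that was in place during this incident. The function names are hypothetical, and fetch_persistence_info assumes redis-cli is installed and can reach the server without authentication.

```python
import subprocess


def parse_redis_info(text):
    """Parse the text returned by the redis INFO command into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Persistence".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def persistence_ok(info_text):
    """Return True only when the last background save is reported as ok."""
    fields = parse_redis_info(info_text)
    return fields.get("rdb_last_bgsave_status") == "ok"


def fetch_persistence_info():
    """Hypothetical helper: read the persistence section via redis-cli.

    Assumes redis-cli is on PATH and no password is required.
    """
    result = subprocess.run(
        ["redis-cli", "INFO", "persistence"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A monitoring check could call persistence_ok(fetch_persistence_info()) and alert when it returns False. A missing field is treated as a failure, so an empty or unexpected reply is reported instead of passing silently.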

Actionables

 * Apply configuration change to systemd: ✅ 784d8e2ec7ab35c2de11a1f4b0c2cc83f356061d

Meta

 * Who responded to this incident?
 * Paladox and Southparkfan
 * What services were affected?
 * MediaWiki (sessions/login) and JobRunner.
 * Who, therefore, needs to review this report?
 * John or Southparkfan.
 * Timestamp.