Tech:Incidents/2018-11-09-data-loss-on-nonsensopediawiki

Summary
Provide a summary of the incident:


 * What services were affected?
 * Files (MediaWiki static files, stored on LizardFS)


 * How long was there a visible outage?
 * Possibly 1-2 days (no accurate timing is available, as the loss was only discovered now).


 * Was it caused by human error, supply/demand issues or something unknown currently?
 * Yes, it was caused by human error.


 * Was the incident aggravated by human contact, users or investigating?
 * No.

Quick facts
Provide any relevant quick facts that may be relevant:
 * Are there any known issues with the service in production?
 * Yes, Bacula. We are not backing up static files correctly: backups exist only in a form that would let us restore everything at once, not individual wikis.


 * Was the cause preventable by us?
 * Yes. This could have been prevented by double-checking commands before running them (even when you think they are okay).
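One low-tech way to double-check destructive commands is to dry-run them first. A minimal sketch of that workflow (the path below is illustrative, not the one involved in this incident):

```shell
# Hypothetical target; in practice this would be the real path being deleted.
target="$(mktemp -d)/nonsensopediawiki-old"
mkdir -p "$target"

# 1. Print the command instead of running it, to review exactly what will execute.
echo rm -r "$target"

# 2. List what the path actually resolves to (catches typos and bad globs).
ls -ld "$target"

# 3. Only after both checks look right, run the real command.
rm -r "$target"
```

On GNU systems, `rm -I` adds a confirmation prompt when deleting recursively, which gives the same second look with less ceremony.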


 * Have there been any similar incidents?
 * No.

Conclusions
Provide conclusions that have been drawn from this incident only:
 * Was the incident preventable? If so, how?
 * Yes, by taking more care when deleting files: verify the target of a destructive command before running it.


 * Is the issue rooted in our infrastructure design?
 * No.


 * State any weaknesses and how they can be addressed.
 * Backups: because we back up LizardFS chunks rather than individual files, we could only restore all wikis or none at all. This can be addressed by backing up static files at the file level, per wiki (see T3782).


 * State any strengths and how they prevented or assisted in investigating the incident.

Actionables
List all the things we can do immediately (or in our current state) to prevent this from occurring again. Include links to Phabricator issues, which should go into more detail; these should only be one-line notes, e.g. ": Monitor service responses with GDNSD and pool/depool servers based on these."

T3782 Back up static correctly in a way that allows individual restores for wikis.
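T3782 could be implemented, for example, with one Bacula FileSet per wiki so that each wiki's static files can be backed up and restored independently. A hypothetical sketch (the FileSet name and path are assumptions, not our actual configuration):

```
# One FileSet per wiki; restoring this FileSet touches only this wiki's files.
FileSet {
  Name = "static-nonsensopediawiki"     # assumed naming scheme
  Include {
    Options {
      signature = MD5
      compression = GZIP
    }
    File = /mnt/mediawiki-static/nonsensopediawiki   # assumed mount path
  }
}
```

Each such FileSet would then be referenced by its own Job, so an individual wiki can be restored without pulling back the entire static tree.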

Meta

 * Who responded to this incident?
 * Paladox


 * What services were affected?
 * Files (mediawiki)


 * Who, therefore, needs to review this report?
 * Operations


 * Timestamp.