Tech talk:Incidents

https://meta.miraheze.org/wiki/Talk:Miraheze#The_cost_of_downtime

Back up and publish the incident reports to another server
These reports should be available when Miraheze is down.

Please provide an alternative URL for when Miraheze is unavailable.

Please identify all URLs that should be available at a secondary site when Miraheze is down.

Please make sure that all system notices have clickable URLs to get to the details of the notice, and other important information.

Do we really need the incident report table?
We will experience incidents now and then, but do we really need to document every single one? A descriptive title with a list of subpages should be enough (as it is now). &mdash; revi  18:04, 15 November 2017 (UTC)
 * Removed. &mdash; revi  07:29, 21 November 2017 (UTC)

Where should brief incidents be posted?
[06:42]  did meta have a brief outage recently?
[06:43] <+Reception123> Rob_Sterbal: yes, there was a 3 minute outage earlier today due to a bad configuration change
[06:43] * videojeux4 (uid240770@miraheze/videojeux4) has joined #miraheze
[06:44]  where should I document that on meta?
[06:45] <+Reception123> I think you should read comments on https://meta.miraheze.org/wiki/User_talk:Rsterbal/What_belongs_in_Meta%3F
[06:45]  I have
[06:46]  I don't think I should post something there
[06:46]  I simply asked where a 3 minute outage should be reported on Meta
[06:47]  Instead you posted a link that I don't think is helpful.

Rsterbal (talk) 11:49, 9 December 2017 (UTC)

Please note regarding the suggestion of looking things up on icinga:

[08:22]  it looks like the link to icinga requires authentication: The tool can be accessed here: https://icinga.miraheze.org (authentication required)
[08:23]  https://meta.miraheze.org/wiki/Tech:Icinga

By the way: [08:25] <+Reception123> Rob_Sterbal: you can access it by using the username "guest" and no password

Rsterbal (talk) 13:24, 9 December 2017 (UTC)
 * What is the point of recording 1 minute or 40 seconds of downtime on Incidents? What does it achieve? From the conclusions at Tech:Incidents/Template...


 * Was the incident preventable? If so, how?
 * Is the issue rooted in our infrastructure design?
 * State any weaknesses and how they can be addressed.
 * State any strengths and how they prevented or assisted in investigating the incident.
 * What do we learn from the 1 minute downtime - i.e. a bad configuration change - other than 'do it better next time'? Unless you can give me a good reason with merit (merit here means 'we can improve this and prevent this kind of outage'), I do not think such downtime is worth recording on Incidents. Of course, you are free to document it in your own userspace, or on your own wiki. &mdash; revi  11:07, 11 December 2017 (UTC)