Mistakes happen, but shrugging them off by blaming people or urging them to be more careful is counterproductive. Instead, you want to find the organizational issues that made them possible in the first place.
Interesting paper about metastable failures and novel approaches to analysing them. It's early days, but we need to move toward better prevention.
There are a couple of flaws in this article, I think. For instance, the benchmark part looks fishy to me. It's also a bit opinionated and goes too far in advocating exceptions at the expense of error values.
Still, I think it shows quite well that we can't do without exceptions entirely, even when error values are available. In my opinion, we're still learning how to combine both cleverly in a code base.
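To make the "both have a place" point concrete, here is a minimal Python sketch. It is my own illustration, not taken from the article: `Ok`, `Err`, `parse_port` and `port_from_config` are hypothetical names, and the split (bad input as a value, broken assumptions as an exception) is just one possible convention.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ok:
    value: int


@dataclass(frozen=True)
class Err:
    reason: str


def parse_port(text: str) -> Union[Ok, Err]:
    """Bad input is an expected failure, so it is reported as a value."""
    if not (text.isascii() and text.isdigit()):
        return Err(f"not a number: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        return Err(f"out of range: {port}")
    return Ok(port)


def port_from_config(config: dict) -> Union[Ok, Err]:
    """Assumption of this sketch: a missing "port" key is a bug, not user input.

    The lookup therefore raises KeyError instead of returning an Err.
    """
    return parse_port(config["port"])
```

The caller has to inspect `Ok` or `Err` for the failures it is expected to handle, while the `KeyError` still travels up the stack on its own. Even in a code base built around error values, the exception path does not go away.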
Retries are becoming commonplace for dealing with transient errors. That said, they can hinder recovery from longer failures because they amplify the load on a system that is already struggling. There are options on the table to solve this, though.
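As a rough illustration of what such options can look like, here is a minimal Python sketch combining two commonly used mitigations: capped exponential backoff with jitter, and a retry budget that limits retries to a fraction of regular calls. These are my assumptions about typical techniques, not necessarily the ones the article proposes. `TransientError`, `RetryBudget` and `call_with_retries` are hypothetical names, and the default values are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


class RetryBudget:
    """Token bucket: each first attempt earns a fraction of a retry."""

    def __init__(self, ratio=0.1, initial=1.0, max_tokens=10.0):
        self.ratio = ratio            # tokens earned per regular call
        self.tokens = initial         # retries currently allowed
        self.max_tokens = max_tokens  # cap so idle periods cannot hoard retries

    def record_call(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_retries(op, budget, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Run op(), retrying TransientError while attempts and budget allow."""
    budget.record_call()
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up instead of adding load to a failing service
            # Full jitter: random wait up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

During a long outage the budget runs dry quickly, so most calls fail fast instead of multiplying traffic. The jitter spreads out the retries that do happen, so clients do not all come back at the same instant.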
A nice zine introducing the topic of faults and failures in distributed systems.
Strange things do happen when the hardware fails... indeed the systemd open question at the end is mysterious.
This is no reason to skip the effort of making error messages as good as possible. Still, there's some truth to the idea that trying to write a really useful error message is a fool's errand.
This is definitely an ambiguous term. You need to know where the people using it stand in order to figure out the exact meaning of "root cause".
Very good piece. It explains why postmortems are important, how to prepare your organization to conduct them, and how to do them properly. This is important since there will be a lot of pressure when a failure occurs.
OK, definitely a gutsy move... Still, this is an interesting approach for a complex system. Better to have a controlled early failure, if you can get it, than a complete collapse later on. This might be just the incentive you need for real organizational change.
A bit heavy-handed in the way it tries to paint Root Cause Analysis as evil. Still, it makes good points about its shortcomings. In particular, I appreciate the emphasis on complexity, which indeed implies multiple contributing factors and unexpected outcomes. Definitely things to keep in mind for any postmortem effort.
Interesting story on how power plays can sometimes completely hide the fate of a project until it's too late. Definitely a cautionary tale.
If you want to get to the bottom of a problem and understand why an accident happened, people need psychological safety. This is indeed necessary if you want them to share truthfully what led to the accident in the first place. Otherwise, fear will drive the conversation and hide important facts.