Very Rust-focused, but still an interesting debate. It gives a good overview of how different kinds of locks behave on failure. It strongly advocates the poisoning approach, which is indeed an interesting one (coming with its own tradeoffs, of course).
Our industry's track record hasn't improved in decades. But there are real consequences for users. Some more ethics would be welcome in our profession.
A bit of a shameless plug toward the end. That said, the explanations of why Cloudflare is banking so much on Rust, and of how the recent downtime could have been avoided, are spot on.
Error handling is not easy. Having simple rules to apply to complex systems is a good thing. Of course, the difficulty is applying them consistently.
Interesting point of view. Indeed, you probably want things *not* to be available 100% of the time. This forces you to see how resilient things really are.
Indeed, depending on the ecosystem it's more or less easy. Let's remember that error handling is one of the hard problems to solve.
If it fails for everyone then it's not a bad choice on your part, right?
Everyone makes mistakes eventually; the real difference is in how you deal with them.
Clearly the error-handling landscape in Rust is still evolving, and that's a good thing. The current solutions are too fragmented at the moment.
Matrix.org - How we discovered, and recovered from, Postgres corruption on the matrix.org homeserver
Wow, this was a really bad index corruption indeed.
Mistakes happen, but shrugging them off by blaming people or pushing them to be more careful is counter-productive. Instead, you want to find the organizational issues that made them possible in the first place.
Interesting paper about metastable failures and novel approaches to analyzing them. It's early days, but we need to move toward better prevention.
A couple of flaws in this article, I think. For instance, the benchmark part looks fishy to me. It's also a bit opinionated and goes too far in advocating exceptions at the expense of error values.
Still, I think it shows quite well that we can't do entirely without exceptions, even when error values are available. In my opinion, we're still learning how both can be used cleverly in a code base.
Retries are becoming commonplace for dealing with transient errors. That said, they can hamper recovery from longer failures due to amplification. There are options on the table to solve this, though.
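One of the usual mitigations for that amplification is to cap the retry schedule: delays grow exponentially but are clamped at a ceiling, and the number of attempts is bounded, so a long outage doesn't turn into an unbounded retry storm. A minimal sketch (the function name and parameters are illustrative, not from any particular library; real implementations also add jitter to spread clients out):

```rust
use std::time::Duration;

// Capped exponential backoff schedule: base * 2^attempt, clamped at `cap_ms`,
// with a hard bound on the number of attempts (the "retry budget").
// Jitter is deliberately omitted here to keep the sketch deterministic.
fn backoff_delays(base_ms: u64, cap_ms: u64, max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts)
        .map(|attempt| {
            let exp = base_ms.saturating_mul(1u64 << attempt.min(16));
            Duration::from_millis(exp.min(cap_ms))
        })
        .collect()
}

fn main() {
    for (i, d) in backoff_delays(100, 2_000, 6).iter().enumerate() {
        println!("attempt {}: wait up to {:?}", i + 1, d);
    }
}
```

With a base of 100 ms and a 2 s cap, the schedule is 100, 200, 400, 800, 1600, 2000 ms: the cap bounds the per-attempt wait, and the attempt limit bounds the total load a failing dependency sees.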
A nice zine introducing the topic of faults and failures in distributed systems.
Strange things do happen when hardware fails... and indeed, the open systemd question at the end is mysterious.
That's no reason not to put effort into having error messages that are as good as possible. Still, there's some truth to the idea that trying to have a really useful error message is a fool's errand.
This is definitely an ambiguous term. You need to know where the people employing it stand in order to figure out the exact meaning of "root cause".
Very good piece. It explains why postmortems are important, how to prepare your organization to conduct them, and how to do them properly. This matters because there will be a lot of pressure when a failure occurs.
OK, definitely a gutsy move... Still, this is an interesting approach for a complex system. Better to have a controlled early failure, if you can get it, than a complete collapse later on. It might be just the incentive you need for real organizational change.