Search: [reliability] - ervin's web review

On Metastable Failures and Interactions Between Systems – Aleksey Charapko

A good explainer on what metastable failures are and how to try to mitigate them.

tech · distributed · failure · reliability

December 25, 2025 at 10:46:30 AM GMT+1 · permalink

·

https://charap.co/on-metastable-failures-and-interactions-between-systems/

·

Norris Numbers

This is something I've definitely seen. There is clearly a threshold effect in the amount of code you have to manage. Solutions that work at smaller amounts stop working a couple of orders of magnitude higher, and vice versa.

tech · quality · reliability · maintenance · team · organisation

December 18, 2025 at 12:41:48 PM GMT+1 * · permalink

·

https://www.teamten.com/lawrence/writings/norris-numbers.html

·

pgFirstAid - PostgreSQL Health Check

Looks like a nice kit to add to your tool belt. It runs some handy checks if you have a Postgres database to manage.

tech · databases · postgresql · reliability · performance · health

November 30, 2025 at 9:43:44 AM GMT+1 · permalink

·

https://randoneering.tech/blog/pgfirstaid/pgfirstaid/

·

How Cloudflare uses Rust to serve (and break) millions of websites at 50+ millions requests per second

A bit of a shameless plug toward the end. That said, the explanations of why Cloudflare is banking so much on Rust and how the recent downtime could have been avoided are spot on.

tech · cloudflare · rust · reliability · failure

November 27, 2025 at 10:21:42 AM GMT+1 * · permalink

·

https://kerkour.com/how-cloudflare-uses-rust

·

What Now? Handling Errors in Large Systems

Error handling is not easy. Having simple rules to apply to complex systems is a good thing. Of course, the difficulty is applying them consistently.

tech · reliability · failure · complexity

November 22, 2025 at 11:04:13 AM GMT+1 · permalink

·

https://brooker.co.za/blog/2025/11/20/what-now.html

·

Brownouts reveal system boundaries

Interesting point of view. Indeed, you probably want things to not be available 100% of the time. This forces you to see how resilient things really are.

tech · infrastructure · reliability · failure · resilience

November 21, 2025 at 11:44:58 AM GMT+1 * · permalink

·

https://jyn.dev/brownouts-reveal-system-boundaries/

·

Cursed Knowledge

Interesting approach for a project: collecting the traps found in its dependencies like this.

tech · reliability · communication · dependencies

August 9, 2025 at 7:58:13 AM GMT+2 * · permalink

·

https://immich.app/cursed-knowledge/

·

The Nuanced Reality of Throttling: It's Not Just About Preventing Abuse

This is a good look at the reasons behind throttling. If you accept a less naive model than "preventing abuse", you can build a better throttling strategy.

tech · api · services · reliability

June 17, 2025 at 8:39:40 AM GMT+2 * · permalink

·

https://blog.joemag.dev/2025/06/the-nuanced-reality-of-throttling-its.html

·

Is Winter Coming?

If the funding dries up... we'll have another AI winter on our hands indeed.

tech · ai · machine-learning · gpt · reliability · business

May 20, 2025 at 10:50:54 AM GMT+2 * · permalink

·

https://www.datagubbe.se/winter/

·

Using unwrap() in Rust is Okay

I find the title misleading. Still, this is a good exploration of how to treat unwrap() and expect() in Rust code.

tech · rust · reliability · safety

May 17, 2025 at 10:18:45 PM GMT+2 * · permalink

·

https://burntsushi.net/unwrap/

·

Stability by Design

Illustrated with the Clojure ecosystem, but there's nothing inherently specific to the language here. If you want to ensure stability for your users, you need to manage your APIs properly, and this article puts forward a couple of interesting ideas.

tech · clojure · reliability · api · library

May 9, 2025 at 10:04:41 AM GMT+2 * · permalink

·

https://potetm.com/devtalk/stability-by-design.html

·

You can’t prevent your last outage, no matter how hard you try

This is true almost by definition. The post-mortem needs to be carefully crafted to also look at previous incidents and the actions taken to mitigate them.

tech · reliability

May 5, 2025 at 8:39:54 AM GMT+2 * · permalink

·

https://surfingcomplexity.blog/2025/05/04/you-cant-prevent-your-last-outage-no-matter-how-hard-you-try/

·

What if we embraced simulation-driven development?

At some point the complexity is high enough that you indeed need more tools than only handcrafted tests to discover bugs.

tech · tests · distributed · reliability · simulation · complexity

April 29, 2025 at 10:11:01 AM GMT+2 * · permalink

·

https://pierrezemb.fr/posts/simulation-driven-development/

·

Post Apocalyptic Computing

Interesting rambling and exploration. What would a computer built to last a century look like?

tech · low-tech · history · reliability

March 25, 2025 at 10:03:12 AM GMT+1 * · permalink

·

https://thomashunter.name/posts/2025-03-23-post-apocalyptic-computing

·

Groundbreaking BBC research shows issues with over half the answers from Artificial Intelligence (AI) assistants

Interesting research, looking forward to the follow-ups to see how it evolves over time. For sure, the number of issues is still way too high to build trustworthy systems around search and news.

tech · ai · machine-learning · gpt · reliability · research

February 18, 2025 at 9:56:49 AM GMT+1 * · permalink

·

https://www.bbc.com/mediacentre/2025/bbc-research-shows-issues-with-answers-from-artificial-intelligence-assistants

·

Does current AI represent a dead end?

This highlights quite well the limits of the models used in LLMs.

tech · ai · machine-learning · gpt · complexity · emergence · reliability

December 28, 2024 at 4:21:24 PM GMT+1 · permalink

·

https://www.bcs.org/articles-opinion-and-research/does-current-ai-represent-a-dead-end/

·

Be Suspicious of Success

This is an important trait for a developer to have. If you're content with things working without knowing why and how they work, you're in for a world of pain later.

tech · reliability · tests · debugging

October 17, 2024 at 9:16:24 AM GMT+2 * · permalink

·

https://buttondown.com/hillelwayne/archive/be-suspicious-of-success/

·

Modes Considered Harmful

Interesting point. You likely need to be careful with fallback modes, especially in distributed systems. They might bring even more issues when the system is already under stress.

tech · distributed · reliability

October 2, 2024 at 10:43:34 AM GMT+2 * · permalink

·

https://a-nickels-worth.dev/posts/modesharm/

·

Resilient Microservice Applications, by Design, and without the Chaos

I'm obviously not in love with the complexity this type of architecture brings. That being said, this thesis brings an interesting approach to better detect failure scenarios in such systems.

tech · microservices · reliability · research · architecture

September 27, 2024 at 11:24:23 AM GMT+2 * · permalink

·

https://christophermeiklejohn.com/publications/cmeiklej_phd_s3d_2024.pdf

·

Why Playwright is less flaky than Selenium

Interesting reason which would explain the Selenium flakiness. It's just harder to write tests with race conditions using Playwright.

tech · web · frontend · tests · performance · reliability

August 30, 2024 at 3:11:40 PM GMT+2 * · permalink

·

https://justin.searls.co/links/2024-08-29-why-playwright-is-less-flaky-than-selenium/

·