In my opinion, error handling is still not a properly solved problem. At least the Rust community discusses the topic quite a bit, which I think is good inspiration for other ecosystems as well.
Looks like a nice tool to monitor your Proxmox install.
How to avoid drowning in errors when getting serious about monitoring. Finding classes of errors and treating them one by one will definitely help.
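A minimal sketch of what "finding classes of errors" could look like in practice, assuming the errors arrive as plain log messages. The `fingerprint` and `error_classes` helpers and the sample messages are hypothetical and not taken from the linked article; the idea is simply to collapse the variable parts of each message so that similar errors share one key, then rank the classes by frequency.

```python
import re
from collections import Counter


def fingerprint(message: str) -> str:
    """Collapse variable parts (hex ids, numbers) so similar errors share one key."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def error_classes(messages):
    """Return (fingerprint, count) pairs, most frequent class first."""
    return Counter(fingerprint(m) for m in messages).most_common()


# Hypothetical sample data, for illustration only.
sample = [
    "timeout after 30s contacting db-1",
    "timeout after 45s contacting db-2",
    "disk /dev/sda1 at 91% capacity",
    "timeout after 30s contacting db-7",
]

for key, count in error_classes(sample):
    print(count, key)
```

Working down the resulting list from the top is one way to treat the classes one by one, starting with the noisiest.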
This looks like an interesting OS level monitoring solution.
Looks like a nifty little tool for sending notifications from a script to your phone or a similar device.
This looks like an interesting intrusion detection tool. I like the overall approach they chose.
A neat little catalogue of monitoring tools on Linux. I learned about a couple I didn't know.
I obviously didn't read it all, but this is a very large repository of practices from many companies that can serve as inspiration for Site Reliability Engineering work. It is especially comprehensive since it covers not only technical tips but also hiring, team building and culture (which are almost as important, if not more so).