Interesting experiment on how to totally break the performance of memory accesses. This gives good insight into how the whole chain works.
Nice exploration of floating point arithmetic all the way down to the silicon.
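To make the "all the way down" part a bit more concrete, here is a minimal sketch that splits a number into its sign, exponent and fraction fields. It assumes Python floats are IEEE 754 binary64 (true on practically every platform) and only handles normal numbers; the `decompose` helper is a made-up name for illustration, not something from the linked article.

```python
import struct

def decompose(x):
    """Split a float (assumed IEEE 754 binary64) into
    (sign, unbiased exponent, 52-bit fraction field)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    # 11 exponent bits, stored with a bias of 1023 (normal numbers only).
    exponent = ((bits >> 52) & 0x7FF) - 1023
    # Low 52 bits: the fraction, with an implicit leading 1 for normal numbers.
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

sign, exponent, fraction = decompose(0.1)
print(sign, exponent, hex(fraction))

# 0.1 and 0.2 are not exactly representable in binary, so the sum drifts.
print(0.1 + 0.2 == 0.3)
```

The repeating pattern in the fraction of 0.1 is the binary counterpart of 1/3 being 0.333... in decimal, which is why the last comparison comes out false.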
Wondering about NPU architecture and how these chips work? This is a good in-depth reference article, I think.
Not all CPUs are born equal in terms of branch prediction. Interesting little benchmark.
Nice little introduction to the fascinating world of very large binaries.
Maybe we have a path forward for performant stackful coroutines? More pieces need to fall into place but this looks promising.
Interesting to see how it behaves in practice when passing parameters by value. Turns out there are surprising patterns in the data.
If you're wondering why the architecture is called "amd64" and why the Itanium disappeared... this is why. It was a very good stunt from AMD back then.
Interesting trend in the CPU space. We're getting more simultaneous instructions with the passing generations.
Indeed, CPU prefetchers are really good nowadays. Now you know what to do to keep your code fast.
SIMD instructions are indeed a must to get decent performance on current hardware.
A good example of how you can get bitten by cache coherency algorithms in the CPU.
A bit dated perhaps, and yet most of the lessons in here are still valid. If performance and parallelism matter, you'd better keep an eye on how the cache is used.
Nice trick for highly performance-sensitive data structures. By making data CPU-local instead of thread-local, you can build a mechanism which is especially cache-friendly.
Nice exploration of the microcode patching flaw which was disclosed recently. This gives a glimpse of the high level of complexity the x86 family brings to the table.
Nice primer on the impact of too many branches in your code on the CPU. Being mindful of that is sometimes a good way to boost performance.
It's interesting to see such a reverse engineering of this infamous bug straight from the gate layout.
Fascinating research about side-channel attacks. Learned a lot about them and website fingerprinting here. Also interesting are the explanations of how machine learning models can get in the way of properly understanding which side channel an attack really uses, which in turn can prevent the development of actually useful countermeasures.
Data layout is essential for performance reasons. It is too often overlooked. If you want real speed you need to help the memory subsystem.
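One small facet of this, sketched here with `ctypes` so the C-style layout rules are visible from Python: simply reordering fields can change how much padding a record carries, and so how many records fit in a cache line. The `Scattered` and `Grouped` structures are hypothetical examples, not taken from the linked article, and the exact sizes depend on the platform ABI.

```python
import ctypes

class Scattered(ctypes.Structure):
    # small, large, small: padding is typically inserted after each
    # one-byte field so that the double stays naturally aligned.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("value", ctypes.c_double),
        ("kind", ctypes.c_uint8),
    ]

class Grouped(ctypes.Structure):
    # large field first, small fields packed together afterwards.
    _fields_ = [
        ("value", ctypes.c_double),
        ("flag", ctypes.c_uint8),
        ("kind", ctypes.c_uint8),
    ]

# Same information, potentially different footprint per record.
print(ctypes.sizeof(Scattered), ctypes.sizeof(Grouped))
```

On common 64-bit platforms the second layout comes out smaller, which means more records per cache line when iterating over an array of them.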
Nice list of common portability issues one can encounter at the machine architecture level. But don't be fooled, this doesn't only have implications for C and C++; those problems leak into higher-level languages as well.
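As a small illustration of how these issues surface in a higher-level language, here is a sketch using Python's `struct` module. The value and variable names are arbitrary choices for the example; byte order and integer width are just two of the usual suspects, picked here as an assumption about what such a list typically covers.

```python
import struct
import sys

value = 0x01020304

# The same 32-bit integer serialized with explicit byte orders.
little = struct.pack("<I", value)
big = struct.pack(">I", value)
print(sys.byteorder, little.hex(), big.hex())

# Reading little-endian bytes as if they were big-endian silently
# yields a different number: no error, just wrong data.
misread = struct.unpack(">I", little)[0]
print(hex(misread))

# In native mode the size of a C long and of a pointer depends on the
# platform ABI, so formats relying on them are not portable.
print(struct.calcsize("l"), struct.calcsize("P"))
```

Spelling out the byte order and using fixed-size format codes in any on-disk or on-wire format avoids depending on whatever the host machine happens to use.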