Nice article. It's a good reminder that the benchmarks used to evaluate generative AI systems have many caveats.