Anthropic can now track the bizarre inner workings of a large language model

This is very interesting research. It confirms that LLMs can't be trusted when they describe their own inference. The simple maths example is particularly striking: the actual inference and what the model outputs when asked about its inference process are completely different.

Now for the topic dearest to my heart: it looks like there is some form of concept graph hiding in there that is reused across languages. We don't yet know whether a particular language influences that graph. I don't expect the current research to explore this question yet, but I'm looking forward to someone tackling it.

tech · ai · machine-learning · gpt · research · language

April 11, 2025 at 8:37:35 AM GMT+2

https://www.technologyreview.com/2025/03/27/1113916/anthropic-can-now-track-the-bizarre-inner-workings-of-a-large-language-model/
