Anthropic publishes two papers back to back: tracking how large models think with an AI 'microscope'

Anthropic has developed an AI "microscope" that, by tracking activity patterns inside Claude's neural network, reveals for the first time the model's thinking process and the paths along which information flows. The study finds that Claude plans its output ahead of time, shares concepts across multiple languages, runs parallel computational paths, and performs multi-step reasoning, rather than simply generating text one word at a time. Through intervention experiments, the team also uncovers Claude's internal mechanisms when it "hallucinates", refuses to answer, or faces jailbreak attacks, offering a new way to improve the reliability of AI systems.
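To make the idea of an intervention experiment concrete, here is a minimal sketch, not Anthropic's actual tooling, of the general technique: record an internal activation from one forward pass, then overwrite ("patch") that activation during another pass and observe how the output changes. The toy model, layer choice, and inputs below are all hypothetical stand-ins for illustration.

```python
# Illustrative sketch only: a toy activation-patching intervention, not
# Anthropic's published method. All names and shapes are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny two-layer network standing in for one block of a large language model.
model = nn.Sequential(
    nn.Linear(8, 16),   # layer 0: produces the hidden activation we inspect
    nn.ReLU(),
    nn.Linear(16, 4),   # final layer: maps the hidden state to output logits
)

captured = {}

def capture_hook(module, inputs, output):
    # Save the "clean" activation so it can be patched in later.
    captured["h"] = output.detach().clone()

def patch_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return captured["h"]

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1) Clean run: record the hidden activation at layer 0.
handle = model[0].register_forward_hook(capture_hook)
clean_out = model(clean_input)
handle.remove()

# 2) Corrupted run without intervention, for comparison.
corrupted_out = model(corrupted_input)

# 3) Corrupted run with the clean activation patched in at layer 0.
handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

print("clean     :", clean_out)
print("corrupted :", corrupted_out)
print("patched   :", patched_out)  # how much does restoring one activation recover?
```

Comparing the patched output with the clean and corrupted outputs indicates how much that single internal activation contributes to the final behavior, which is the basic logic behind probing a model's internal information flow.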
