March 11th.OpenAI today announced its latest research, which attempts to use CoT (Chain of Thought) monitoring to monitor the "thoughts" of reasoning models, thereby preventing AI from Large Modelbehaviors such as babbling, hiding true intentions, etc., and it is also one of the effective tools for monitoring supermodels.
Note: AI occasionally spouts nonsense, which we generally call AI hallucinations. In addition to this, AI can be lazy or deliberately bad, such as tricking the user or simply giving up when a problem is too difficult.
OpenAI used the newly released cutting-edge model o3-mini as the monitored object and the weaker GPT-4o model as the monitor. The test environment was a coding task that required the AI to implement functionality in the codebase to pass unit tests.
The results show that CoT monitors are excellent at detecting systematic "reward hacking" behavior, with a recall rate of 951 TP3T, far exceeding the 601 TP3T for monitoring behavior alone.