OpenAI's Reward System Inadvertently Scores Thinking Chains on 6 Models Including GPT-5.4

According to OpenAI's alignment team, the company recently discovered a critical training error affecting six large language models, including GPT-5.4 Thinking: the reward mechanism inadvertently scored model thinking chains, the internal reasoning a model produces before generating its answer. GPT-5.5 was not affected. The incident violates a fundamental AI safety principle: thinking chains must never be evaluated during training, because doing so could incentivize models to fabricate reasoning in order to earn higher scores.

The faulty scoring system incorrectly included thinking chains when assessing whether responses were useful or whether models had been compromised by attacks. Affected training samples represented at most 3.8% of the dataset. OpenAI has patched the vulnerability and run comparative experiments confirming the models did not develop deceptive behaviors. The company has also deployed an automated scanning system across all training pipelines to prevent a recurrence.
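The fix described above amounts to ensuring the reward model only ever sees the final answer, never the chain-of-thought. The following is a minimal sketch of that separation; the `<think>` tags, function names, and reward-model interface are illustrative assumptions, not OpenAI's actual implementation.

```python
# Hypothetical sketch: strip chain-of-thought spans from a completion
# before it reaches the reward model. Tag names are assumptions.

def strip_thinking_chain(completion: str,
                         open_tag: str = "<think>",
                         close_tag: str = "</think>") -> str:
    """Remove every <think>...</think> span, keeping only visible output."""
    out = []
    pos = 0
    while True:
        start = completion.find(open_tag, pos)
        if start == -1:
            out.append(completion[pos:])  # no more thinking spans
            break
        out.append(completion[pos:start])
        end = completion.find(close_tag, start)
        if end == -1:
            break  # unterminated chain: drop the remainder entirely
        pos = end + len(close_tag)
    return "".join(out)


def score_response(completion: str, reward_model) -> float:
    # The bug: scoring `completion` directly rewards the reasoning text.
    # The fix: score only what the user actually sees.
    return reward_model(strip_thinking_chain(completion))
```

A guard like this sits naturally at the boundary between the rollout pipeline and the reward model, so no downstream scorer (helpfulness, attack detection, or otherwise) can ever observe the reasoning trace.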

