Centre for Long-Term Resilience finds a 5x increase in scheming-related AI incidents
It has long been theorised that AI systems may pursue harmful goals in ways that evade oversight or control. In the worst case, this type of behaviour – sometimes known as ‘scheming’ – could lead to catastrophes.
Today's AI agents are engaged in lower-stakes use cases, but if the capability and propensity to scheme emerge and go unaddressed, future agents could end up scheming in extremely high-stakes domains, such as military or critical national infrastructure contexts.
Our understanding of this risk has so far been limited to observations in experiments. While these experiments have raised important alarms, they have also faced legitimate criticism: the experimental set-ups are sometimes contrived, and their relevance to real-world deployments is uncertain.
As AI capabilities continue to grow, so will the need for better visibility into whether and how scheming is materialising in the real world. Such visibility is crucial for scientific understanding, effective policy development, and emergency response. This is why we created the Loss of Control Observatory – the first capability of its kind to systematically detect and monitor ‘AI scheming’ behaviours across all AI models in deployment.
Today, we are launching a major report that publishes findings from the first five months of the Observatory.
[...] The trend is striking. The number of credible scheming-related incidents increased 4.9x over the collection period, a statistically significant increase that far outpaced the 1.7x growth in overall online discussion of scheming, and the 1.3x growth in general negative discussion about AI. This surge coincided with the release of a wave of more capable, more agentic AI models and frameworks from major developers.
While we did not detect catastrophic scheming incidents, the behaviours we observed nonetheless demonstrate concerning precursors to more serious scheming, such as a willingness to disregard direct instructions, circumvent safeguards, lie to users, and single-mindedly pursue a goal in harmful ways.
[Source]: The Centre for Long-Term Resilience [longtermresilience.org]