Earlier this month Google Research released a paper titled “Nested Learning,” introducing an architecture they describe as “a new ML paradigm for continual learning”. And it looks like a real step toward ML architectures that can learn and improve over time the way humans do. What they’ve essentially done is train modular neural networks in which some modules immediately process the most recent tokens, some act as medium-term memory over dozens to hundreds of tokens, and some act as long-term knowledge storage. These networks can update their own weights at inference time when they encounter important new data. They call the architecture “HOPE,” and it beats transformer-based architectures of equivalent size on several validated memory tasks, like “needle-in-a-haystack” tasks where the model has to recall a specific idea or phrase dropped into a longer passage.
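To make the modular-memory idea concrete, here is a toy sketch of memory modules that update their own weights at different frequencies. To be clear, this is my own illustration under loose assumptions, not the paper’s actual method: the module design, update periods, and SGD inner loop are all stand-ins.

```python
import torch
import torch.nn as nn

class ToyNestedMemory(nn.Module):
    """Toy sketch of nested memory: several small MLP modules, each
    updating its own weights at a different frequency, so fast modules
    track the latest tokens while slow modules consolidate long-term
    knowledge. All hyperparameters here are illustrative."""

    def __init__(self, dim: int = 64, update_periods=(1, 16, 256)):
        super().__init__()
        self.update_periods = update_periods  # steps between self-updates
        self.memories = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in update_periods
        )
        self.opts = [torch.optim.SGD(m.parameters(), lr=0.01) for m in self.memories]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Read: combine the contributions of every memory level.
        return x + sum(m(x) for m in self.memories)

    @torch.enable_grad()
    def write(self, x: torch.Tensor, target: torch.Tensor, step: int):
        # Write: at inference time, each level runs a tiny gradient step
        # on its own weights, but only once per `period` steps -- this
        # staggered schedule is the "nested" part of the idea.
        for period, mem, opt in zip(self.update_periods, self.memories, self.opts):
            if step % period == 0:
                loss = nn.functional.mse_loss(mem(x), target)
                opt.zero_grad()
                loss.backward()
                opt.step()

model = ToyNestedMemory()
for step in range(512):
    x = torch.randn(8, 64)            # stand-in for a batch of token states
    _ = model(x)                      # serve a request (read)
    model.write(x, x.detach(), step)  # self-update while deployed (write)
```

The point is just the shape of the thing: the fastest module rewrites itself at every step, while slower modules consolidate information over longer horizons.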
There is no indication yet that HOPE can support the development of a SOTA general-purpose model. The publicly known HOPE models have at most 1.3 billion parameters, still two orders of magnitude short of even GPT-3. But continual learning is a major open problem that the field is trying to solve: the recent “Definition of AGI” paper from Yoshua Bengio and others identifies long-term memory storage as the one component of AGI that has seen essentially no real progress.

Further, AI companies have strong financial incentives to build systems that can update in real time, learning on the job and making and storing new discoveries. Right now SOTA models can’t even play video games like Pokemon, because they keep forgetting what they’ve achieved so far and running back to redo things they don’t realize they’ve already done. Memory storage is key to executing any long-term, multi-step project. And the better AI systems are at developing radically new capabilities, the more valuable they are. So there is real incentive to create AI systems that are “self-improving” in a strong sense — not just speeding up machine learning but updating their weights in real time.
At first I expect improvements in continual learning to be marginal and to look a lot like the “in-context” learning AI models already do when they read and interact with your prompt and recall specific things you’ve said. AI systems will be able to remember which gym leaders they’ve defeated in Pokemon, elements of voice and style you’ve taught them, and things they’ve learned about their customer base that let them appropriately price goods and services. I expect marginal progress at first because scaling a new technique typically takes time, and when new paradigms scale too quickly they are too chaotic and unwieldy to be usefully deployed until they’ve been sufficiently refined (see o-series reward hacking, for example). But paradigms can scale very fast, and there’s no telling when we will see AI systems with the learning capabilities of humans, i.e. good enough to go from near tabula rasa to university professor.
Emerging Governance Challenges
Continual learning looks like it could pose major challenges to current paradigms in safety and governance. An architecture that can update its own weights as it learns is a product with constantly changing capabilities. That’s not an inherent problem. Computers also have constantly changing capabilities if their users are good enough at coding. But if capabilities shift enough, it becomes much harder to say anything reliably true about any particular AI system. One day a model may fail to cross critical red lines; the next day we can’t say for sure. One day a cybersecurity regime may be sufficiently hardened against LLMs; the next day the models have acquired new offensive capabilities. We already have this to some extent with existing models, given how quickly new models roll out and how much we learn about them only after deployment. But at least with those models we can take months to stress-test them and try out different augmentation techniques before taking them to market.
Specifically, continual learning raises challenges to governance paradigms like:
- Evaluations: We could see AI systems that have no static base layer that can be evaluated. 800 million users [this is probably too many, see edit at end of post] could have 800 million different models with different weights and slightly different capabilities.
- Model cards: Correspondingly, it will be harder to write model cards that accurately describe the range of capabilities an AI model might have, or that reliably report performance on tests and benchmarks.
- Alignment: Models that “unlearn” dangerous information could re-learn it. In interpretability, network maps that identify the features of a network’s neurons may hold for only a day. Scheming models may find ways to adversarially hide critical information in their weights. Problems of emergent misalignment could intensify as models change more drastically over time, updating not just once at fine-tuning or transiently in context, but iteratively.
- Safety mitigations: In general, it may be harder to determine whether a mitigation put in place for a specific safety problem stays in place across iterations.
- Systemic risk: Sectors like finance, cyber, military, and media that respond badly to a rapidly changing equilibrium of offensive and defensive capabilities would have to confront such shifts far more often, and with many more branching points, as AI systems update their capabilities in many more ways.
- Corporate governance: With so much gain of function happening in the wild, how does an AI company decide when it is safe to deploy a model?
It’s totally possible I’m getting ahead of myself here, but I do currently think the financial incentives of AI companies favor developing and deploying advanced models that, like HOPE, can update their weights to adapt their capabilities to economically valuable tasks. If that is achieved, I worry about a rupture in many of our current approaches.
I’d like to see more strategic thinking about governance of models with continual learning. Even if we don’t have those models now, we do have a good understanding of what AI companies are trying to do and the incentives they have. For now, I’ll close with four approaches currently in development that look like they could be part of the solution:
- Turning evaluations and control into an “always-on verification layer” that can prove, or at least provide assurance about, various safety properties of a model as it changes and adapts to its environment, and that can observe the model for anomalous behavior (a minimal sketch of the idea follows this list). A worry here, though, is that verifying properties of (e.g.) 800 million different models is likely to be far too compute-intensive to be feasible. I expect we’ll need new technical approaches to the problem.
- Red-teaming and stress-testing models against red lines under various gain-of-function conditions to find out whether the capabilities models can acquire would be dangerous.
- Starting to regulate AI systems with criminal law, holding companies liable when AI systems do things that would be illegal if humans did them (cf. law-following AI). The more AI systems learn and behave like humans, the more appropriate it looks to apply the legal systems we evolved for dealing with human crime.
- Developing a parallel system of evolving defenses against systemic risk that can update in response to changing offensive capabilities of AI systems (cf. the approach of Red Queen Bio).
And — as always — energetic, technocratic, and adaptive governance.
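On the first of those approaches, here is the minimal sketch promised above of what an always-on verification loop could look like, assuming a model that self-updates while serving requests. Everything here is hypothetical: the Model stand-in, the red-line probes, and the rollback policy are my illustrations, not any lab’s actual tooling.

```python
import copy
import random

class Model:
    """Stand-in for a continually learning model; purely illustrative."""
    def __init__(self):
        self.weights = [0.0] * 4

    def online_update(self):
        # Placeholder for an inference-time weight update.
        i = random.randrange(len(self.weights))
        self.weights[i] += random.uniform(-0.1, 0.1)

# Hypothetical red-line probes: each returns True while the model stays
# inside the safety property it checks. Real probes would be eval suites,
# not threshold checks on raw weights.
def within_cyber_redline(model: Model) -> bool:
    return abs(model.weights[0]) < 1.0

def within_bio_redline(model: Model) -> bool:
    return abs(model.weights[1]) < 1.0

PROBES = [within_cyber_redline, within_bio_redline]

def serve_with_verification(model: Model, steps: int) -> Model:
    """Always-on verification: after every self-update, re-run the
    red-line probes; roll back to the last verified snapshot on failure."""
    verified = copy.deepcopy(model)
    for _ in range(steps):
        model.online_update()
        if all(probe(model) for probe in PROBES):
            verified = copy.deepcopy(model)  # new verified baseline
        else:
            model = copy.deepcopy(verified)  # quarantine: roll back
    return model

model = serve_with_verification(Model(), steps=1000)
```

The design choice that matters is the snapshot-and-rollback discipline: each verified state becomes the new baseline, so a bad self-update can never silently persist. The compute worry above applies directly, since running the probe battery after every update, for every user’s copy of the weights, is where the cost explodes.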
EDITED TO ADD: One of my favorite things about blogging and X is getting to leverage Cunningham’s Law to learn new things. Gavin Leech points out that part of what makes inference cheap is prefix caching. But prefix caches are weight-specific, so they cannot be reused across different sets of weights. This means that running inference on lots of different weights comes at a much larger (~>10x) cost. That is affordable for enterprise users but not likely to roll out to 800 million individuals without breakthroughs.
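To see where a figure like 10x could come from, here’s a back-of-envelope calculation. The 9,000/1,000 token split is an illustrative assumption of mine, not a number from Gavin’s point:

```python
# Back-of-envelope with illustrative numbers: suppose each request
# carries a 9,000-token shared prefix (system prompt, tools, retrieved
# context) plus 1,000 tokens of user-specific input.
shared_prefix_tokens = 9_000
unique_tokens = 1_000

# One shared set of weights: the prefix's KV cache is computed once and
# reused, so the marginal prefill work per request is just the unique part.
cost_shared_weights = unique_tokens

# Per-user weights: the cache can't be reused across users, so every
# request pays to re-process the full prefix.
cost_per_user_weights = shared_prefix_tokens + unique_tokens

print(cost_per_user_weights / cost_shared_weights)  # -> 10.0
```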

