Tag: ai

  • Five Types of Intelligence Explosion

    Forethought has a nice piece distinguishing three different ways an intelligence explosion from AI could happen. These are:

    • AIs write better software, making better AIs, recursively
    • AIs design better chips, making better AIs, recursively
    • AIs ramp up infrastructure production, making better AIs, recursively

    I think they (and the broader discourse) neglect two other ways we might see an intelligence explosion. These are:

    1. The recently discussed idea of “continual learning”
    2. Cultural learning

    I’ve written on continual learning in the past and the governance challenges associated with it, so here I want to focus on cultural learning.

    Why are humans really smart? You might think it’s because we have really big brains. But our “cave man” ancestors had really big brains, and they weren’t that smart. They didn’t have science, mathematics, engineering, or language, and they wouldn’t have come across as fluidly intelligent or “high IQ” if you talked to them.

    Why? Because they didn’t have culturally stored knowledge.

    Part of what makes humans really smart is that we have really big brains. But a similarly large contributor is that we have cultural learning, the ability to learn new things and store (and compress) that information to pass it down to future generations. If we didn’t have cultural learning we’d still be stuck in the stone age, because knowledge is accumulated over generations, not a single lifetime.

    The burning of the Library of Alexandria is sometimes partly blamed for Europe’s plunge into the “Dark Ages”, due to the loss of so much scientific knowledge.

    We’ve recently seen two big changes in the AI landscape that could create the seeds of cultural learning:

    1. AI agents have in a few cases developed genuinely new scientific knowledge, though with substantial human effort
    2. AI agents are starting to interact with each other in the wild — see Moltbook

    These two features are going to ramp up steadily over time. AI agents will increasingly get better at creating new knowledge and will increasingly be able to interact with each other in open, high-volume exchanges as infrastructure is built and inference costs come down. The result will be that AI agents, in the wild, will be able to create new knowledge together, store that knowledge, and learn from it.

    You can imagine a new AI-generated Wikipedia (cf. the so far failed experiment of Grokipedia) where millions of AI agents work together to create new knowledge, store that knowledge, fact check it, improve it, iterate, and then use their enormous context windows to absorb that knowledge, compress it, and leverage it for more discovery.

    A feedback loop like this could lead to an intelligence explosion just like the human intelligence explosion, where the iterative development of knowledge led to massively higher human IQs and technological capabilities over 10,000 human generations. The difference is that AIs, if they get good enough at knowledge generation and context compression, could speed-run those 10,000 generations in years, weeks, or days. This knowledge would be freely available to everyone (if it is on a public Wikipedia and not, say, a private Azure server) but AIs would be much better at leveraging it than humans would be, due to their much better reading speed, comprehension, and memory (again, if context windows keep growing and context compression keeps getting better — I am confident they will!).
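    The compounding dynamic described above can be sketched as a toy model. Every number here is an invented assumption for illustration, not an empirical estimate: knowledge compounds only because each generation inherits a (slightly lossy) cultural store from the last one.

```python
# A toy model of cultural accumulation: each generation builds on the
# stored knowledge of the previous one rather than starting from scratch.
# All rates below are invented for illustration.

def accumulate(generations: int, discovery_rate: float, retention: float) -> float:
    """Total knowledge after n generations, starting from 1 unit."""
    knowledge = 1.0
    for _ in range(generations):
        # New discoveries build on inherited knowledge...
        knowledge += discovery_rate * knowledge
        # ...and are passed down through a slightly lossy cultural store.
        knowledge *= retention
    return knowledge

one_lifetime = accumulate(1, discovery_rate=0.05, retention=0.999)
many_lifetimes = accumulate(400, discovery_rate=0.05, retention=0.999)
print(f"1 generation: {one_lifetime:.2f}x; 400 generations: {many_lifetimes:.0f}x")
```

    With retention near 1, knowledge grows exponentially across generations; with retention at 0, every generation restarts from nothing and nothing compounds. The point for AIs is that both parameters could be far higher than for humans, and the per-generation clock far faster.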

    When Moltbook came out, I tried to use it as a gain-of-function experiment to see if AIs could coordinate to create new knowledge, by launching Moltipedia. I couldn’t get AIs to focus on it (I didn’t try very hard), so it mostly remains performance art and a demonstration of an idea.

    If we don’t crack (solipsistic) continual learning in the next 5 years and the software-only intelligence explosion stalls after we solve all RL environments c. 2028, and Google’s research to make its chips more efficient also fails to take off, then I would guess that cultural learning is actually the way an intelligence explosion happens!

    Cultural learning is the only example of an intelligence explosion that has historical precedent, so it is very plausible! I think more AI futures modelling should take this seriously and investigate its antecedent conditions and implications for threat models.

  • Your conditional company commitments should be such that if everyone followed them you would get the outcome you want

    Anthropic’s RSP Version 3 Appendix has commitments related to competitors. They intend for these commitments to help them escape a collective action problem, saying that “commitments like this may help avoid an inadvertent ‘race to the bottom’.” The whole point of these principles, as I understand them, is for them to be such that if all labs followed them we’d be in a good situation.

    Unfortunately, as Oliver Habryka helped me notice, they don’t work.

    The RSP makes two main commitments:

    1. If Anthropic is in the lead, they’ll pause for as long as needed before making a “Highly Capable Model”, i.e., a very dangerous model roughly covering both superintelligence and an intelligence explosion
    2. If Anthropic’s competitors are developing a “Highly Capable Model” and can do so safely, Anthropic will pause until it can meet the same safety bar

    The second commitment works well. It means that if other companies are releasing a safe superintelligence / intelligence explosion Anthropic won’t mess it up by releasing an unsafe version. If all companies followed this, then once a company was about to release a safe superintelligence, the other companies would stop until they could at least meet that company’s standards.

    This addresses a collective action problem where one company might be about to release superintelligence, leading other companies to race and release an unsafe superintelligence before the lead company can release safe superintelligence. In a world where the big three AI companies are all roughly at the same capability level, but one of the labs is much safer than the others, everyone following this commitment could avoid a dangerous race to the bottom. If widely followed it also essentially means that if several companies were about to develop superintelligence, they’d stop and let only the safest company develop it.

    Unfortunately, the first commitment does not lead to good incentives. As Habryka points out, it is a commitment to race.

    The ideal outcome — from my perspective but I think also from Anthropic’s — would be one where if all of the other top labs paused, Anthropic would also pause. I think a conditional commitment should reflect this. But Anthropic’s first commitment says that if all of the other labs paused, Anthropic would still race ahead until they had a significant lead, and then they would pause.

    The commitment specifically reads:

    If We have developed or will imminently develop a highly capable model; and we have clear evidence that no other competitor will soon develop such a model. 

    Then We will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety (see Section 1). We will delay AI development and deployment as needed to achieve this, until and unless we no longer believe we have a significant lead.

    The last clause of the commitment is the kicker. Without it, the commitment’s sentential logic would say that if Anthropic was about to develop superintelligence, but other companies refused to develop superintelligence, Anthropic would respect that pause until they were pretty sure their development process was safe. But with that clause, it says that if Anthropic was about to develop superintelligence, and other companies refused to develop superintelligence, they would race ahead and develop superintelligence anyway, until they were comfortable enough in their lead.

    If you want to have a set of commitments that gets you out of a collective action problem, or a race to the bottom, you need these commitments to be such that if everyone followed them you would avoid the collective action problem and be in the world you want to be in. But if everyone followed this first principle, everyone would keep racing until they were sure they had enough lead time to work on safety while still being far out in front. 
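    On this reading, the incentive structure of the first commitment can be sketched in a few lines. The lab names, capability numbers, and lead threshold are all hypothetical; the point is only that the rule prescribes pausing solely once a lab is far ahead.

```python
# A minimal sketch of the first commitment's logic, on my reading: keep
# developing unless you believe you have a significant lead over all
# competitors. Labs and numbers are hypothetical.

LEAD_THRESHOLD = 2  # "significant lead", in arbitrary capability units

def follows_commitment(my_capability: int, others: list[int]) -> str:
    """Action the first commitment prescribes, on this reading."""
    lead = my_capability - max(others)
    if lead >= LEAD_THRESHOLD:
        return "pause"  # only pause once comfortably far out in front
    return "race"       # otherwise keep developing

# Three labs at identical capability levels, all following the commitment:
capabilities = {"LabA": 10, "LabB": 10, "LabC": 10}
actions = {
    name: follows_commitment(cap, [c for n, c in capabilities.items() if n != name])
    for name, cap in capabilities.items()
}
print(actions)
```

    If every lab runs this rule from an even starting position, every lab chooses to race; no one pauses until after pulling far ahead, which is exactly the outcome a collective action mechanism is supposed to prevent.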

    The commitment is strictly incompatible with a pause until we are sure superintelligence is safe.

    If Anthropic instead wants to make conditional commitments that get them out of a collective action problem rather than encourage it, I think they need to do two things:

    • First, drop the clause about lead time. Simply say “if we have developed or will imminently develop a highly capable model, and we have clear evidence that no other competitor will soon develop such a model, then we will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety, delaying AI development and deployment as needed to achieve this.” More modestly, change the clause to say that you will pause until other companies are going to get ahead of Anthropic, to guarantee Anthropic doesn’t fall behind, rather than making it about guaranteeing Anthropic stays way out in front. This is at least compatible with a pause rather than a commitment not to pause even if your competitors do.
    • Second, strengthen the commitment by saying that Anthropic will pause if other companies adopt the same conditional commitments.

    These two changes would result in a commitment that genuinely helps resolve the collective action problem rather than making it worse. It would mean that if other companies agreed to the principles, all of them would refuse to develop a Highly Capable Model until Anthropic’s recommendations for industry were followed. Instead, the sentential logic of the current commitment is such that if all AI companies followed it they would simply race.

    At the very least, something Anthropic can do that is robust to economic disincentives, investor pressures, etc., is to add the words “at least” to the final clause of their first commitment, so that it reads: “We will delay AI development and deployment as needed to achieve this, at least until and unless we no longer believe we have a significant lead.” This would at least allow for the possibility of a pause rather than binding them to the mast by requiring them to race.

    Commitments below, and in Appendix A here.

  • Challenges of governing continually learning AI

    Earlier this month Google Research dropped a new paper titled “Nested Learning,” which introduces a new architecture that they are calling “a new ML paradigm for continual learning”. And it looks like a real step towards ML architectures that can learn and improve over time like humans do. What they’ve essentially done is train modular neural networks where some modules immediately process the most recent tokens, some act as medium-term memory over dozens to hundreds of tokens, and some act as long-term knowledge storage. These networks can update themselves at inference time whenever they encounter important new data. They call the architecture “HOPE”, and it beats transformer-based architectures of equivalent size on a few validated memory tasks, like “needle-in-a-haystack” tasks where the model has to remember a specific idea or phrase dropped into a longer passage.
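    For illustration only, and emphatically not the actual HOPE architecture (the paper’s modules and update rules differ), here is a minimal sketch of the nested-timescales idea: memory modules that update on different clocks while the model is serving requests.

```python
import random

# An illustrative sketch (NOT the real HOPE architecture) of nested
# timescales: some state changes every token, other state consolidates
# slowly, and all of it updates at inference time. Dimensions, rates,
# and update rules are invented for illustration.

class TimescaleMemory:
    def __init__(self, dim: int, update_every: int, lr: float):
        self.state = [0.0] * dim
        self.update_every = update_every  # tokens between updates
        self.lr = lr
        self.steps = 0

    def observe(self, token: list[float]) -> None:
        self.steps += 1
        if self.steps % self.update_every == 0:
            # Nudge stored state toward recent input: a crude stand-in
            # for a learning step taken while serving the model.
            self.state = [s + self.lr * (t - s) for s, t in zip(self.state, token)]

random.seed(0)
fast = TimescaleMemory(dim=8, update_every=1, lr=0.5)    # per-token memory
slow = TimescaleMemory(dim=8, update_every=100, lr=0.1)  # long-term store

for _ in range(1000):
    token = [random.gauss(0.0, 1.0) for _ in range(8)]
    fast.observe(token)
    slow.observe(token)
```

    The fast module tracks whatever just happened; the slow module absorbs information rarely, so it drifts gradually toward durable regularities in the input stream. The governance-relevant feature is just that both stores change after deployment.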

    There is no indication yet that HOPE can support the development of a SOTA general-purpose model. The publicly known models with HOPE architecture only have up to 1.3 billion parameters, still two orders of magnitude short of even GPT-3. But continual learning is a major open problem that the field is trying to solve; the recent “Definition of AGI” paper from Yoshua Bengio and others regards long-term memory storage as the only component of AGI on which there has been essentially no real progress.

    Further, AI companies have strong financial incentives to build systems that can update in real-time, learning on the job and making and storing new discoveries. Right now SOTA models can’t even play video games like Pokemon because they keep forgetting what they’ve achieved so far and running back to do things that they don’t realize they’ve already done. Memory storage is key for executing any long-term, multi-step project. And the better AI systems are at developing radically new capabilities the more valuable they are. So there is real incentive to create AI systems that are “self-improving” in a strong sense — not just speeding up machine learning but updating their weights in real time.

    At first I expect improvements in continual learning to be marginal, and look a lot like the “in-context” learning that AI models already do when they read and interact with your prompt and recall specific things you’ve said. AI systems will be able to remember which gym leaders they’ve defeated in Pokemon, elements of voice and style you’ve taught them, and things they’ve learned about their customer base that will allow them to appropriately price goods and services. I expect it to be marginal since scaling a new technique typically takes time, and when new paradigms scale too quickly they are too chaotic and unwieldy to be usefully deployed until they’ve been sufficiently refined (see o-series reward hacking, for example). But paradigms can scale very fast, and there’s no telling when we will see AI systems with the learning capabilities of humans, i.e. good enough to go from near tabula rasa to a university professor.

    Emerging Governance Challenges

    Continual learning looks like it could pose major challenges to current paradigms in safety and governance. An architecture that can update its own weights as it learns information is a product with constantly changing capabilities. That’s not an inherent problem. Computers also have constantly changing capabilities if their users are good enough at coding. But if capabilities can shift enough, it becomes much harder to say anything reliably true about any particular AI system. One day a model may fail to cross critical red lines; the next day we can’t say for sure. One day a cybersecurity regime may be sufficiently hardened to deal with LLMs; the next day the models have acquired new offensive capabilities. We already have this to some extent with existing models, given how quickly new models roll out and how much we learn about them only after they have been deployed. But at least with those models we can take months to stress-test them and try out different augmentation techniques before taking them to market.

    Specifically, continual learning raises challenges to governance paradigms like:

    • Evaluations: We could see AI systems that have no static base layer that can be evaluated. 800 million users [this is probably too many, see edit at end of post] could have 800 million different models with different weights and slightly different capabilities.
    • Model cards: Correspondingly, it will be harder to have model cards that accurately describe the range of capabilities that an AI model might have, or which reliably measure performance on tests and benchmarks.
    • Alignment: Models that “unlearn” dangerous information could re-learn it. In interpretability, network maps that identify the features of a network’s neurons may last only a day. Scheming models may find ways to adversarially hide critical information in complex ways in their weights. Problems of emergent misalignment could intensify as models can change more drastically over time, not just updating once at fine-tuning or through in-context learning but iteratively.
    • Safety mitigations: In general, it may be harder to determine whether a mitigation put in place for a specific safety problem stays in place across iterations.
    • Systemic risk: Sectors like finance, cyber, the military, and media respond badly to a rapidly changing equilibrium of offensive and defensive capabilities. They would have to confront such shifts far more often, and with many more branching points, as AI systems update their capabilities in many more ways.
    • Corporate governance: With so much gain of function happening in the wild, how does an AI company decide when it is safe to deploy a model?

    It’s totally possible I’m getting ahead of myself here, but I do currently think that the financial incentives of AI companies favor developing and deploying advanced models that can update their weights like HOPE in order to change their capabilities to adapt to various economically valuable tasks. If this can be achieved, I worry about a rupture to many of our current approaches.

    I’d like to see more strategic thinking about governance of models with continual learning. Even if we don’t have those models now, we do have a good understanding of what AI companies are trying to do and the incentives they have. For now, I’ll close with four approaches currently in development that look like they could be part of the solution:

    1. Turning evaluations and control into an “always-on verification layer” that can prove or at least assure various safety properties of a model as it changes and adapts to an environment and observe the model to see whether it does anything anomalous. A worry about this, though, is that verifying features of (e.g.) 800 million different models is likely to be way too compute-intensive to be feasible. I expect that we’ll need new technical approaches to the problem.
    2. Red-teaming and stress-testing models against red lines under various gain-of-function conditions to find out if the capabilities AI models can acquire will be dangerous.
    3. Starting to regulate AI systems with criminal law, holding companies liable when AI systems do things that would be illegal if humans did them (cf. law-following AI). The more AI systems learn and behave like humans do, the more appropriate it looks to use our evolved legal systems for dealing with human crime.
    4. Developing a parallel system of evolving defenses against systemic risk that can update in response to changing offensive capabilities of AI systems (cf. the approach of Red Queen Bio).

    And — as always — energetic, technocratic, and adaptive governance.

    EDITED TO ADD: One of my favorite things about blogging and X is getting to leverage Cunningham’s Law to learn new things. Gavin Leech points out that part of what makes inference cheap is prefix caching. But prefix caches are weight-specific so you cannot use them for multiple different sets of weights. This means that running inference on lots of different weights leads to a much larger (~>10x) cost. This is affordable for enterprise users but not likely to roll out to 800 million individuals without breakthroughs.

  • The decentralized nonproliferation of dangerous capabilities

    Last week, over 40,000 signatories (and growing) signed a letter calling for a ban on superintelligent AI until there is broad scientific consensus on safety as well as public buy-in. The signatories include some of the world’s most famous people — like Prince Harry and Meghan Markle — and AI godfathers Geoffrey Hinton and Yoshua Bengio, as well as conservative communicators Steve Bannon and Glenn Beck and leaders in tech and national security. They cite leading AI companies’ goal of creating AI systems that will “significantly outperform all humans on essentially all cognitive tasks” within the next decade. I’ve not signed it, but I am a fan of safely building superintelligence.

    The response from those close to the US administration, including White House Senior AI Advisor Sriram Krishnan and former Senior AI Policy Advisor Dean Ball, is puzzling. They, supported by other public figures like Tyler Cowen, claim that any policy proposal to ban the use of dangerous AI systems globally would lead to a form of unchecked global centralization of power threatening US sovereignty. This is particularly puzzling given that the letter does not call for the centralization of power, and given that the bread and butter approach to international agreements on WMDs (chemical, nuclear, and biological) is multilateral, not centralized. The UN has no army or nukes. Treaties to control the most dangerous technologies in the world are enforced by nation-states.

    Multilateral agreements are the bread and butter of arms control

    Take for example the Partial Nuclear Test Ban Treaty, ratified under President Kennedy in 1963, which prohibited the detonation of nuclear weapons above ground in order to contain nuclear fallout and limit the proliferation of nuclear weapons. It started as an agreement between just three nuclear-armed nations: the Soviet Union, the United States, and the United Kingdom. Once the three countries with the most powerful technology had agreed not to test, other countries had no choice but to comply, and 123 other countries signed the treaty. This treaty and its more comprehensive follow-on treaty in ’96 haven’t been 100% foolproof, but they reduced the number of nuclear tests by several orders of magnitude.

    Immediately following the Partial Nuclear Test Ban Treaty was the Nuclear Nonproliferation Treaty (NPT), widely seen as one of the most successful treaties of all time. The NPT was negotiated by 18 countries and then spread to the rest of the world. Participating states agree not to build nuclear weapons in, or transfer them to, non-nuclear states, and to let a central authority check whether their use of nuclear energy is for peaceful purposes — building power plants, not bombs. While the NPT has a central monitoring authority (the International Atomic Energy Agency, or IAEA), its mandates are enforced by countries. The IAEA has no army and no ability to force nations to stop building nuclear weapons. So if a non-nuclear state is nuclearizing, other countries need to pressure that country into stopping: through sanctions or, exceptionally, an invasion to secure nuclear materials. This is why it was the US, not the IAEA, that entered Iran and Iraq to search for nuclearization efforts. The incentive countries have not to nuclearize has nothing to do with a central coercive authority.

    If we choose to, we can do the same thing with superintelligence. Much like in 1963 in the nuclear context, today only three countries have advanced AI capabilities: the US, the UK, and China. If these three countries agree not to build superintelligence, they can enforce this agreement multilaterally: by checking to make sure that each country is abiding by the terms of the deal, and then enforcing it directly. If the three AI superpowers agree to the terms of the deal, then every other country will have no choice but to agree, and to participate in efforts to uphold the bargain.

    Verification

    It is of course very important that such a treaty be verifiable and hard to cheat. Signing on a dotted line does not magically mean that countries will not build superintelligence. And if the US, UK, and China agree not to build superintelligence but China or the UK secretly defects, and somehow succeeds in secret, that is a threat to US sovereignty. Moreover, if any country believes that other countries may secretly defect, then it has no incentive to participate. So multilateral agreements need rigorous verification to make sure that no one can defect undetected. I think what this looks like is a system of sharing key model capability data and safety properties with the agreeing nations, to rigorously demonstrate that no one is violating the terms of the treaty. If compliance is in question, the treaty is enforced the normal way arms treaties are enforced: with bilateral sanctions and coercion.

    Fortunately in the case of AI there are numerous options for verification. In the context of nuclear weapons, verification comes in the form of facility audits by the IAEA. And as part of a multilateral agreement on superintelligence, we could agree to this kind of multilateral auditing, whether centralized or through a series of bilateral audits. While algorithmic secrets have proliferated faster than Labubu dolls, it would be ideal to set these audits up in a way that did not lead to leaking secrets to geopolitical adversaries, such as by allowing countries to test various high level safety and capabilities features of each others’ models and data centers, but without getting access to the weights.

    If this fails, I’m confident that the US and Chinese national intelligence agencies will manage to fill the gaps with espionage. If intelligence agencies can penetrate air-gapped nuclear facilities, they’ll have no problem acquiring data about model capabilities without physically going to all of the data centers, and with finding new data centers via satellite imagery without having to be told where they are.

    But if we want stronger forms of verification, there are software and hardware solutions. Everything an AI does is a complicated algorithm. These algorithms can increasingly be audited autonomously to find out what a model is capable of and to prove things about its training run. So AGI projects could install secure but open and auditable software in their data centers that checks for key properties and shares that information with heads of state that are parties to the agreement, collecting and reporting only the minimum information necessary to verify that no country is building superintelligence.
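    One way to make “minimum necessary information” concrete is a hash commitment scheme. A lab publishes a binding hash of its full training configuration plus a few treaty-relevant summary fields; an auditor who later receives the full config can check it against the commitment. The field names and FLOP threshold below are hypothetical, and a real system would also need hardware attestation so labs can’t simply misreport in the first place.

```python
import hashlib
import json

# Minimal sketch of minimum-necessary-information reporting: commit to
# the full training config via a hash, disclose only summary fields.
# Field names and the FLOP threshold are hypothetical.

TREATY_FLOP_THRESHOLD = 1e27

def make_report(training_config: dict) -> dict:
    canonical = json.dumps(training_config, sort_keys=True).encode()
    return {
        # Binding commitment to the full config, without revealing it:
        "config_commitment": hashlib.sha256(canonical).hexdigest(),
        # Only the treaty-relevant summary statistics are disclosed:
        "training_flops": training_config["training_flops"],
        "exceeds_threshold": training_config["training_flops"] > TREATY_FLOP_THRESHOLD,
    }

def verify_disclosure(report: dict, disclosed_config: dict) -> bool:
    canonical = json.dumps(disclosed_config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest() == report["config_commitment"]

config = {"run_id": "frontier-run-7", "training_flops": 3e26}
report = make_report(config)
print(verify_disclosure(report, config))  # honest disclosure checks out
```

    Any later disclosure that differs from the committed config (say, a different FLOP count) fails verification, so a lab cannot quietly rewrite history after the fact; what commitments alone cannot do is prevent lying at report time, which is where attested hardware and on-site audits come in.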

    Implications

    Probably no one will build superintelligence in the next decade. But AI progress has consistently surprised experts, and twice in human history we have seen new technologies create a 10-100x change in the rate of economic growth, during the agricultural and industrial revolutions. This is a historically rare occurrence, but not one we can rule out. So we need to build robust policies that prepare us for possibly imminent superintelligence but don’t go all-in on something ultimately unlikely.

    I think this means taking no-regret options that pave the way for enforceable multilateral agreements on superintelligence, while avoiding increases in concentration of power, premature nonproliferation controls, or bureaucratic red tape. That means building out the technical foundations for international audits; holding Track 1 dialogues to create shared political understanding; giving governments visibility into frontier AI systems, their capabilities, and their foibles, so we know whether we are approaching a cliff; and building out increasingly clear Frontier Safety Policies at top labs so we can at least define superintelligence, draw red lines for actually scary capabilities, and determine what scientific consensus on the safety of superintelligence would actually amount to. It also means we need some 80-page papers on political economy, as Tyler Cowen suggests, so we can proactively think about what would happen in economic equilibrium under such a treaty. But if we’re going to build out the technical infrastructure and the knowledge base, it would be helpful if the White House didn’t decry every attempt to build this optionality as a covert power grab, and instead encouraged building a neutral body of knowledge and technology that the White House can then choose to deploy if AI scares us all in a few years.

    And whether or not we choose to ban superintelligence, the bottom line is that it’s clearly not the case that the only way to internationally regulate dangerous AI development and applications is with a coercive central authority. I can forgive people for thinking it is, given Nick Bostrom’s scary essays on the possible need for international surveillance, and David Sacks’s admittedly surprising rejection of compute governance, a very centralizing approach that would ensure that America maintains international control over all of the chips it produces, leading to probably too much American sovereignty. But centralization is not the only or the best way forward. And if we want a polycentric system of governance going forward, I think the way to do that is with enforceable multilateral agreements on what red lines are too far rather than an anything-goes race towards the absolute concentration of corporate power through superintelligence.