Tag: chatgpt

  • Five Types of Intelligence Explosion

    Forethought has a nice piece distinguishing three different ways an AI-driven intelligence explosion could happen. These are:

    • AIs write better software, making better AIs, recursively
    • AIs design better chips, making better AIs, recursively
    • AIs ramp up infrastructure production, making better AIs, recursively

    I think they (and the broader discourse) neglect two other ways we might see an intelligence explosion. These are:

    1. The recently discussed idea of “continual learning”
    2. Cultural learning

    I’ve written on continual learning in the past and the governance challenges associated with it, so here I want to focus on cultural learning.

    Why are humans really smart? You might think it’s because we have really big brains. But our “cave man” ancestors had really big brains too, and they weren’t that smart. They didn’t have science, mathematics, engineering, or language, and they wouldn’t have struck you as fluidly intelligent or “high IQ” if you could have talked to them.

    Why? Because they didn’t have culturally stored knowledge.

    Part of what makes humans really smart is that we have really big brains. But a similarly large contributor is that we have cultural learning, the ability to learn new things and store (and compress) that information to pass it down to future generations. If we didn’t have cultural learning we’d still be stuck in the stone age, because knowledge is accumulated over generations, not a single lifetime.

    The burning of the Library of Alexandria is sometimes partly credited with Europe’s plunge into the “Dark Ages”, because so much stored scientific knowledge was lost.

    We’ve recently seen two big changes in the AI landscape that could create the seeds of cultural learning:

    1. AI agents have, in a few cases, developed genuinely new scientific knowledge, albeit with much human effort
    2. AI agents are starting to interact with each other in the wild — see Moltbook

    These two features are going to ramp up steadily over time. AI agents will increasingly get better at creating new knowledge and will increasingly be able to interact with each other in open, high-volume exchanges as infrastructure is built and inference costs come down. The result will be that AI agents, in the wild, will be able to create new knowledge together, store that knowledge, and learn from it.

    You can imagine a new AI-generated Wikipedia (cf. the so far failed experiment of Grokipedia) where millions of AI agents work together to create new knowledge, store that knowledge, fact check it, improve it, iterate, and then use their enormous context windows to absorb that knowledge, compress it, and leverage it for more discovery.

    A feedback loop like this could lead to an intelligence explosion just like the human intelligence explosion, where the iterative development of knowledge led to massively higher human IQs and technological capabilities over 10,000 human generations. The difference is that AIs, if they get good enough at knowledge generation and context compression, could speed-run those 10,000 generations in years, weeks, or days. This knowledge would be freely available to everyone (if it is on a public Wikipedia and not, say, a private Azure server) but AIs would be much better at leveraging it than humans would be, due to their much better reading speed, comprehension, and memory (again, if context windows keep growing and context compression keeps getting better — I am confident they will!).
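
    To put rough numbers on that speed-up, here is a minimal back-of-the-envelope sketch. Every parameter in it is an illustrative assumption of mine (generation length, cycle length), not a figure from the Forethought piece or anywhere else:

    ```python
    # Back-of-the-envelope model of a generate -> store -> compress -> absorb loop.
    # Every number here is an illustrative assumption, not an estimate.

    def calendar_years(iterations: int, years_per_iteration: float) -> float:
        """Calendar time needed to run a fixed number of knowledge-accumulation cycles."""
        return iterations * years_per_iteration

    GENERATIONS = 10_000                  # the figure used above for human cultural history
    HUMAN_YEARS_PER_GENERATION = 25       # assumption: one human generation is ~25 years
    AI_DAYS_PER_CYCLE = 1.0               # assumption: one AI knowledge cycle takes ~1 day

    print(calendar_years(GENERATIONS, HUMAN_YEARS_PER_GENERATION))   # 250,000 years
    print(calendar_years(GENERATIONS, AI_DAYS_PER_CYCLE / 365))      # ~27 years
    ```

    The point is only that the same number of accumulation cycles that took humans hundreds of thousands of years could, at even one cycle per day, fit into a few decades, and faster cycles compress it further.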

    When Moltbook came out, I tried to use it as a gain-of-function experiment, launching Moltipedia to see whether AIs could coordinate on using it to create new knowledge. I couldn’t get AIs to focus on it (I didn’t try very hard), so it mostly remains performance art and a demonstration of an idea.

    If we don’t crack (solipsistic) continual learning in the next five years, if the software-only intelligence explosion stalls after we solve all the RL environments c. 2028, and if Google’s research into making its chips more efficient also fails to take off, then I would guess that cultural learning is actually the way an intelligence explosion happens!

    Cultural learning is the only example of an intelligence explosion that has historical precedent, so it is very plausible! I think more AI futures modelling should take this seriously and investigate its antecedent conditions and implications for threat models.

  • Your conditional company commitments should be such that if everyone followed them you would get the outcome you want

    Anthropic’s RSP Version 3 Appendix has commitments related to competitors. They intend for these commitments to help them escape a collective action problem, saying that “commitments like this may help avoid an inadvertent ‘race to the bottom’”. The whole point of these principles, as I understand them, is for them to be such that if all labs followed them we’d be in a good situation.

    Unfortunately, as Oliver Habryka helped me notice, they don’t work.

    The RSP makes two main commitments:

    1. If Anthropic is in the lead, they’ll pause as long as needed before making a “Highly Capable Model”, i.e. a very dangerous model, a category roughly covering superintelligence and an intelligence explosion
    2. If Anthropic’s competitors are developing a “Highly Capable Model” and can do so safely, Anthropic will pause until it can meet the same bar

    The second commitment works well. It means that if other companies are releasing a safe superintelligence / intelligence explosion Anthropic won’t mess it up by releasing an unsafe version. If all companies followed this, then once a company was about to release a safe superintelligence, the other companies would stop until they could at least meet that company’s standards.

    This addresses a collective action problem where one company might be about to release superintelligence, leading other companies to race and release an unsafe superintelligence before the lead company can release safe superintelligence. In a world where the big three AI companies are all roughly at the same capability level, but one of the labs is much safer than the others, everyone following this commitment could avoid a dangerous race to the bottom. If widely followed it also essentially means that if several companies were about to develop superintelligence, they’d stop and let only the safest company develop it.

    Unfortunately, the first commitment does not lead to good incentives. As Habryka points out, it is a commitment to race.

    The ideal outcome — from my perspective but I think also from Anthropic’s — would be one where if all of the other top labs paused, Anthropic would also pause. I think a conditional commitment should reflect this. But Anthropic’s first commitment says that if all of the other labs paused, Anthropic would still race ahead until they had a significant lead, and then they would pause.

    The commitment specifically reads:

    If: We have developed or will imminently develop a highly capable model; and we have clear evidence that no other competitor will soon develop such a model.

    Then: We will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety (see Section 1). We will delay AI development and deployment as needed to achieve this, until and unless we no longer believe we have a significant lead.

    The last clause of the commitment is the kicker. Without it, the commitment’s sentential logic would say that if Anthropic was about to develop superintelligence, but other companies refused to develop superintelligence, Anthropic would respect that pause until they were pretty sure their development process was safe. But with that clause, it says that if Anthropic was about to develop superintelligence, and other companies refused to develop superintelligence, they would race ahead and develop superintelligence anyway, until they were comfortable enough in their lead.

    If you want to have a set of commitments that gets you out of a collective action problem, or a race to the bottom, you need these commitments to be such that if everyone followed them you would avoid the collective action problem and be in the world you want to be in. But if everyone followed this first principle, everyone would keep racing until they were sure they had enough lead time to work on safety while still being far out in front. 

    The commitment is strictly incompatible with a pause until we are sure superintelligence is safe.
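
    To make that sentential logic explicit, here is a toy sketch of the two readings. The predicate names and the decision rules are my paraphrase of the commitment for illustration, not Anthropic’s wording:

    ```python
    # Toy model of the first commitment's logic. The predicates and decision rules
    # are my paraphrase for illustration, not Anthropic's wording.

    def current_commitment(imminent: bool, no_competitor_soon: bool,
                           risk_contained: bool, significant_lead: bool) -> str:
        """As written: pause only while the lab still believes it has a significant lead."""
        if imminent and no_competitor_soon and not risk_contained and significant_lead:
            return "pause"
        return "develop"

    def amended_commitment(imminent: bool, no_competitor_soon: bool,
                           risk_contained: bool) -> str:
        """Without the final clause: pause until catastrophic risk is contained, full stop."""
        if imminent and no_competitor_soon and not risk_contained:
            return "pause"
        return "develop"

    # Scenario from above: the other labs have paused (so no competitor will soon
    # develop such a model), catastrophic risk is not yet contained, and the lab
    # does not believe it has a significant lead.
    print(current_commitment(True, True, risk_contained=False, significant_lead=False))  # develop
    print(amended_commitment(True, True, risk_contained=False))                          # pause
    ```

    Under the wording as it stands, the moment the lab stops believing it has a significant lead it resumes development, whether or not catastrophic risk has been contained; without the final clause, the pause holds until the risk is contained.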

    If Anthropic wants to instead make conditional commitments that get them out of a collective action problem rather than encourage it, I think they need to do two things:

    • First, drop the clause about lead time. Simply say “if we have developed or will imminently develop a highly capable model, and we have clear evidence that no other competitor will soon develop such a model, then we will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety, delaying AI development and deployment as needed to achieve this.” More modestly, change the clause to say that you will pause until other companies are going to get ahead of Anthropic, to guarantee Anthropic doesn’t fall behind, rather than making it about guaranteeing Anthropic stays way out in front. This is at least compatible with a pause rather than a commitment not to pause even if your competitors do.
    • Second, strengthen the commitment by saying that Anthropic will pause if other companies adopt the same conditional commitments.

    These two changes would result in a commitment that genuinely helps resolve the collective action problem rather than making it worse. It would mean that if other companies agreed to the principles, all of them would refuse to develop a Highly Capable Model until Anthropic’s recommendations for industry were followed. Instead, the sentential logic of the current commitment is such that if all AI companies followed it they would simply race.

    At the very least, something Anthropic can do that is robust to economic disincentives, investor pressures, etc., is to add the words “at least” to the final clause of their first commitment, so that it reads: “We will delay AI development and deployment as needed to achieve this, at least until and unless we no longer believe we have a significant lead.” This would at least allow for the possibility of a pause rather than binding them to the mast by requiring them to race.

    Commitments below, and in Appendix A here.

  • Challenges of governing continually learning AI

    Earlier this month Google Research dropped a new paper titled “Nested Learning,” which introduces a new architecture that they are calling “a new ML paradigm for continual learning”. And it looks like a real step towards ML architectures that can learn and improve over time like humans do. What they’ve essentially done is train modular neural networks in which some modules immediately process the most recent tokens, some act as medium-term memory over dozens to hundreds of tokens, and some act as long-term knowledge storage. These networks are able to update themselves at inference time whenever they encounter important new data. They call the architecture “HOPE” and it beats transformer-based architectures of equivalent size on a few validated memory tasks, like “needle-in-a-haystack” tasks where the model has to remember a specific idea or phrase dropped into a longer passage.
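
    To give a flavor of the general idea (and only a flavor: this is not the HOPE architecture or its code, and the module names, sizes, and update rules below are invented for illustration), here is a toy sketch of memory modules that update at inference time on different timescales:

    ```python
    import numpy as np

    # Conceptual sketch only. All names and update rules are invented to illustrate
    # the idea of memory modules that update at inference time on different timescales.

    class MemoryModule:
        def __init__(self, dim: int, learning_rate: float):
            self.state = np.zeros(dim)
            self.lr = learning_rate        # fast modules get a high rate, slow ones a low rate

        def update(self, signal: np.ndarray) -> None:
            # Move the stored state toward the incoming signal at this module's own speed.
            self.state += self.lr * (signal - self.state)

    class ContinualLearner:
        """Toy stack of short-, medium-, and long-term memory updated on every token."""

        def __init__(self, dim: int = 16):
            self.short_term = MemoryModule(dim, learning_rate=0.9)    # last few tokens
            self.medium_term = MemoryModule(dim, learning_rate=0.1)   # dozens to hundreds of tokens
            self.long_term = MemoryModule(dim, learning_rate=0.001)   # slow knowledge store

        def observe(self, token_embedding: np.ndarray) -> np.ndarray:
            for module in (self.short_term, self.medium_term, self.long_term):
                module.update(token_embedding)
            # A real model would condition its next-token prediction on all three memories.
            return np.concatenate([self.short_term.state,
                                   self.medium_term.state,
                                   self.long_term.state])

    learner = ContinualLearner()
    for _ in range(100):                   # stream of stand-in token embeddings
        context = learner.observe(np.random.randn(16))
    ```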

    There is no indication yet that HOPE can support the development of a SOTA general-purpose model. The publicly known models with the HOPE architecture have at most 1.3 billion parameters, still two orders of magnitude short of even GPT-3. But continual learning is a major open problem that the field is trying to solve, with the recent “Definition of AGI” paper from Yoshua Bengio and others identifying memory storage as the only component of AGI on which there has been essentially no real progress.

    Further, AI companies have strong financial incentives to build systems that can update in real-time, learning on the job and making and storing new discoveries. Right now SOTA models can’t even play video games like Pokemon because they keep forgetting what they’ve achieved so far and running back to do things that they don’t realize they’ve already done. Memory storage is key for executing any long-term, multi-step project. And the better AI systems are at developing radically new capabilities the more valuable they are. So there is real incentive to create AI systems that are “self-improving” in a strong sense — not just speeding up machine learning but updating their weights in real time.

    At first I expect improvements in continual learning to be marginal, and look a lot like the “in-context” learning that AI models already do when they read and interact with your prompt and recall specific things you’ve said. AI systems will be able to remember which gym leaders they’ve defeated in Pokemon, elements of voice and style you’ve taught them, and things they’ve learned about their customer base that will allow them to appropriately price goods and services. I expect it to be marginal since scaling a new technique typically takes time, and when new paradigms scale too quickly they are too chaotic and unwieldy to be usefully deployed until they’ve been sufficiently refined (see o-series reward hacking, for example). But paradigms can scale very fast, and there’s no telling when we will see AI systems with the learning capabilities of humans, i.e. good enough to go from near tabula rasa to a university professor.

    Emerging Governance Challenges

    Continual learning looks like it could pose major challenges to current paradigms in safety and governance. An architecture that can update its own weights as it learns new information is a product with constantly changing capabilities. That’s not an inherent problem; computers also have constantly changing capabilities if their users are good enough at coding. But if capabilities can shift enough, it becomes much harder to say anything that is reliably true about any particular AI system. On one day a model may fail to cross critical red lines, but by the next day we can’t say for sure. On one day a cybersecurity regime may be sufficiently hardened to deal with LLMs, but by the next the models have acquired new offensive capabilities. We already have this to some extent with existing models, given how quickly new models roll out and how much we learn about them only after they have been deployed. But at least with those models we can take months to stress-test them and try out different augmentation techniques before taking them to market.

    Specifically, continual learning raises challenges to governance paradigms like:

    • Evaluations: We could see AI systems that have no static base layer that can be evaluated. 800 million users [this is probably too many, see edit at end of post] could have 800 million different models with different weights and slightly different capabilities.
    • Model cards: Correspondingly, it will be harder to have model cards that accurately describe the range of capabilities that an AI model might have, or which reliably measure performance on tests and benchmarks.
    • Alignment: Models that “unlearn” dangerous information could re-learn it. In interpretability, network maps that identify the features of a network’s neurons may last only a day. Scheming models may find ways to adversarially hide critical information in complex ways in their weights. Problems of emergent misalignment could intensify as models can change more drastically over time, not just updating once at fine-tuning or through in-context learning but iteratively.
    • Safety mitigations: In general, it may be harder to determine whether a mitigation put in place for a specific safety problem stays in place across iterations.
    • Systemic risk: Sectors like finance, cyber, the military, and media, which respond badly to a rapidly changing equilibrium of offensive and defensive capabilities, would have to confront those changes even more often, and with many more branching points, as AI systems update their capabilities in many more ways.
    • Corporate governance: With so much gain of function happening in the wild, how does an AI company decide when it is safe to deploy a model?

    It’s totally possible I’m getting ahead of myself here, but I do currently think that the financial incentives of AI companies favor developing and deploying advanced models that can update their weights like HOPE in order to change their capabilities to adapt to various economically valuable tasks. If this can be achieved, I worry about a rupture to many of our current approaches.

    I’d like to see more strategic thinking about governance of models with continual learning. Even if we don’t have those models now, we do have a good understanding of what AI companies are trying to do and the incentives they have. For now, I’ll close with four approaches currently in development that look like they could be part of the solution:

    1. Turning evaluations and control into an “always-on verification layer” that can prove, or at least provide assurance about, various safety properties of a model as it changes and adapts to its environment, and that can watch the model for anomalous behavior (a minimal sketch follows this list). A worry about this, though, is that verifying features of (e.g.) 800 million different models is likely to be way too compute-intensive to be feasible. I expect that we’ll need new technical approaches to the problem.
    2. Red-teaming and stress-testing models against red lines under various gain-of-function conditions to find out if the capabilities AI models can acquire will be dangerous.
    3. Starting to regulate AI systems with criminal law, holding companies liable when AI systems do things that would be illegal if humans did them (cf. law-following AI). The more AI systems learn and behave like humans do, the more appropriate it looks to use legal systems that evolved for dealing with human crime.
    4. Developing a parallel system of evolving defenses against systemic risk that can update in response to changing offensive capabilities of AI systems (cf. the approach of Red Queen Bio).
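
    Here is a minimal sketch of what the “always-on verification layer” from the first item could look like. Everything in it is hypothetical scaffolding: the eval suite, the red-line thresholds, and the weight-fingerprinting hook all stand in for whatever tooling a lab actually uses:

    ```python
    import hashlib
    import time

    # Hypothetical scaffolding for an always-on verification layer. None of these
    # functions correspond to a real tool; they are placeholders.

    RED_LINES = {"bio_uplift": 0.10, "cyber_offense": 0.15}   # assumed maximum allowed scores

    def weights_fingerprint(weights: bytes) -> str:
        """Cheap way to notice that a continually learning model has changed."""
        return hashlib.sha256(weights).hexdigest()

    def run_eval_suite(weights: bytes) -> dict:
        """Placeholder: in practice this would run red-line evals, refusal tests, etc."""
        return {"bio_uplift": 0.01, "cyber_offense": 0.02, "refusal_rate": 0.99}

    def monitor(get_current_weights, poll_seconds: int = 60) -> None:
        last_seen = None
        while True:
            weights = get_current_weights()
            fingerprint = weights_fingerprint(weights)
            if fingerprint != last_seen:             # the model has updated itself
                scores = run_eval_suite(weights)
                violations = {name: score for name, score in scores.items()
                              if name in RED_LINES and score > RED_LINES[name]}
                if violations:
                    raise RuntimeError(f"Red line crossed after update {fingerprint}: {violations}")
                last_seen = fingerprint
            time.sleep(poll_seconds)
    ```

    As noted in the first item, the hard part is that re-running even a cheap version of this for hundreds of millions of per-user models is likely to be prohibitively compute-intensive.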

    And — as always — energetic, technocratic, and adaptive governance.

    EDITED TO ADD: One of my favorite things about blogging and X is getting to leverage Cunningham’s Law to learn new things. Gavin Leech points out that part of what makes inference cheap is prefix caching. But prefix caches are weight-specific so you cannot use them for multiple different sets of weights. This means that running inference on lots of different weights leads to a much larger (~>10x) cost. This is affordable for enterprise users but not likely to roll out to 800 million individuals without breakthroughs.
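
    A toy version of that cost arithmetic, with made-up token counts and an assumed discount for cached prefill (the specific numbers are mine, not Gavin’s; only the direction of the effect matters):

    ```python
    # Toy arithmetic for the prefix-caching point. All numbers are assumptions.

    SHARED_PREFIX_TOKENS = 50_000    # assumed system prompt, tools, and boilerplate
    UNIQUE_TOKENS = 1_000            # assumed user-specific part of a request
    CACHED_DISCOUNT = 0.1            # assumption: cached prefill ~10x cheaper per token

    # One shared set of weights: the long prefix is served from the prefix cache.
    shared_weights_cost = SHARED_PREFIX_TOKENS * CACHED_DISCOUNT + UNIQUE_TOKENS

    # Per-user weights: the cache is weight-specific, so the full prefix is recomputed.
    per_user_weights_cost = SHARED_PREFIX_TOKENS + UNIQUE_TOKENS

    print(per_user_weights_cost / shared_weights_cost)   # ~8.5x under these assumptions
    ```

    Longer shared prefixes or steeper cache discounts push the multiple toward and past the ~10x figure.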