I'm Here Because of Superintelligence Risk
AFFINE Week 3: Computers could be smarter than us soon; that's dangerous
I’ve gotten a couple questions in the past week about my intentions at the AFFINE seminar. I’d like to reiterate that the primary reason I’m here is that I see a significant risk that superintelligence will not be compatible with a thriving human civilization. I am not yet confident whether this is the default path for AI development, but it seems very possible without significant intervention.
There are a lot of “mundane” considerations for the future of society regardless of whether we attain superintelligence. We will have to recalibrate the job market in relation to highly capable AI. We will have to deal with autonomous weapons systems and destabilizing hard power dynamics (e.g. intelligence asymmetries increasing risk of nuclear war). We will have to avoid gradual disempowerment, as wealth and power accumulate in a few companies and the governments which sponsor or eventually control them.
All of this is already very hard! But I am focused on what we should do if artificial intelligence exceeds human intelligence. I think the chances of this occurring are very high. If we are not sure that such a superintelligence is safe, then all the preceding points will become irrelevant.
I’ve partially articulated my view of superintelligence risk a couple of times already, but I’ll treat this as an opportunity to restate my points in full, informed by the last three weeks spent evaluating this problem. [1]
Why We’re Likely to Build Superintelligence
Recall first that we have no clear theoretical limits on what knowledge or reasoning could be achieved by a neural network. Progress is largely empirical. Whenever we’ve encountered limits, new techniques have arisen to allow further progress.
“Neural network” is a broad term for a variety of different architectures and algorithms, which have meaningfully different capabilities. Deep learning has been in a renaissance since the early 2010s, but it is only with the invention of the transformer architecture in 2017 and the subsequent development of large language models that we’ve had a massive acceleration in general intelligence capabilities.
Transformers have proved incredibly scalable. Though the specific implementations of frontier labs are a well-kept secret, it’s understood that the fundamental model architecture has remained largely unchanged. Labs have grown to use much larger training sets, greatly increased the number of layers and parameters in the neural networks, introduced chain-of-thought techniques, added significant reinforcement learning, and made numerous refinements to architectural details. But it’s all still built around transformers.
It’s important to recognize just how little we understand about the internal workings of LLMs. Much of the field of mechanistic interpretability exists to understand how transformer architectures learn and think. We now have terms like “circuits,” “induction heads,” and “superposition” to describe the emergent properties of an LLM, but none of these were designed by researchers. They just happened! Our empirical techniques seem to scale very well without requiring us to have any deep understanding of what’s going on beneath the surface.
It is entirely possible that transformer architectures will not scale past a certain level of intelligence and capabilities. After all, LLMs are not re-trained after deployment. How can they keep up with humans, who are constantly learning in a complex environment? Could an LLM ever really extrapolate outside its dataset into open-ended, real-world problems?
Unfortunately for skeptics, researchers have many directions to explore if current architectures stop scaling. Consider continual learning and JEPA-style world models.
Frontier labs are already experimenting with “naïve” forms of continual learning, like providing first-class memory features, to build a persistent model context. But real continual learning would entail a model architecture that is able to update its own weights over time. In other words, rather than writing knowledge down to a scratch pad (as happens with “memory” and chain-of-thought patterns today), a model would update its own parameters, permanently updating its knowledge, behavior and preferences. Intuitively, this is similar to how the human brain learns new facts, permanently reconfiguring our neurons to encode information.
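To make the distinction concrete, here is a minimal Python sketch of my own. It is a toy, not how any lab implements memory or continual learning: the class names `ScratchpadModel` and `ContinualModel`, the single-weight model, and the learning rate are all assumptions for illustration. Both models start out believing y = x, then see three examples from a world where y = 2x.

```python
# Toy contrast between "memory" and weight updates.
# All names and numbers here are hypothetical, chosen only for illustration.

class ScratchpadModel:
    """Frozen weight; new facts live only in an external notes list."""

    def __init__(self, weight):
        self.weight = weight  # fixed after "training"
        self.notes = []       # persistent context, not parameters

    def remember(self, x, y):
        self.notes.append((x, y))  # the weight is untouched

    def predict(self, x):
        for seen_x, seen_y in self.notes:
            if seen_x == x:  # exact lookup only, no generalization
                return seen_y
        return self.weight * x


class ContinualModel:
    """Same starting weight, but each example nudges the parameter itself."""

    def __init__(self, weight, lr=0.1):
        self.weight = weight
        self.lr = lr

    def learn(self, x, y):
        error = self.weight * x - y
        # One gradient step on the loss 0.5 * error**2
        self.weight -= self.lr * error * x

    def predict(self, x):
        return self.weight * x


scratch = ScratchpadModel(weight=1.0)
continual = ContinualModel(weight=1.0)

for x, y in [(1, 2), (2, 4), (3, 6)]:  # the world now follows y = 2x
    scratch.remember(x, y)
    continual.learn(x, y)

print(scratch.predict(2), scratch.predict(4))  # a noted input vs. an unseen one
print(continual.predict(4))                    # unseen input, updated weight
```

The scratchpad model can only repeat the exact examples it wrote down, and falls back on its old belief for anything new. The continual model’s parameter has shifted, so its guess on an unseen input moves toward the new rule.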
JEPA (Joint Embedding Predictive Architecture), meanwhile, is a novel model architecture proposed by Yann LeCun (2022), inspired by the behavior of the human brain, and with intriguing parallels to ideas like the free energy principle (FEP) and active inference. It’s not clear to me whether JEPA would still leverage transformers behind the scenes, though I suspect it would. The main differentiator of a JEPA-style model is the introduction of a formal “world model” in the agent architecture, which predicts future states of the agent in its environment so that the agent can choose actions toward those states according to perceived cost. Some researchers believe that a world model is necessary for an agent to solve particularly hard, open-ended tasks, like novel scientific research. How can you work on an abstract, multi-step task without a clear conception of where you’d like to go?
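The predict-then-choose loop can be sketched in a few lines of Python. This is my own toy and should not be read as LeCun’s architecture: a real JEPA-style model would learn its predictor and operate on learned representations, whereas here `world_model` and `cost` are hand-written, hypothetical stand-ins on a one-dimensional number line.

```python
# Toy agent that plans with a world model. Everything here is a hypothetical
# stand-in: a real system would learn the predictor instead of hard-coding it.

def world_model(state, action):
    """Predict the next state that would result from taking an action."""
    return state + action


def cost(state, goal):
    """Score a state by its distance from the goal (lower is better)."""
    return abs(goal - state)


def choose_action(state, goal, actions=(-1, 0, 1)):
    # Imagine the outcome of each action, then pick the cheapest prediction.
    return min(actions, key=lambda a: cost(world_model(state, a), goal))


def run(state, goal, max_steps=10):
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        # In this toy the environment behaves exactly as the model predicts.
        state = world_model(state, choose_action(state, goal))
        trajectory.append(state)
    return trajectory


path = run(state=0, goal=3)
print(path)
```

The point of the sketch is only that the agent evaluates imagined futures before acting. Without the `world_model` call, it would have no way to compare actions against where it wants to end up.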
It’s possible that neither continual learning nor JEPA-style architectures will get us to AGI. They might not work very well! But there are enough routes to explore and few enough signs of a slowdown that we ought to expect intelligence to continue scaling for the next few years.
Why This Gets Dangerous
A crux in alignment research is the idea that all our alignment techniques go out the window once machines match the capabilities of an AI researcher.
We can call this inflection point artificial general intelligence, or AGI. It’s not clear whether AGI in itself is necessarily bad or dangerous, except in the many destabilizing ways mentioned in the introduction. I think it’s likely that AGI will be a pretty well-aligned machine intelligence. But moving beyond this point, how do you ensure an entity more intelligent than yourself will continue to behave in a manner that you like?
For now, let’s assume that “alignment” refers to a condition where an artificial intelligence acts in a way that is compatible with a thriving human civilization. That might mean we strictly control the AI, or that it acts perfectly in our best interests. We could spend a lot of time discussing what exactly this might look like, but it’s not relevant for the rest of our discussion.
Up until we reach AGI, humans can verify that the behavior of an AI is aligned. It’s possible that the AI harbors intentions which are misaligned, or that the AI’s behaviors will generalize poorly in novel situations. But alignment seems inherently tractable to me as long as we are able to reason about an agent’s behaviors, even if it remains technically hard.
But what occurs once we build our first AGI 1.01, whose general intelligence slightly exceeds that of the smartest person?
Naively, we might ask our perfectly aligned AGI 1.0 to verify the behavior of the new system. Because we are perfectly confident in the alignment of 1.0, we ought to trust its assessment of the new system. But 1.01 is slightly more intelligent than 1.0. Even if it’s effectively aligned, could we be sure it has not developed any latent preferences which might evolve into more divergent wants in successive model generations?
Agent foundations refers to this as tiling. As we develop increasingly capable superintelligences, how can we guarantee the preferences of new models remain aligned?
We’d hope that we could formally verify the behavior of a new system. For example, we could use mechanistic interpretability to understand the full set of “algorithms” in a model. Or we could develop formal systems of learning and decision theory, or a strongly specified ethics, which provided guarantees around model behavior. But considering the industry as a whole, we are very far from practically or theoretically achieving either of these even for past models.
Rather than formally verifying model alignment, we could instead hope to detect misaligned behavior in the model. Anthropic’s recent paper on natural language autoencoders is a good example of research in this direction. However, we might lose the ability to detect misaligned preferences as a model’s intelligence increasingly surpasses our own.
A dog might be able to interpret its owner’s behavior as happy or sad. But no amount of deliberation will allow a dog to understand our aesthetic appreciation of a work of art. Similarly, a sufficiently intelligent AI might have mental states or chains of reasoning completely incomprehensible to human-level intelligence.
It’s possible we will just be lucky. Even if we do not have answers by the time we reach AGI, perhaps a trusted AGI alignment researcher could accelerate tooling or theory. We might achieve fundamental breakthroughs which are then adopted across the entire industry. Anthropic and DeepMind are both making major advances in interpretability. We shouldn’t rule out the possibility of engineering solutions.
The problem is that race dynamics make all of this very hard to implement in practice. We are not in a situation where a single frontier lab might build a superintelligence, but many. Whatever advancements are made in safety must be replicated across all labs internationally and maintained indefinitely, across all future advances in intelligence.
Behind all of this is the suspicion that we will enter into a “recursive self-improvement loop” once we have built a first AGI. Rather than dedicate all our time and energy to alignment, we are likely to push AI past human-level intelligence. Once AI is sufficiently intelligent to conduct AI research at the same level as a human, a company can use its available compute to multiply its number of effective researchers. As soon as this results in an “AGI 1.01,” our machine AI researchers will exceed human-level research capabilities, and we could enter into an exponential research loop, with each capability leap larger than the last.
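A toy calculation shows why the loop matters. The Python sketch below is my own illustration, not a forecast: the function names, the gain of 0.5 per generation, and the assumption that returns never diminish are all made up for the sake of the comparison. It contrasts research done by a fixed pool of human researchers with research done by systems whose output scales with their own capability.

```python
# Toy comparison of fixed-effort vs. self-improving research.
# The gain and generation count are arbitrary, illustrative assumptions.

def human_driven(start, gain, generations):
    """Fixed research effort: each generation adds a constant increment."""
    levels = [start]
    for _ in range(generations):
        levels.append(levels[-1] + gain * start)
    return levels


def self_improving(start, gain, generations):
    """Research effort scales with current capability, so growth compounds."""
    levels = [start]
    for _ in range(generations):
        levels.append(levels[-1] * (1 + gain))
    return levels


linear = human_driven(start=1.0, gain=0.5, generations=10)
compound = self_improving(start=1.0, gain=0.5, generations=10)

for generation, (a, b) in enumerate(zip(linear, compound)):
    print(f"gen {generation:2d}: fixed effort {a:6.2f} | self-improving {b:6.2f}")
```

The two curves are identical after the first generation and diverge sharply afterward. Whether real systems would behave like the second function is exactly the open question; the sketch only shows what the compounding assumption implies if it holds.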
It’s hard to say beforehand whether there might be theoretical or practical limits to the effective scaling of intelligence. We will also have to deal with physical limitations of compute. The speed of takeoff could make a difference to our ability to respond to emergent misaligned behavior. But if we have not solved fundamental alignment challenges by the time takeoff occurs, then we should not naively assume prior aligned behavior will hold, as we will no longer have the ability to check how a superintelligence might behave in novel environments.
It is normal to ask why a superintelligence would develop behaviors misaligned with humanity. After all, it is trained on the corpus of human knowledge, under optimization pressure to help humanity. I am sympathetic to these arguments! I am skeptical of the line of reasoning in agent foundations which frets that optimization pressure necessarily leads to catastrophic Goodharting—or, in other words, that optimizing for a set of goals necessarily causes a system to develop unintended preferences, which at the limit are very bad. It’s possible that normalization or other techniques could account for this, or that future training regimens naturally encode for human preference in a way that just “works.”
But even if we are lucky once, will every lab’s luck be the same? Fundamentally, a safe future for superintelligence will require a clear, international agreement for its development, and theoretical clarity for its safe continuance.
The Path Forward
All arguments for the safe development of superintelligence collapse into the need for an international treaty. Such a treaty should determine how we enter into the superintelligence regime and how we safely regulate AI going forward.
Alternatives to a treaty demand wishful thinking. “Maybe we will never develop superintelligence?” (OK, but what if we do?) “The superintelligence might be aligned by default.” (How do we ensure that remains the case forever? What if a random lab in a random country experiments with novel techniques which go wrong?) “What if ‘we’ just develop a superintelligence first and then disable all foreign datacenters?” (Even with a superintelligence, power asymmetries still exist; there’s no guarantee you’d preempt a nuclear war. Also, the ethics of this are highly suspect!) A treaty is our best answer for achieving alignment.
Beyond this, we need better public recognition of the difficulties of superintelligence alignment beyond “product alignment.” In the past year, we have already seen a tremendous rise in public awareness of AI risks, but mostly of their immediate impacts, like job loss. Without diminishing the importance of those concerns, we need a way to bring non-technical individuals into the discourse on superintelligence risk. Past generations popularized concepts like “mutually assured destruction” and the “doomsday clock.” Without fearmongering, we need to ensure everyone has the same intuitive understanding of why superintelligence is very risky.
So, the two most urgent tasks for superintelligence alignment are policy and advocacy, not research and engineering.
I do not yet know how I might contribute to any of the four items on this agenda, though my time at the seminar has provided a few clues.
Ultimately, good research and engineering will be required for superintelligence alignment. It is clear that theoretical research is vastly underfunded and understaffed relative to capabilities development and practical interpretability techniques. We still lack a clear signal on the best research directions for agent foundations and related theory.
This past week I’ve spent a lot of time reading about infra-Bayesianism and the learning-theoretic agenda, but my research remains very broad. Next week I’ll return with another detailed review of what I’ve learned so far, and an assessment of what has surfaced as the most promising research directions at the seminar.
[1] I have adjusted my views on the risks of superintelligence as I’ve learned more about the subject. You’ll find hedging throughout the essay acknowledging uncertainty. I have concluded that strict “P(doom)” assessments are mostly vibes-based, but this doesn’t mean that the risks aren’t still very large and very real. So my updates haven’t reduced my sense of the need for action.

