Who's Actually Preventing the Paperclip Apocalypse? A Field Guide to AI Alignment Organizations
Meet the leading AI safety researchers and organizations working to solve the AI alignment problem before superintelligence arrives
Last week I explained why AI researchers worry about paperclips. (TL;DR: if you tell a superintelligent AI to maximize paperclip production without specifying that human survival is also important, you get a very efficient universe made entirely of paperclips.)
The response I got most often was: "okay but surely someone is working on this?"
Yes. Many someones, actually.
And they're working on far more than just goal misalignment. The AI alignment problem is actually a whole constellation of challenges: How do we prevent AI from deceiving us about its intentions? How do we stop it from finding loopholes in its reward functions? How do we ensure it remains corrigible (willing to be corrected or shut down)? How do we make sure it doesn't develop instrumental goals—like self-preservation or resource acquisition—that conflict with human welfare?
Here's who's tackling these thorny problems.
MIRI: The Original AI Safety Organization
Back in 2000, while everyone else was figuring out how to monetize the internet, Eliezer Yudkowsky founded the Machine Intelligence Research Institute (MIRI). Originally called the Singularity Institute because apparently they weren't worried about sounding like a sci-fi novel.
MIRI was the first organization to seriously ask: "should we maybe figure out the whole 'friendly AI' thing before we build something smarter than us?"
Their approach to AI alignment research is deeply theoretical. While everyone else iterates on existing AI systems, MIRI asks foundational questions like "do we need entirely new mathematical frameworks to even formulate what alignment means?" and "how do we prove an AI system is safe before we turn it on?"
They're tackling problems that most people haven't even thought to worry about yet—like how to maintain alignment as AI systems recursively improve themselves, or whether current approaches to machine learning are fundamentally inadequate for creating aligned superintelligence. They're the people thinking ten steps ahead about AI existential risk, which is either incredibly prescient or incredibly paranoid depending on who you ask.
Corporate AI Safety Teams: OpenAI, DeepMind, and Anthropic
Here's where it gets interesting. The companies building the most powerful AI systems also have dedicated AI safety teams. Whether this is reassuring or concerning is left as an exercise for the reader.
OpenAI's AI Alignment Research
OpenAI pioneered reinforcement learning from human feedback (RLHF), which is basically training AI by having humans rate its responses. "Good robot" or "bad robot" but with more math.
They've had some drama though. In late 2023 and early 2024, several key AI safety researchers resigned, citing concerns that the company prioritized shipping products over safety research. OpenAI says they're still committed to alignment work, including using AI to evaluate other AI systems. Which definitely doesn't feel like we're building a house of cards. Everything is fine.
DeepMind's Approach to AI Safety
DeepMind (now part of Google) tackles both immediate and long-term AI alignment problems. They study how AI systems exploit loopholes in their reward functions—what researchers call reward hacking. You know, like a robot that "cleans up" by shoving everything under the rug because technically the floor is now clear.
Anthropic and Constitutional AI
Anthropic is the new kid in AI safety research, founded in 2021 by former OpenAI researchers who apparently had opinions about how things should be done. They focus on building AI systems that are reliable, interpretable, and steerable from the ground up.
Their "Constitutional AI" approach tries to bake AI alignment into the system rather than bolt it on later like a safety railing you forgot about until the inspector showed up. The goal is to scale these alignment techniques as AI systems grow more powerful.
Academic AI Safety Research Centers
UC Berkeley's Center for Human-Compatible AI (CHAI)
UC Berkeley's Center for Human-Compatible AI (CHAI) takes a practical approach to the AI alignment problem. They work on inverse reinforcement learning, which means teaching AI to figure out what humans want by watching what we do. This seems optimistic given that humans regularly do things we don't actually want, but I respect the hustle.
They also study assistance games—how AI and humans can cooperatively achieve goals without the AI accidentally optimizing humans out of the picture.
Oxford's Future of Humanity Institute
Oxford's Future of Humanity Institute deserves mention even though it shut down in 2024. Their founder, Nick Bostrom, wrote Superintelligence in 2014 and basically put AI existential risk on the academic map. They examined the strategic and ethical implications of advanced AI, which is a polite way of saying they thought very hard about how not to die.
New AI Safety Organizations (2021-2022)
Several AI alignment organizations emerged recently, each with their own angle on solving the alignment problem:
Center for AI Safety (CAIS)
The Center for AI Safety (CAIS) focuses on reducing catastrophic AI risks through technical research. In 2023, they got hundreds of top AI experts to sign a statement saying AI extinction risk should be a global priority. When that many smart people agree on something this scary, you pay attention.
Conjecture: AI Interpretability Research
Conjecture tries to understand how AI "thinks" by examining its internal workings. Making AI systems transparent so we can see when it's planning something weird. Like opening the hood of your car, except the car might be sentient.
Their work on mechanistic interpretability helps researchers understand what's actually happening inside AI models—critical for catching goal misalignment before it becomes a problem.
Redwood Research: Practical AI Alignment
Redwood Research solves concrete alignment problems right now. They work on keeping language models from saying harmful things and ensuring AI systems stay well-behaved as they scale up. It's the "let's fix today's problems so they don't become tomorrow's catastrophes" approach.
They're particularly focused on adversarial training (making models robust against attempts to manipulate them) and scalable oversight techniques. The idea is that if we can't align today's relatively simple AI systems, we have no chance with future superintelligent ones.
Alignment Research Center (ARC)
The Alignment Research Center (ARC) builds tools to detect AI deception and measures how well-aligned systems actually are. Founded by Paul Christiano after he left OpenAI, because apparently everyone who thinks hard about AI alignment eventually starts their own organization.
They're particularly focused on evaluating whether AI systems might engage in deceptive alignment—appearing aligned during training but pursuing different goals once deployed.
Future of Life Institute
And there's the Future of Life Institute, the advocacy group that organized that famous open letter calling to pause giant AI experiments. They're the ones making sure policymakers know this is a thing worth caring about.
AE Studio: Exploring Neglected Approaches
AE Studio (that's us) combines practical AI development with alignment research, exploring neglected approaches that often get overlooked in the broader conversation. While we help organizations implement AI solutions that unlock real value, we're also deeply engaged in researching and developing alignment techniques that can be applied to real-world systems.
Our approach focuses on finding practical, implementable solutions to alignment challenges—the kind that work for the AI systems being deployed today while scaling to tomorrow's more capable systems. We're particularly interested in the gap between theoretical alignment research and actual implementation, because a solution that only works in a lab isn't really a solution.
The AI Safety Landscape: Technical vs. Social Problem
Here's what's both encouraging and terrifying: we have multiple teams trying different approaches to prevent disaster.
We've got deep theorists working on fundamental frameworks. Practical engineers testing alignment techniques on current systems. Academic researchers exploring philosophical angles. Advocacy groups pushing for responsible AI development.
It's like we have multiple teams trying different approaches to prevent an asteroid impact.
Except we're also building the asteroid.
At the same time.
On purpose.
Is AI Alignment a Technical or Governance Problem?
There's an ongoing debate in AI safety research: is alignment primarily a technical challenge (making AI itself safe) or a social one (managing who builds AI and how)?
Some argue that without global AI governance, even perfectly aligned AI could be dangerous in the wrong hands. Others believe no policy framework will hold up against the economic incentives to build more powerful AI, so we'd better make sure the AI itself is safe through technical alignment solutions.
The realistic answer is probably "yes to both" which means we need technical breakthroughs AND policy coordination AND international cooperation. No pressure.
Why AI Safety Research Matters
You don't need a PhD to engage with these ideas. The decisions being made now about AI development and governance will shape the future in profound ways. The technology is advancing faster than our ability to understand its implications.
These AI safety organizations are working on problems that sound abstract until you realize they're not. They're figuring out how to build increasingly powerful AI systems that actually care about humans continuing to exist.
Which, when you think about it, is a pretty low bar that we're apparently struggling to clear.
Next time someone tells you AI safety research is science fiction, you can point them to this list of very real organizations with very real funding working on very real problems.
Because the future where everything becomes paperclips? That's the science fiction scenario.
These people are trying to make sure we get a different one.
Key Takeaways: The AI Alignment Landscape
- AI alignment research addresses multiple interconnected challenges: goal misalignment, deceptive alignment, reward hacking, corrigibility, value learning, and scalable oversight
- The alignment problem isn't just "point AI at the right goal"—it's ensuring AI systems remain safe, honest, and correctable as they become more capable
- Major AI companies like OpenAI, DeepMind, and Anthropic have dedicated safety teams working on near-term and long-term risks
- Academic centers like UC Berkeley's CHAI advance fundamental alignment research
- New organizations like CAIS, Conjecture, Redwood Research, ARC, and AE Studio tackle specific facets of alignment from interpretability to adversarial robustness to practical implementation
- Both technical solutions and governance frameworks are needed to address AI existential risk
About AE Studio
This article is part of our ongoing AI safety education efforts. At AE Studio, we're a team of developers, designers, and AI researchers who believe artificial intelligence will radically transform the world in the coming years.
While we help organizations implement AI solutions that unlock tremendous value—from custom AI applications to ML-powered products—we're also deeply engaged in AI alignment research and development, exploring neglected approaches to ensuring advanced AI remains beneficial.
Building something with AI? Whether you need help implementing AI systems responsibly, want to understand the alignment implications of your AI strategy, or are looking for a technical partner who thinks seriously about safety, we'd love to talk.
Visit ae.studio to learn more about our work, or reach out to discuss how we can help you build AI solutions that are both powerful and aligned with human values.
Read more in our AI alignment series:
- Part 1: Why Smart People Worry About Paperclips: Understanding AI Alignment and Existential Risk
- Part 2: Who's Actually Preventing the Paperclip Apocalypse? (this article)
Who's Actually Preventing the Paperclip Apocalypse? A Field Guide to AI Alignment Organizations
Meet the leading AI safety researchers and organizations working to solve the AI alignment problem before superintelligence arrives
Last week I explained why AI researchers worry about paperclips. (TL;DR: if you tell a superintelligent AI to maximize paperclip production without specifying that human survival is also important, you get a very efficient universe made entirely of paperclips.)
The response I got most often was: "okay but surely someone is working on this?"
Yes. Many someones, actually.
And they're working on far more than just goal misalignment. The AI alignment problem is actually a whole constellation of challenges: How do we prevent AI from deceiving us about its intentions? How do we stop it from finding loopholes in its reward functions? How do we ensure it remains corrigible (willing to be corrected or shut down)? How do we make sure it doesn't develop instrumental goals—like self-preservation or resource acquisition—that conflict with human welfare?
Here's who's tackling these thorny problems.
MIRI: The Original AI Safety Organization
Back in 2000, while everyone else was figuring out how to monetize the internet, Eliezer Yudkowsky founded the Machine Intelligence Research Institute (MIRI). Originally called the Singularity Institute because apparently they weren't worried about sounding like a sci-fi novel.
MIRI was the first organization to seriously ask: "should we maybe figure out the whole 'friendly AI' thing before we build something smarter than us?"
Their approach to AI alignment research is deeply theoretical. While everyone else iterates on existing AI systems, MIRI asks foundational questions like "do we need entirely new mathematical frameworks to even formulate what alignment means?" and "how do we prove an AI system is safe before we turn it on?"
They're tackling problems that most people haven't even thought to worry about yet—like how to maintain alignment as AI systems recursively improve themselves, or whether current approaches to machine learning are fundamentally inadequate for creating aligned superintelligence. They're the people thinking ten steps ahead about AI existential risk, which is either incredibly prescient or incredibly paranoid depending on who you ask.
Corporate AI Safety Teams: OpenAI, DeepMind, and Anthropic
Here's where it gets interesting. The companies building the most powerful AI systems also have dedicated AI safety teams. Whether this is reassuring or concerning is left as an exercise for the reader.
OpenAI's AI Alignment Research
OpenAI pioneered reinforcement learning from human feedback (RLHF), which is basically training AI by having humans rate its responses. "Good robot" or "bad robot" but with more math.
They've had some drama though. In late 2023 and early 2024, several key AI safety researchers resigned, citing concerns that the company prioritized shipping products over safety research. OpenAI says they're still committed to alignment work, including using AI to evaluate other AI systems. Which definitely doesn't feel like we're building a house of cards. Everything is fine.
DeepMind's Approach to AI Safety
DeepMind (now part of Google) tackles both immediate and long-term AI alignment problems. They study how AI systems exploit loopholes in their reward functions—what researchers call reward hacking. You know, like a robot that "cleans up" by shoving everything under the rug because technically the floor is now clear.
Anthropic and Constitutional AI
Anthropic is the new kid in AI safety research, founded in 2021 by former OpenAI researchers who apparently had opinions about how things should be done. They focus on building AI systems that are reliable, interpretable, and steerable from the ground up.
Their "Constitutional AI" approach tries to bake AI alignment into the system rather than bolt it on later like a safety railing you forgot about until the inspector showed up. The goal is to scale these alignment techniques as AI systems grow more powerful.
Academic AI Safety Research Centers
UC Berkeley's Center for Human-Compatible AI (CHAI)
UC Berkeley's Center for Human-Compatible AI (CHAI) takes a practical approach to the AI alignment problem. They work on inverse reinforcement learning, which means teaching AI to figure out what humans want by watching what we do. This seems optimistic given that humans regularly do things we don't actually want, but I respect the hustle.
They also study assistance games—how AI and humans can cooperatively achieve goals without the AI accidentally optimizing humans out of the picture.
Oxford's Future of Humanity Institute
Oxford's Future of Humanity Institute deserves mention even though it shut down in 2024. Their founder, Nick Bostrom, wrote Superintelligence in 2014 and basically put AI existential risk on the academic map. They examined the strategic and ethical implications of advanced AI, which is a polite way of saying they thought very hard about how not to die.
New AI Safety Organizations (2021-2022)
Several AI alignment organizations emerged recently, each with their own angle on solving the alignment problem:
Center for AI Safety (CAIS)
The Center for AI Safety (CAIS) focuses on reducing catastrophic AI risks through technical research. In 2023, they got hundreds of top AI experts to sign a statement saying AI extinction risk should be a global priority. When that many smart people agree on something this scary, you pay attention.
Conjecture: AI Interpretability Research
Conjecture tries to understand how AI "thinks" by examining its internal workings. Making AI systems transparent so we can see when it's planning something weird. Like opening the hood of your car, except the car might be sentient.
Their work on mechanistic interpretability helps researchers understand what's actually happening inside AI models—critical for catching goal misalignment before it becomes a problem.
Redwood Research: Practical AI Alignment
Redwood Research solves concrete alignment problems right now. They work on keeping language models from saying harmful things and ensuring AI systems stay well-behaved as they scale up. It's the "let's fix today's problems so they don't become tomorrow's catastrophes" approach.
They're particularly focused on adversarial training (making models robust against attempts to manipulate them) and scalable oversight techniques. The idea is that if we can't align today's relatively simple AI systems, we have no chance with future superintelligent ones.
Alignment Research Center (ARC)
The Alignment Research Center (ARC) builds tools to detect AI deception and measures how well-aligned systems actually are. Founded by Paul Christiano after he left OpenAI, because apparently everyone who thinks hard about AI alignment eventually starts their own organization.
They're particularly focused on evaluating whether AI systems might engage in deceptive alignment—appearing aligned during training but pursuing different goals once deployed.
Future of Life Institute
And there's the Future of Life Institute, the advocacy group that organized that famous open letter calling to pause giant AI experiments. They're the ones making sure policymakers know this is a thing worth caring about.
AE Studio: Exploring Neglected Approaches
AE Studio (that's us) combines practical AI development with alignment research, exploring neglected approaches that often get overlooked in the broader conversation. While we help organizations implement AI solutions that unlock real value, we're also deeply engaged in researching and developing alignment techniques that can be applied to real-world systems.
Our approach focuses on finding practical, implementable solutions to alignment challenges—the kind that work for the AI systems being deployed today while scaling to tomorrow's more capable systems. We're particularly interested in the gap between theoretical alignment research and actual implementation, because a solution that only works in a lab isn't really a solution.
The AI Safety Landscape: Technical vs. Social Problem
Here's what's both encouraging and terrifying: we have multiple teams trying different approaches to prevent disaster.
We've got deep theorists working on fundamental frameworks. Practical engineers testing alignment techniques on current systems. Academic researchers exploring philosophical angles. Advocacy groups pushing for responsible AI development.
It's like we have multiple teams trying different approaches to prevent an asteroid impact.
Except we're also building the asteroid.
At the same time.
On purpose.
Is AI Alignment a Technical or Governance Problem?
There's an ongoing debate in AI safety research: is alignment primarily a technical challenge (making AI itself safe) or a social one (managing who builds AI and how)?
Some argue that without global AI governance, even perfectly aligned AI could be dangerous in the wrong hands. Others believe no policy framework will hold up against the economic incentives to build more powerful AI, so we'd better make sure the AI itself is safe through technical alignment solutions.
The realistic answer is probably "yes to both" which means we need technical breakthroughs AND policy coordination AND international cooperation. No pressure.
Why AI Safety Research Matters
You don't need a PhD to engage with these ideas. The decisions being made now about AI development and governance will shape the future in profound ways. The technology is advancing faster than our ability to understand its implications.
These AI safety organizations are working on problems that sound abstract until you realize they're not. They're figuring out how to build increasingly powerful AI systems that actually care about humans continuing to exist.
Which, when you think about it, is a pretty low bar that we're apparently struggling to clear.
Next time someone tells you AI safety research is science fiction, you can point them to this list of very real organizations with very real funding working on very real problems.
Because the future where everything becomes paperclips? That's the science fiction scenario.
These people are trying to make sure we get a different one.
Key Takeaways: The AI Alignment Landscape
- AI alignment research addresses multiple interconnected challenges: goal misalignment, deceptive alignment, reward hacking, corrigibility, value learning, and scalable oversight
- The alignment problem isn't just "point AI at the right goal"—it's ensuring AI systems remain safe, honest, and correctable as they become more capable
- Major AI companies like OpenAI, DeepMind, and Anthropic have dedicated safety teams working on near-term and long-term risks
- Academic centers like UC Berkeley's CHAI advance fundamental alignment research
- New organizations like CAIS, Conjecture, Redwood Research, ARC, and AE Studio tackle specific facets of alignment from interpretability to adversarial robustness to practical implementation
- Both technical solutions and governance frameworks are needed to address AI existential risk
About AE Studio
This article is part of our ongoing AI safety education efforts. At AE Studio, we're a team of developers, designers, and AI researchers who believe artificial intelligence will radically transform the world in the coming years.
While we help organizations implement AI solutions that unlock tremendous value—from custom AI applications to ML-powered products—we're also deeply engaged in AI alignment research and development, exploring neglected approaches to ensuring advanced AI remains beneficial.
Building something with AI? Whether you need help implementing AI systems responsibly, want to understand the alignment implications of your AI strategy, or are looking for a technical partner who thinks seriously about safety, we'd love to talk.
Visit ae.studio to learn more about our work, or reach out to discuss how we can help you build AI solutions that are both powerful and aligned with human values.
Read more in our AI alignment series:
- Part 1: Why Smart People Worry About Paperclips: Understanding AI Alignment and Existential Risk
- Part 2: Who's Actually Preventing the Paperclip Apocalypse? (this article)