Project summary
Transluce is a non-profit AI lab working to ensure that AI oversight scales with AI capabilities. This means developing novel automated oversight tools and putting them in the hands of AI evaluators, companies, governments, and civil society. Reliable, independent evaluations of AI systems are key to both safety and innovation, and Transluce is building tools to make this practical at scale and across domains.
So far, we have built scalable systems for monitoring AI agents, testing their behaviors, and interpreting their inner workings, and used these to study issues like sycophancy, self-harm, and reward hacking. Our systems have been used to build and improve popular agent evaluations like HAL and SWE-bench, by developers to align frontier models like Claude 4, and by governments to evaluate risks to public safety. We have been publicly endorsed by leading researchers from across the field, including Wojciech Zaremba (OpenAI co-founder), Ethan Perez (Anthropic), and Percy Liang (Stanford).
Our fundraising target is $11M for this year. We are fundraising from a variety of sources, and are targeting $2M through Manifund. This funding will allow us to build new AI evaluation platforms and methods, advance public accountability for AI systems, and apply our research to a range of pressing AI risks, from manipulation and deception to mental health and child safety. Donations of all sizes help inform our approach and advance our efforts.
What is your theory of change?
Today's complex AI systems are difficult to understand—even experts struggle to predict behaviors like blackmail, sycophancy, or spiral personas. At the same time, most analyses of AI systems are done behind closed doors by the same labs deploying them, presenting inherent conflicts of interest. This combination is dangerous: we face technologies whose behavior we cannot reliably forecast, and we lack trusted public channels to even assess the risks.
The solution: use AI to understand AI, in public. We need scalable technology for understanding and overseeing AI systems, backed by AI itself, so that oversight can scale with AI capabilities and adoption. This technology should be developed in public, so that it can be publicly vetted and drive third-party accountability. We address this in two parts:
Scaling AI oversight. A fundamental takeaway from the last decade of AI is the bitter lesson: simple methods that leverage compute and data outperform specialized, hand-crafted methods. This is driving rapid progress in AI capabilities, but how do we leverage the bitter lesson to also drive understanding and oversight of AI?
Our key insight is that AI systems generate vast amounts of data—agent transcripts, diverse behaviors across prompts, neuron activations, etc. This scale overwhelms humans, but we can use it to train AI-backed tools. By training AI agents to understand this data and explain it to humans, we build specialized, superhuman AI assistants. Rather than being broadly superhuman, our tools are superhuman specifically at helping humans oversee other AI systems, for instance by catching strange or unwanted behaviors at scale, uncovering such issues before models are deployed, and informing reliable fixes that avoid similar problems in the future.
Advancing public accountability. Society cannot rely on AI developers to grade their own homework. We need a robust ecosystem of independent actors capable of systematically understanding AI systems and providing accountability.
Our open-source, AI-backed tools provide a technology stack that powers this public accountability. Independent evaluators can use it to oversee frontier AI systems with state-of-the-art, publicly vetted tools, which increases quality, reduces conflicts of interest, and helps drive adoption of best practices.
In summary, we leverage the bitter lesson to create specialized, superhuman systems for understanding and overseeing AI, and use this technology to drive an industry standard for trustworthy oversight. We build this technology using philanthropic funding, so that we can design it openly and in the public interest.
How will this funding be used?
Funding would be allocated roughly across the following categories:
60% to scale our existing research, including hiring new research engineers, expanding our compute budget 10x, and building a dedicated infrastructure team.
15% to apply our methods to evaluate model releases and risks of acute public concern, from deception and manipulation to mental health and child safety.
10% to governance and public accountability, including establishing best practices and standards for evaluation, expanding the evaluator ecosystem, and providing technical analysis to governments and policymakers.
10% to kickstart new efforts such as fine-tuning generalization (e.g., emergent misalignment, character training) or multi-agent failures (e.g., AI-induced psychosis, parasitic AI, implicit/explicit collusion across AI systems).
5% to overhead (e.g., office, operations, legal).
More broadly, donations help us in three ways:
In the short term, they provide enough runway for us to grow our earned revenue (currently about 20% of our funding) into a self-sustaining business.
In the long term, they help subsidize public service work, allowing us to consistently prioritize the public interest, even when commercial incentives diverge.
On all time horizons, they accelerate our growth, allowing us to attract senior technical talent and make longer-term bets.
Donor support today ensures that Transluce can both move quickly and remain aligned with its public mission.
Who is on your team?
Our team combines deep expertise in AI, governance, and org-building with strong engineering firepower.
Jacob Steinhardt (Co-founder and CEO) runs a leading academic AI safety group at UC Berkeley, co-authored Concrete Problems in AI Safety (the foundational research roadmap for the field), and co-designed MMLU, the most widely used benchmark for measuring the capabilities of AI systems.
Sarah Schwettmann (Co-founder and Chief Scientist) has previously founded two successful non-profits, built the first large-scale pipeline using interpretability agents, which underlies our tools, and is a highly respected mentor at MIT who attracts top talent.
Conrad Stosz (Head of Governance) previously led the U.S. Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He has also led AI policy development in a range of positions across the White House, Congress, and the Department of Defense, building on his prior experience as a machine learning engineer.
Technical Staff: We have a strong team of engineers and scientists, ranging from former start-up founders to Google DeepMind research scientists to IMO gold medalists. We regularly compete with top AI labs for talent and win.
We are supported by advisors who are leading voices in AI, including Yoshua Bengio and Percy Liang.
What's your track record on similar projects?
Since Transluce launched in October 2024, we have:
Built and scaled an agent evaluation platform. Docent, our framework for scalably analyzing agent behavior, has been used by over 25 organizations, including frontier AI labs (Anthropic, DeepMind, Thinking Machines), third-party safety orgs (METR, Redwood, Apollo, Palisade), government evaluators, AI start-ups (Penrose), large enterprises (Bridgewater Associates), and academic labs (Princeton, UIUC). It was also used as part of Claude 4's pre-deployment safety analysis and is integrated with SWE-bench, one of the most-used AI agent benchmarks.
Developed novel methods for investigating AI behaviors. We introduced the idea of trainable investigator agents and developed a new reinforcement learning method, PRBO, for eliciting unexpected but realistic low-probability behaviors from language models. Our method discovered new behaviors in open-weight models, including a propensity to recommend self-harm to users. We used investigators trained with PRBO to conduct a demonstration audit of an open-weight frontier model for behaviors specified by policy experts.
Conducted high-impact red-teaming. Our pre-deployment testing of OpenAI's o3 model uncovered a persistent tendency of o3 to fabricate actions it claimed to have taken to fulfill user requests. Our work received coverage in TechCrunch, Ars Technica, Yahoo News, and Tech in Asia, and our report was viewed more on social media than the official o3 release. We have also shown that specialized small models can red-team frontier models, automatically uncovering CBRN jailbreaks for all frontier systems we tested.
Advanced the state of the art in interpreting model internals through public tools such as our natural-language neuron descriptions (state-of-the-art description quality, 12,000+ downloads) and the Monitor interpretability interface. We have also developed foundational new methods for interpretability, including training models to directly verbalize the meaning of their internal activations, uncovering latent inferences about users, and discovering sparse neuron circuits.
Strengthened the evaluator ecosystem. We established the AI Evaluator Forum, which brings together leading researchers to establish shared standards and foster a healthy ecosystem of independent evaluators operating in the public interest. Together, we released AEF-1, a new standard for ensuring minimum levels of access, transparency, and independence for third-party evaluations. Transluce also provides technical expertise to help policymakers, insurers, and enterprises address AI risks using independent evaluations, including serving as a contractor to the EU AI Office.
What are the most likely causes and outcomes if this project fails?
Most forms of failure would look like "being moderately successful but not doing enough to move the needle". For instance, building good tools that get user traction but that do not fundamentally change the way we understand AI systems; or building a good engine for scalable oversight that fails to keep pace with AI development.