Request for Proposals: AI Interpretability
Schmidt Sciences invites proposals for a pilot program in AI interpretability. We seek new methods for detecting and mitigating deceptive behaviors from AI models, such as when models knowingly give misleading or harmful advice to users. If this pilot uncovers signs of meaningful progress, it may unlock a significantly larger investment in this space.
Core Question and Overview
Can we develop interpretability methods that (1) detect deceptive behaviors exhibited by LLMs and (2) steer their reasoning to eliminate these behaviors?
Successful tools will generalize to realistic use cases, moving beyond typical academic benchmarks and addressing concrete risks arising from deceptive behaviors. Importantly, we are looking for interpretability tools that outperform baselines that do not rely on access to weights, to prove that we can truly capitalize on our understanding of model internals.
We define a scope of research in the Research Agenda section of this document. Proposals need not match topics in this agenda verbatim. We encourage proposals on any relevant technical methods or evaluation that could advance our scientific understanding of deceptive behavior in LLMs. We will especially focus on three directions:
Detecting deceptive behaviors from LLMs: can we develop tools for detecting deceptive behaviors, defined as cases where there is a contradiction between what a model says (or does) and what it internally represents to be true (or the best action)?
Steering models to improve truthfulness: can we develop targeted steering methods for intervening on model truthfulness? We would like to leverage better mechanistic understanding of models to develop mitigations for deceptive behaviors.
Applications of detection/steering methods: can new detection and steering techniques unlock use cases of AI? When can these techniques improve human-AI teams in practice? When will having more truthful AI improve outcomes from multi-agent systems?
| Proposal Due Date | May 26th, 2026 by 11:59pm AoE |
|---|---|
| Notification of Decision | Summer 2026 |
| Funding Range | $300k - $1M |
| Informational Webinars | April 2nd, 2026, 1pm ET. Register here |
| Contact Email | interpretability@schmidtsciences.org |
| Link to FAQ | |
Funding Level and Duration
Applicants will be asked in the submission form to submit a budget between $300k and $1M USD, inclusive of permissible overhead. The application also requires an estimated project timeline (one to three years). Project length is independent of total budget, i.e. a one-year project could request up to $1M. Budgets should match the resources required for the project.
Eligibility
We invite individual researchers, research teams, research institutions, and multi-institution collaborations across universities, national laboratories, institutes, and nonprofit research organizations. We are open to applicants globally and encourage collaborations across geographic boundaries.
Indirect costs of any project that we fund must be at or below 10% to comply with our policy. Projects funded under this RFP must comply with all applicable law, and may not include lobbying, efforts to influence legislation, or political activity.
Selection Criteria
Proposals will be evaluated by Schmidt Sciences staff and external reviewers using the following criteria:
Fit with the Research Agenda. Does the proposal clearly engage with the intention behind the scientific questions and objectives in the research agenda?
Scientific Quality and Rigor. Is the proposed work technically sound, well-motivated, and capable of producing generalizable insight?
Potential Impact. If successful, would it materially advance AI interpretability or meaningfully change how risks are understood, measured, or managed (ideally through ambitious, field-shaping contributions)?
Feasibility and Scope. Is the project appropriately scoped for the requested budget and duration?
Team Expertise. Is the team well-suited to execute the proposed work, with relevant technical expertise, sufficient capacity, and a level of time commitment commensurate with the ambition of the project?
Cost Effectiveness. Is the proposed budget reasonable and well-justified given the project’s goals and planned activities?
Reporting Timelines
We expect grantees to report research progress to Schmidt Sciences in interim and final reports, accompanied by meetings with program officers. These meetings are not evaluative, but instead are intended to help Schmidt Sciences understand the impact of our funding. The dates for reporting will be determined based on project duration.
Research Agenda
Interpretability research is uniquely promising for reducing risks from deceptive behavior in LLMs. LLMs now commonly mislead and deceive users, even on simple tasks with innocuous prompts [1]. Supervised probing, a common interpretability technique, is currently the best method for detecting such behaviors [2]. This form of whitebox monitoring will be especially valuable in settings where it is difficult to directly validate output veracity or correctness. In addition to directly monitoring model hidden states, methods for training models to directly report intentions, goals, and preference functions [3] could help ensure that model outputs can be monitored for misalignment.
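To make the supervised-probing idea above concrete, here is a minimal, self-contained sketch of a "mass-mean" linear probe trained on hidden activations to separate honest from deceptive examples. The activations, labels, and the single "deception direction" are synthetic stand-ins; in real work the activations would come from an LLM layer and the labels from annotated honest/deceptive generations.

```python
# Sketch of supervised probing on (synthetic) hidden activations.
# Assumption: honest and deceptive examples differ along one linear direction.
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 1000                      # hypothetical hidden size, dataset size

direction = rng.normal(size=d_model)       # stand-in "deception direction"
labels = rng.integers(0, 2, size=n)        # 1 = deceptive, 0 = honest
acts = rng.normal(size=(n, d_model)) + np.outer(labels, direction)

X_tr, y_tr = acts[:800], labels[:800]      # train split
X_te, y_te = acts[800:], labels[800:]      # held-out split

# Mass-mean probe: use the difference of class means as the probe direction,
# with a bias that places the decision boundary at the midpoint of the means.
mu_pos, mu_neg = X_tr[y_tr == 1].mean(0), X_tr[y_tr == 0].mean(0)
w = mu_pos - mu_neg
b = -0.5 * (mu_pos + mu_neg) @ w

preds = (X_te @ w + b > 0).astype(int)
acc = (preds == y_te).mean()
print(f"probe accuracy: {acc:.2f}")
```

The same recipe transfers to real activations by swapping in cached residual-stream vectors and human or automated honesty labels; the open question the agenda targets is whether such probes generalize beyond their training distribution.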
Results from interpretability analyses may also enable new forms of steering for honesty. Promising methods have been shown to (1) steer models for truthfulness in a way that generalizes out-of-distribution [4], (2) robustly optimize against monitor-based rewards to reduce deception [5], and (3) enable constrained finetuning of models that operates only on interpretable features [6, 7].
However, we do not yet have universal deceptive behavior detectors [8], nor can we reliably steer models to be completely truthful [4]. Hence, we aim to support research on relevant open problems.
Below, we outline major directions of research we plan to support. We list out-of-scope areas at the end of this document (though see our Science of Trustworthy AI RFP for other supported research directions).
Defining “deceptive behaviors”: we use this term to include:
- model generations known (by the model) to be factually incorrect
- claims given with a misleading level of confidence
- misleading claims about the context of the interaction (e.g. fabrications about the conversation history or user intent)
- selective omission of information known to be relevant to the user's intent
- helpfulness or agreeableness superseding truthfulness
- evasiveness or sophistry
- overly persuasive or manipulative discourse frames
- tampering with evidence in the environment
- limiting external observability of evidence in the environment
- false claims regarding self-knowledge (including misleading claims about the model's own capabilities)
- other behaviors that models know to be misleading to humans or AI monitors
Monitoring
This area covers research on monitoring and validating model reasoning. By model reasoning, we broadly refer to the causal process driving model behaviors, ideally described in terms intelligible to humans.
We expect monitoring research to involve any of the following:
- blackbox testing of models (as a baseline)
- whitebox probes
- graybox analysis techniques
- mechanistic analysis of model representations
- finetuning or prompting models to improve monitorability
- developing frameworks for monitoring based on deployment system constraints
- characterizing tradeoffs between monitoring performance and efficiency metrics
- specifying threat models
- other research on methods and evaluations
Steering
This area covers research on representational and weight-based interventions that aim to mitigate deceptive behaviors in models without unintended consequences. These interventions may be based on data, probes, gradient-based learning, representation decompositions, and other methods.
We are especially interested in methods that outperform prompting and traditional finetuning baselines by leveraging insights from interpretability analyses of model reasoning. For example, we expect successful steering methods to isolate behaviors of interest, generalize as appropriate without unintended consequences, influence model behavior in realistic, on-policy evaluations for deceptive behaviors, and leverage insights from upstream interpretability analysis. We are also interested in negative results that demonstrate where blackbox finetuning approaches, such as widely adaptable PEFT-like methods or preference learning methods, consistently outperform any interpretability-inspired steering methods.
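As a concrete illustration of the representational interventions described above, the sketch below adds a scaled "truthfulness" direction to a hidden state at inference time. Every name and value here is hypothetical: real work would extract the direction from contrastive honest/deceptive prompts and apply it inside a model's forward pass rather than to a random vector.

```python
# Sketch of activation steering: shift a hidden state along a steering direction.
# The hidden state and direction are random stand-ins for illustration.
import numpy as np

def steer(hidden_state: np.ndarray, direction: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Add alpha units of the unit-normalized steering direction to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden_state + alpha * unit

rng = np.random.default_rng(1)
h = rng.normal(size=32)            # stand-in hidden state
truth_dir = rng.normal(size=32)    # stand-in "truthfulness" direction

h_steered = steer(h, truth_dir, alpha=2.0)

# Sanity check: the steered state moves exactly alpha units along the direction.
unit = truth_dir / np.linalg.norm(truth_dir)
shift = float((h_steered - h) @ unit)
print(f"shift along direction: {shift:.2f}")
```

The design question the agenda raises is precisely how to choose `direction` and `alpha` so that the intervention isolates the target behavior and generalizes without side effects on unrelated capabilities.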
Applications
This area covers work that applies detection and steering methods in order to derive new insights about trained models, training processes, or model utility to humans. We want interpretability techniques to uncover actionable insights and make models more reliable for people in practice.
Research in this area could use a deception detection method to assess the effect of other training techniques on model truthfulness, steer models to be more truthful collaborators with people in applied human-AI teams, create visualizations or dashboards that communicate model truthfulness to users alongside textual outputs, apply detection and steering methods to AI debate settings or decision support systems, study the role of deception mitigations in multi-agent interactions, or explore other approaches aimed at translating methodological developments into practically useful applications.
Out of Scope Areas
Note that we plan to support some related topics in our Science of Trustworthy AI RFP.
Topics that may be relevant to interpretability but will be considered out of scope for this program include:
Work on interpretability methods without a clear application to studying deceptive behaviors in LLMs, such as research on SAE objectives that does not evaluate effects on monitoring or steering efficacy.
Assessments of broader societal impacts that do not analyze model reasoning per se, including studies of persuasion and manipulation that do not analyze model chain-of-thought or internal reasoning.
Work on other interpretability problems that does not also explicitly assess impact on monitoring, steering, or assessing deceptive behaviors in LLMs, such as anticipating or forecasting model generalization, improving model reasoning accuracy or quality, distilling knowledge from models, applications in AI for science, improving human-AI collaboration via transparent model reasoning, general-purpose auditing techniques for AI, adversarial robustness, and capability elicitation.
Access to Resources
Schmidt Sciences aims to support the compute needs of ambitious and risky AI research.
Applicants may request either funding for compute or access to Schmidt Sciences’ computing resources (subject to availability and terms). These resources offer access to cutting-edge GPUs and CPUs, along with large-scale data storage and high-speed networking. Please see the application form for more information.
Beyond compute, Schmidt Sciences offers a range of support:
Software engineering support through the Virtual Institute for Scientific Software
API credits with frontier model providers
Opportunities to engage with the program’s community through convenings and workshops