By Sophie Bridgers, Rishub Jain, Rory Greig, and Rohin Shah
Based on work by the Rater Assist Team: Vladimir Mikulik, Sophie Bridgers, Tian Huey Teh, Rishub Jain, Rory Greig, Lili Janzer (randomized order, equal contributions)
Human oversight is critical for ensuring that Artificial Intelligence (AI) models remain safe and aligned to human values. But AI systems are rapidly advancing in capabilities and are being used to complete ever more complex tasks, making it increasingly challenging for humans to verify AI outputs and provide high-quality feedback. How can we ensure that humans can continue to meaningfully evaluate AI performance? An avenue of research to tackle this problem is “Amplified Oversight” (also called “Scalable Oversight”), which aims to develop techniques to use AI to amplify humans’ abilities to oversee increasingly powerful AI systems, even if they eventually surpass human capabilities in particular domains.
With this level of advanced AI, we could use AI itself to evaluate other AIs (i.e., AI raters), but this comes with drawbacks (see Section IV: The Elephant in the Room). Importantly, humans and AIs have complementary strengths and weaknesses. We should thus, in principle, be able to leverage these complementary abilities to generate an oversight signal for model training, evaluation, and monitoring that is stronger than what we could get from human raters or AI raters alone. Two promising mechanisms for harnessing human-AI complementarity to improve oversight are:
- Rater Assistance, in which we give human raters access to an AI rating assistant that can critique or point out flaws in an AI output or automate parts of the rating task, and
- Hybridization, in which we combine judgments from human raters and AI raters working in isolation based on predictions about their relative rating ability per task instance (e.g., based on confidence).
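To make the Hybridization idea concrete, here is a minimal sketch of confidence-based routing, in which each task instance takes the judgment of whichever rater reports higher confidence. It is an illustration only, not the team's actual protocol: the `Judgment` class, the `hybridize` and `hybrid_labels` functions, the tie-breaking rule, and the example numbers are all hypothetical, and the sketch assumes that human and AI confidences are calibrated and comparable on the same scale.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One rater's verdict on a single task instance (hypothetical structure)."""
    label: int         # e.g. 1 = output acceptable, 0 = not acceptable
    confidence: float  # predicted reliability in [0, 1]; assumed calibrated


def hybridize(human: Judgment, ai: Judgment) -> Judgment:
    """Pick the judgment from the rater predicted to be more reliable.

    Ties go to the human rater; this tie-break is an arbitrary assumption.
    """
    return human if human.confidence >= ai.confidence else ai


def hybrid_labels(pairs):
    """Apply per-instance routing to a list of (human, ai) judgment pairs."""
    return [hybridize(human, ai).label for human, ai in pairs]


# Illustrative (made-up) judgments for three task instances.
pairs = [
    (Judgment(label=1, confidence=0.9), Judgment(label=0, confidence=0.6)),
    (Judgment(label=0, confidence=0.4), Judgment(label=1, confidence=0.8)),
    (Judgment(label=1, confidence=0.5), Judgment(label=1, confidence=0.5)),
]
labels = hybrid_labels(pairs)
```

In practice, raw confidence is only one possible predictor of relative rating ability. A real protocol might instead learn a separate model that predicts, per instance, which rater is more likely to be correct.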
Designing Rater Assistance and/or Hybridization protocols that enable human-AI complementarity is challenging. It requires grappling with complex questions, such as how to pinpoint the unique skills and knowledge that humans or AIs possess, how to identify when AI or human judgment is more reliable, and how to use AI to improve human reasoning and decision-making without inducing under- or over-reliance on the AI. These are fundamentally questions of Human-Computer Interaction (HCI), Cognitive Science, Psychology, Philosophy, and Education. Luckily, these fields have already explored the same or related questions, and AI safety can learn from and collaborate with them to address these sociotechnical challenges. On our team, we have worked to expand our interdisciplinary expertise to make progress on Rater Assistance and Hybridization for Amplified Oversight.
Read the rest of the blog post here!