There are two reasons why an AI system may fail: it may not be capable enough to do the intended task, or it may not be trying to do what we intend (it is misaligned).
Today, we align large language models by fine-tuning them on carefully curated datasets. (See InstructGPT.)
How can we align a model to do the intended task if the task is difficult for humans to evaluate?
RL from human feedback (RLHF) doesn't apply here because the model might be fooling us in ways that are hard for us to detect; i.e., it won't scale.
Can we train the AI to assist humans in evaluation?
Recursive Reward Modeling (RRM) is an extension of RLHF.
So, instead of aligning the model on the difficult task directly, we align it on evaluation-assistance tasks.
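Both RLHF and RRM rest on the same basic ingredient: a reward model trained from human comparisons between outputs. A minimal, illustrative sketch of that pairwise preference loss (a Bradley–Terry-style objective; the linear model and tensor shapes below are placeholders, not any particular system's code):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Pairwise preference loss: the reward model should assign a higher
    score to the human-preferred output than to the rejected one."""
    r_pref = reward_model(preferred).squeeze(-1)  # shape: (batch,)
    r_rej = reward_model(rejected).squeeze(-1)    # shape: (batch,)
    # -log sigmoid(r_preferred - r_rejected), averaged over the batch
    return -F.logsigmoid(r_pref - r_rej).mean()

# Toy usage with a stand-in linear reward model over fixed-size features.
reward_model = torch.nn.Linear(16, 1)
preferred, rejected = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(reward_model, preferred, rejected)
loss.backward()
```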
Think of book summarization: instead of trying to summarize the whole book at once, we summarize each chapter, then produce a summary from those chapter summaries.
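A minimal sketch of this recursive decomposition, assuming a hypothetical summarize(text) call to an underlying model (the function name and grouping scheme are illustrative, not taken from the original work):

```python
def summarize_book(chapters, summarize, group_size=5):
    """Recursively summarize: summarize each chapter, then summarize
    groups of summaries until a single book-level summary remains."""
    summaries = [summarize(ch) for ch in chapters]
    while len(summaries) > 1:
        # Merge neighbouring summaries into groups and summarize each group.
        groups = [summaries[i:i + group_size]
                  for i in range(0, len(summaries), group_size)]
        summaries = [summarize("\n\n".join(g)) for g in groups]
    return summaries[0]
```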
How far can we push this idea? That is, what is the largest set of tasks we can align our models on by training evaluation assistance?
We can also leverage our models' ability to generalize to make expensive evaluation much cheaper.
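Sketching one way this could look (reward_model here is assumed to already be trained on a relatively small set of expensive human comparisons; everything else is hypothetical): the learned model scores arbitrarily many candidate outputs, so no human has to judge each one.

```python
def best_of_n(candidates, reward_model):
    """Rank many candidate outputs with a learned reward model; the expensive
    human judgments went into training the model, not into scoring outputs."""
    scores = [reward_model(c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```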
Ideally, we would reach a ground truth using human evaluations alone, but this is not feasible.
We would like to automate the cognitive labor required for evaluation, so human evaluators can focus more on providing preference input.