Alignment
Alignment: models that do what we intend.
As models get more capable, the gap between what we ask for and what we actually want is harder to close. We work on training and oversight that keep models doing what we intend, and correctable when they don't.
- 1.1Scalable oversight for tasks humans can't fully check.
- 1.2Honesty and principled abstention under uncertainty.
- 1.3Corrigibility and resistance to specification gaming.
