Studying weak-to-strong generalisation using influence functions

A crucial reason that ML systems can be trained to outperform human experts in narrow domains such as protein folding or chess is that, for these well-defined problems, it is easy to produce a reliable reward signal. However, current techniques for aligning frontier models with human goals, such as human feedback, are only as good as humans’ ability to evaluate the model’s output. Safety concerns arise when models are trained in more complex domains where humans do not yet know the answers and where human feedback may therefore fail to direct the AI towards the intended goal. The aim of this project is to improve our understanding of weak-to-strong generalisation, a phenomenon in which models can generalise beyond human performance even when the training feedback is unreliable. Specifically, this project will test hypotheses for why weak-to-strong generalisation is possible by using influence functions to study which training examples contribute most to the model’s predictions. Progress on these questions would have substantial implications for the development of AI, as well as for approaches to aligning AI systems with human preferences.

Faculty Supervisor:

Roger Grosse

Student:

Partner:

University of Bath

Discipline:

Computer science

Sector:

Artificial Intelligence

University:

University of Toronto

Program: