A key reason ML systems can be trained to outperform human experts in narrow domains such as protein folding or chess is that, for these well-defined problems, it is easy to produce a reliable reward signal. However, current techniques for aligning frontier models with human goals, such as learning from human feedback, are only as good as the human’s ability to evaluate the model’s output. Safety concerns arise when models are trained in more complex domains where humans do not yet know the answers and where human feedback may therefore fail to direct the AI towards the intended goal. This project aims to improve our understanding of weak-to-strong generalisation, a phenomenon in which models generalise beyond human performance even when the training feedback is unreliable. Specifically, the project will test hypotheses for why weak-to-strong generalisation is possible by studying which training examples contribute most to the model’s predictions. Answers to these questions would have substantial implications for AI progress, as well as for approaches to aligning AI systems with human preferences.
Roger Grosse
University of Bath
Computer science
Artificial Intelligence
University of Toronto
Globalink Research Award