Discussion about this post

Aakaash Rao:

This was great, thanks! Can I ask two questions?

1. You write that OpenAI should be showing error bars on these evaluations. What would those error bars capture? Do you have in mind that they'd run the model many times on the same problems (with some nonzero temperature) to capture the uncertainty in the total % correct that comes from the output being probabilistic? Or do you have in mind something deeper and closer to the way we usually think about inference from a random sample, i.e. there's some underlying "population" of problems, we're interested in the AI's accuracy in this more general population, and we're only observing a random sample of those problems in a benchmark?
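The second notion (the benchmark as a random sample from a population of problems) is the one a percentile bootstrap would capture. A minimal sketch, using hypothetical per-problem results rather than any real evaluation data:

```python
import random

random.seed(0)

# Hypothetical single-run results over 200 benchmark problems:
# 1 = correct, 0 = incorrect (simulated here for illustration).
results = [1 if random.random() < 0.7 else 0 for _ in range(200)]

def bootstrap_ci(results, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for accuracy, resampling *problems*.

    This targets the second source of uncertainty: how much the score
    would move if the benchmark were a different random draw from the
    same underlying population of problems. Rerunning the model at
    nonzero temperature (the first notion) would add a separate layer
    of variance on top of this.
    """
    n = len(results)
    scores = sorted(
        sum(random.choices(results, k=n)) / n for _ in range(n_boot)
    )
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

acc = sum(results) / len(results)
lo, hi = bootstrap_ci(results)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With a few hundred problems the interval is noticeably wide, which is part of why single-number benchmark scores can be misleading.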

2. Data leakage doesn't actually seem so easy to define -- whether or not something is data leakage seems to depend on whether we're claiming the AI **should** be able to generalize. In some cases this question seems clearer than in others. If the problem "x = 2y, solve for y" is in the training set and the AI can solve it, but not "x = 3y, solve for y", is that an example of data leakage? What about "x = 3y+1, solve for y" or "If x is three times as big as y, then y is ___ as big as x"?

You can imagine very easily generating stochastic versions of these benchmarks that perturb certain small details every time they're run, to ensure that the AI has never seen the *exact* problem. But for practical purposes, isn't this essentially still an instance of data leakage if our claim is the broader "AI should be able to solve a broad class of linear algebra problems" rather than the narrower "AI should be able to solve any problem of the form x = Cy"? And zooming back out to the real world, defining the "broad class of problems" we want the benchmark to be representative of seems very difficult too. Maybe this is what benchmarks like SWE-bench are trying to solve --- but e.g. if the AI learns how to fix a specific bug from a GitHub issue in the training set and is then able to solve that bug in many different contexts, then (1) isn't this still useful even if it might be considered an example of data leakage, and (2) how far do those contexts have to get from the original code before it becomes "true knowledge" rather than "data leakage"?
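The stochastic-perturbation idea can be sketched in a few lines. This is a toy illustration (the template and coefficient ranges are invented for this example): every draw has a fresh surface form, yet all instances stay inside the same narrow family y = (x - b) / c, which is exactly the worry about mistaking template-level novelty for class-level generalization:

```python
import random

def make_problem(rng):
    """Generate a perturbed variant of the 'x = Cy, solve for y' template.

    Toy example: the coefficient c and offset b are randomized so the
    exact string is (almost certainly) unseen, but every instance is
    still drawn from the single family y = (x - b) / c.
    """
    c = rng.randint(2, 9)
    b = rng.randint(0, 9)
    if b == 0:
        prompt = f"x = {c}y, solve for y"
        answer = f"y = x / {c}"
    else:
        prompt = f"x = {c}y + {b}, solve for y"
        answer = f"y = (x - {b}) / {c}"
    return prompt, answer

rng = random.Random(42)  # fixed seed so the benchmark is reproducible
problems = [make_problem(rng) for _ in range(5)]
for prompt, answer in problems:
    print(prompt, "->", answer)
```

A model could score 100% here after memorizing one worked example of the template, which says little about the broader class of linear algebra problems.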

Andrew Flax:

Awesome stuff!

