This was great, thanks! Can I ask 2 questions:
1. You write that OpenAI should be showing error bars on these evaluations. What would those error bars capture? Do you have in mind that they'd run the model many times on the same problems (with some nonzero temperature) to capture the uncertainty in the total % correct that comes from the fact that the output is probabilistic? Or do you have in mind something deeper and closer to the way we usually think about inference from a random sample, i.e. there's some underlying "population" of problems and we're interested in the AI's accuracy in this more general population, but we're only observing a random sample of those problems in a benchmark?
2. Data leakage doesn't actually seem so easy to define -- whether or not something is data leakage seems to depend on whether we're claiming the AI **should** be able to generalize. In some cases this question seems clearer than others. If the problem "x = 2y, solve for y" is in the training set and the AI can solve it, but not "x = 3y, solve for y", is that an example of data leakage? What about "x = 3y+1, solve for y" or "If x is three times as big as y, then y is ___ as big as x"?
You can imagine very easily generating stochastic versions of these benchmarks that perturb certain small details every time they're run, so that the AI never sees the *exact* problem -- but for practical purposes, isn't this essentially still an instance of data leakage if our claim is the broader "AI should be able to solve a broad class of linear algebra problems" rather than the narrower "AI should be able to solve any problem of the form x = Cy"? And zooming back out to the real world, defining the "broad class of problems" we want the benchmark to be representative of seems very difficult too. Maybe this is what benchmarks like SWE-bench are trying to solve --- but e.g. if the AI learns how to fix a specific bug from a GitHub issue in the training set and is then able to fix that bug in many different contexts, then (1) isn't this still useful even if it might be considered an example of data leakage, and (2) how far do those contexts have to get from the original code before it becomes "true knowledge" rather than "data leakage"?
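To make that perturbation idea concrete, here's a toy sketch of the kind of stochastic template I have in mind (the template and function name are purely illustrative):

```python
import random

def perturbed_problem(rng):
    """Instantiate the 'x = C*y, solve for y' template with a random
    coefficient, so the exact problem string (almost) never repeats
    even though the underlying template is fixed."""
    c = rng.randint(2, 99)
    question = f"x = {c}y, solve for y"
    answer = f"y = x/{c}"
    return question, answer

rng = random.Random(0)
question, answer = perturbed_problem(rng)
```

Every draw is novel at the string level, but a model that memorized the template generalizes to it trivially -- which is exactly why string-level novelty seems like a weak defense against the broader leakage claim.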
Thank you for reading, and for the thoughtful questions!
1. You're right to point out the two options -- variance from stochasticity at inference time, and population variance among the questions. I think the latter is more valuable and is what I was referring to here. I'd recommend this paper, which goes into detail on how to generate error bars for LM evals: https://arxiv.org/abs/2411.00640 (though the method mostly makes sense on bigger benchmarks, with ~1000 questions or more).
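For the simplest version of the population-variance reading -- treating the benchmark questions as an i.i.d. sample from a larger population of problems -- the error bar is just the standard error of the mean over per-question scores. A minimal sketch (the function name is my own; the linked paper also handles clustered questions and paired model comparisons):

```python
import math

def benchmark_error_bar(scores, z=1.96):
    """Mean accuracy plus a ~95% confidence interval, treating the
    per-question scores (1 = correct, 0 = incorrect) as an i.i.d.
    sample from a larger question population."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of the per-question scores
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(var / n)
    return mean, (mean - z * sem, mean + z * sem)

# e.g. 800 correct out of 1000 questions
scores = [1] * 800 + [0] * 200
acc, (lo, hi) = benchmark_error_bar(scores)
```

Even at 1000 questions the interval spans a few percentage points, which is often larger than the headline gaps between frontier models.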
2. This is a fantastic question -- and I think it's still an open one. In this post I'm mostly talking about straight-up memorization of the training set: cases where a model is trained directly on benchmark questions (I think this happens surprisingly often). But I think the gray area is also really interesting -- I haven't seen many LLM papers that evaluate on a benchmark and also quantify how similar the benchmark questions are to the training set (that's also just a technically annoying problem when your training set is the whole internet). Your second point is a great one, and is what I was alluding to in footnote 2 -- it's a big can of worms. But I definitely agree that in some cases we basically don't care, because the thing we're training on (or trivial variations of it) *is* the real-world problem. But then we also shouldn't be surprised when our models are 'brittle' to small changes in our use cases.
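One crude way to start quantifying "how similar is this benchmark question to the training set" is verbatim word n-gram overlap, roughly in the spirit of the contamination checks in the GPT-3 report (the choice of n and the function names here are assumptions, not anyone's published method):

```python
def ngrams(text, n):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(question, corpus_ngrams, n=8):
    """Fraction of the question's n-grams that appear verbatim in the
    training corpus: a rough proxy for how close a benchmark item is
    to something the model has already seen."""
    q = ngrams(question, n)
    return len(q & corpus_ngrams) / len(q) if q else 0.0

# Build the corpus index once, then score each benchmark item against it
corpus_ngrams = ngrams("if x is three times as big as y then y is one third as big as x", 8)
```

Of course this only catches near-verbatim leakage -- the "x = 3y+1" gray area above slips right through, which is part of why the question stays open.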
Awesome stuff!