This is one area where advancements in deep learning seem to be outpacing agreed-upon standards. Thanks to decades of interest in artificial intelligence, the academic community had developed a good set of benchmarks to measure and understand the progress of deep learning models. However, the jumps in model quality have come in such a short span of time that most of these benchmarks have become saturated. There are some popular leaderboards, but models have made so much progress that reaching the top now requires custom tricks that fit the nuances of the particular dataset the leaderboard uses as a surrogate for model quality. A useful analogy is the 100-meter sprint at the Olympics. The intent of the event is to identify the fastest runner, but to win the gold medal one has to focus on much more than being a good runner: getting off the blocks quickly, leaning the neck and body forward near the finish line, adjusting to wind speed, and so on. Similarly, to reach the top of some of these leaderboards, models sometimes have to fit the noise in the data. In general, Goodhart's law kicks in: when a measure becomes a target, it ceases to be a good measure.
We are seeing new benchmarks being produced at regular intervals, but very often they are quickly overwhelmed by model advancements. There have been instances where human parity was achieved even before the dataset paper was officially published at a conference. This is one area where we need serious investment in meaningful benchmarks, so that progress can be measured and directed appropriately. Some of the most practical benchmarks we use are real-world production datasets. For example, in question answering, production datasets exhibit challenges such as the various flavors of spelling errors that users type, implicit as well as explicit questions, missing punctuation, imperfect grammar, code mixing, and so on. One could artificially generate such sets, similar to what SQuAD has done, but production datasets provide a natural distribution over these challenging scenarios; if a new model improves the related metrics, there is a high likelihood of real-world impact.
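To make the artificial-generation idea concrete, the sketch below shows one way noisy evaluation questions could be synthesized from clean ones. The specific perturbations (random typos, stripped punctuation, lowercasing) and their probabilities are illustrative assumptions, not a recipe drawn from SQuAD or any production system.

```python
import random
import string

random.seed(0)  # fixed seed so the illustration is reproducible

def inject_typo(text: str) -> str:
    """Replace one random alphabetic character to simulate a typing error."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text))
    if text[i].isalpha():
        return text[:i] + random.choice(string.ascii_lowercase) + text[i + 1:]
    return text

def strip_punctuation(text: str) -> str:
    """Drop punctuation, as often happens in typed search queries."""
    return text.translate(str.maketrans("", "", string.punctuation))

def perturb(question: str) -> str:
    """Apply a random subset of production-style corruptions.

    The 0.5 probabilities are arbitrary choices for demonstration;
    in practice they would be tuned to match observed traffic.
    """
    q = question
    if random.random() < 0.5:
        q = inject_typo(q)
    if random.random() < 0.5:
        q = strip_punctuation(q)
    if random.random() < 0.5:
        q = q.lower()
    return q

clean = "What is the capital of France?"
print(perturb(clean))  # e.g. "what is the capital of france"
```

Even a simple generator like this can stress-test a model, but, as noted above, it only approximates the joint distribution of errors; real production logs capture how these phenomena actually co-occur.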