Benchmarking large language models (LLMs) is difficult because their main goal, producing human-like text, doesn’t align with traditional performance metrics. Still, measuring progress is essential for understanding how rapidly LLMs are improving. A recent study from Model Evaluation & Threat Research (METR) proposed a new metric called the “task-completion time horizon,” which measures the length of a task, in terms of how long it takes human programmers, that an LLM can complete with a given reliability.
Using this metric, METR found that leading LLMs have improved exponentially, with the time horizon doubling every seven months. Their analysis suggests that by 2030, top LLMs could complete, with 50 percent reliability, complex software or creative tasks such as launching a company, writing a novel, or enhancing an existing model. Many such tasks could be finished far faster by AI than by humans.
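To get a feel for what a seven-month doubling time implies, the short sketch below projects the time horizon forward under simple exponential growth. The baseline one-hour horizon is a hypothetical placeholder for illustration, not a figure from the METR study.

```python
# Illustrative projection of the "task-completion time horizon" under the
# seven-month doubling time reported by METR. The baseline horizon is a
# hypothetical placeholder, not a value taken from the study.

DOUBLING_MONTHS = 7            # doubling period reported by METR
BASELINE_HORIZON_HOURS = 1.0   # hypothetical horizon at 50% reliability today

def projected_horizon(months_ahead: float) -> float:
    """Return the projected horizon (in hours of human effort) after
    `months_ahead` months of growth with a fixed doubling period."""
    return BASELINE_HORIZON_HOURS * 2 ** (months_ahead / DOUBLING_MONTHS)

# Five years out (60 months): 2 ** (60 / 7) is roughly a 380x increase,
# so a one-hour horizon would grow to several weeks of human effort.
print(f"{projected_horizon(60):.0f} hours")
```

The compounding is the key point: even a modest starting horizon multiplies by hundreds within five years if the doubling trend holds.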
The researchers also introduced a “messiness” score, noting that LLMs struggle more with unstructured, real-world tasks than with clean, well-defined ones. While the rapid pace raises concerns reminiscent of self-improving AI scenarios, METR emphasizes that practical constraints, such as hardware limitations and robotics challenges, may prevent runaway growth. Nonetheless, the potential benefits and risks of such powerful systems make accurate benchmarking crucial.
Read more: https://spectrum.ieee.org/large-language-model-performance
