As artificial intelligence (AI) technology continues to evolve, so-called "reasoning" AI models have become a research hotspot. These models work through problems step by step, much as humans do, and are considered more capable than non-reasoning models in specific fields such as physics. This advantage, however, comes with high testing costs, which makes it difficult to independently verify these models' claimed capabilities.
According to data from Artificial Analysis, a third-party AI testing organization, evaluating OpenAI's o1 reasoning model on seven popular AI benchmarks (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500) would cost $2,767.05 (roughly RMB 20,191 at current exchange rates). Evaluating Anthropic's Claude 3.7 Sonnet, a "hybrid" reasoning model, cost $1,485.35 (about RMB 10,839), compared with $344.59 (about RMB 2,514) to test OpenAI's o3-mini-high. While some reasoning models are relatively inexpensive to test, such as OpenAI's o1-mini at $141.22 (about RMB 1,030), the cost of testing reasoning models as a whole remains high. To date, Artificial Analysis has spent about $5,200 (roughly RMB 37,945) evaluating about a dozen reasoning models, more than double the roughly $2,400 the company spent analyzing more than 80 non-reasoning models.
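For a rough sense of what those totals imply per model, the back-of-the-envelope sketch below divides the reported spending by the number of models evaluated. The totals and model counts are the figures cited above; the per-model averages are simple derived estimates, not numbers reported by Artificial Analysis.

```python
# Back-of-the-envelope averages from the spending figures quoted above.
# Totals and model counts come from the article; the per-model averages
# are derived estimates, not figures reported by Artificial Analysis.

reasoning_total_usd = 5_200        # ~$5,200 across about a dozen reasoning models
reasoning_model_count = 12

non_reasoning_total_usd = 2_400    # ~$2,400 across more than 80 non-reasoning models
non_reasoning_model_count = 80

print(f"Per reasoning model:     ~${reasoning_total_usd / reasoning_model_count:,.0f}")
print(f"Per non-reasoning model: ~${non_reasoning_total_usd / non_reasoning_model_count:,.0f}")
# Roughly $433 per reasoning model versus about $30 per non-reasoning model.
```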
OpenAI's non-reasoning GPT-4o model, released in May 2024, cost only $108.85 to evaluate, compared with $81.41 for Claude 3.6 Sonnet, the non-reasoning predecessor to Claude 3.7 Sonnet. George Cameron, co-founder of Artificial Analysis, told TechCrunch that the organization plans to increase its testing budget as more AI labs develop reasoning models. "At Artificial Analysis, we run hundreds of evaluations per month and have a sizable budget for that," Cameron said, "and we expect that to increase as models are released more frequently."
"AI Analytics isn't the only organization facing rising AI testing costs, says Ross Taylor, CEO of AI startup General Reasoning, who recently spent $580 to evaluate Claude 3.7 Sonnet with about 3,700 unique cues. He recently spent $580 evaluating Claude 3.7 Sonnet with about 3,700 unique cues, and Taylor estimates that just one full test of MMLU Pro, a set of questions designed to assess a model's language comprehension, would cost more than $1,800. "We're moving toward a world where a lab reports x% results in a benchmark test in which they spend y amount of computational resources, but academics have far less than y," Taylor wrote in a recent post on X. " No one has been able to replicate these results."
So why is it so expensive to test reasoning models? The main reason is that they generate a large number of tokens. A token represents a fragment of text; the word "fantastic," for example, might be split into "fan," "tas," and "tic." According to Artificial Analysis, OpenAI's o1 generated more than 44 million tokens across the company's benchmark runs, roughly eight times as many as GPT-4o. Most AI companies charge by the token, so the costs add up quickly.
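To illustrate how per-token billing turns a benchmark run into a bill, here is a minimal sketch using the 44-million-token figure reported above. The price per million output tokens is an illustrative assumption for this example, not an official rate, and input tokens and any other fees are ignored.

```python
# Minimal sketch: how per-token billing turns a benchmark run into a bill.
# The output-token count comes from the article; the price per million output
# tokens is an illustrative assumption, and input-token costs are ignored.

def output_token_cost(output_tokens: int, price_per_million: float) -> float:
    """Cost (USD) of the output tokens alone, billed per million tokens."""
    return output_tokens / 1_000_000 * price_per_million

o1_output_tokens = 44_000_000       # tokens o1 generated across the benchmark suite (reported)
assumed_price_per_million = 60.0    # assumed USD per 1M output tokens (illustrative)

print(f"Estimated output-token cost: ${output_token_cost(o1_output_tokens, assumed_price_per_million):,.2f}")
# -> about $2,640 under this assumed rate, in the same ballpark as the
#    $2,767.05 total reported above
```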
In addition, modern benchmarks tend to elicit large numbers of tokens from models because they include questions involving complex, multi-step tasks. Jean-Stanislas Denain, a senior researcher at Epoch AI, says this is because today's benchmarks are more complex, even though the number of questions per benchmark has decreased overall. "They typically try to assess a model's ability to perform real-world tasks, such as writing and executing code, browsing the Internet, and using a computer," Denain says. Denain also noted that the most expensive models' per-token prices have risen over time. For example, Anthropic's Claude 3 Opus, released in March 2024, was the most expensive model at the time, costing $75 per million output tokens. OpenAI's GPT-4.5 and o1-pro, released earlier this year, cost $150 and $600 per million output tokens, respectively.
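The sketch below applies those per-million-output-token prices to a single common workload to show how quickly they diverge. The 10-million-token workload is a hypothetical figure chosen purely for illustration; the prices are the ones quoted above.

```python
# Applying the per-million-output-token prices quoted above to one
# hypothetical evaluation workload. The 10M-token workload is an assumption
# for illustration; the prices are the ones cited in the article.

prices_per_million_output_usd = {
    "Claude 3 Opus": 75.0,
    "GPT-4.5": 150.0,
    "o1-pro": 600.0,
}

hypothetical_output_tokens = 10_000_000  # assumed evaluation workload

for model, price in prices_per_million_output_usd.items():
    cost = hypothetical_output_tokens / 1_000_000 * price
    print(f"{model}: ${cost:,.2f}")
# Claude 3 Opus: $750.00 / GPT-4.5: $1,500.00 / o1-pro: $6,000.00
```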
"While the performance of models has improved over time and the cost of reaching a given level of performance has certainly dropped dramatically, you still need to pay more if you want to evaluate the biggest and best model at any given time," Deneen said. Many AI labs, including OpenAI, offer free or subsidized access to models to benchmarking organizations for testing purposes. But some experts say this can compromise the fairness of test results -- even without evidence of manipulation, the involvement of AI labs could itself compromise the integrity of assessment scores.