Benchmarking Costs Soar as AI 'Reasoning' Models Emerge

As artificial intelligence (AI) technology continues to evolve, so-called "reasoning" AI models have become a research hotspot. These models can work through problems step by step, much like humans, and are considered more capable than non-reasoning models in specific fields such as physics. However, this advantage comes with high testing costs, which makes it difficult to independently validate these models.


According to data from Artificial Analysis, a third-party AI testing organization, evaluating OpenAI's o1 reasoning model on seven popular AI benchmarks (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500) cost $2,767.05 (about RMB 20,191 at current exchange rates). Evaluating Anthropic's Claude 3.7 Sonnet, a "hybrid" reasoning model, cost $1,485.35 (about RMB 10,839), while testing OpenAI's o3-mini-high cost $344.59 (about RMB 2,514). Some reasoning models are relatively inexpensive to test, such as OpenAI's o1-mini at $141.22 (about RMB 1,030), but testing reasoning models as a whole remains costly. To date, Artificial Analysis has spent about $5,200 (about RMB 37,945) evaluating roughly a dozen reasoning models, nearly double the $2,400 the company spent analyzing more than 80 non-reasoning models.

OpenAI's non-reasoning GPT-4o model, released in May 2024, cost only $108.85 to evaluate, compared to $81.41 for Claude 3.6 Sonnet, the non-reasoning predecessor to Claude 3.7 Sonnet. George Cameron, co-founder of Artificial Analysis, told TechCrunch that the organization plans to increase its testing budget as more AI labs develop reasoning models. "At Artificial Analysis, we run hundreds of evaluations per month and have a sizable budget for that," Cameron said, "and we expect that to increase as models are released more frequently."

"AI Analytics isn't the only organization facing rising AI testing costs, says Ross Taylor, CEO of AI startup General Reasoning, who recently spent $580 to evaluate Claude 3.7 Sonnet with about 3,700 unique cues. He recently spent $580 evaluating Claude 3.7 Sonnet with about 3,700 unique cues, and Taylor estimates that just one full test of MMLU Pro, a set of questions designed to assess a model's language comprehension, would cost more than $1,800. "We're moving toward a world where a lab reports x% results in a benchmark test in which they spend y amount of computational resources, but academics have far less than y," Taylor wrote in a recent post on X. " No one has been able to replicate these results."

So why is it so expensive to test reasoning models? The main reason is that they generate a large number of tokens. A token represents a fragment of raw text, for example the word "fantastic" split into the syllables "fan," "tas," and "tic." According to Artificial Analysis, OpenAI's o1 generated more than 44 million tokens across the company's benchmarks, roughly eight times the amount generated by GPT-4o. Most AI companies charge by the token, so costs can add up quickly.
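The arithmetic behind per-token billing is simple to sketch. The Python snippet below estimates the cost of a benchmark run from an output-token count and a per-million-token price; the token figures echo the article, while the prices are purely illustrative assumptions, not any provider's actual rate card.

```python
# Minimal sketch of how per-token billing adds up during benchmarking.
# Token counts echo the article (o1: ~44M tokens, roughly 8x GPT-4o);
# the per-million-token prices are illustrative assumptions only.

def eval_cost(output_tokens: int, usd_per_million: float) -> float:
    """Estimate the cost of a benchmark run billed per output token."""
    return output_tokens / 1_000_000 * usd_per_million

O1_TOKENS = 44_000_000          # tokens o1 generated across the seven benchmarks
GPT4O_TOKENS = O1_TOKENS // 8   # roughly one-eighth as many, per Artificial Analysis

for name, tokens, price in [("o1", O1_TOKENS, 60.0), ("gpt-4o", GPT4O_TOKENS, 10.0)]:
    print(f"{name}: {tokens:,} tokens -> ${eval_cost(tokens, price):,.2f}")
```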

In addition, modern benchmarks typically elicit a large number of tokens from models because they contain questions involving complex, multi-step tasks. Jean-Stanislas Denain, a senior researcher at Epoch AI, says this is because today's benchmarks are more complex, even though the number of questions per benchmark has decreased overall. "They typically try to assess a model's ability to perform real-world tasks, such as writing and executing code, browsing the Internet, and using a computer," Denain said. Denain also noted that the most expensive models have seen their cost per token increase over time. For example, Anthropic's Claude 3 Opus, released in May 2024, was the most expensive model at the time at $75 per million output tokens. OpenAI's GPT-4.5 and o1-pro, released earlier this year, cost $150 and $600 per million output tokens, respectively.
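To put those per-token prices in perspective, the short calculation below applies them to a fixed output budget. The prices come from the paragraph above; the 44-million-token budget is borrowed from the earlier o1 example purely for comparison and is an assumption, not a reported figure for these models.

```python
# How rising per-token prices drive up evaluation cost for a fixed token budget.
# Prices (USD per million output tokens) are from the article; the token budget
# is an illustrative assumption borrowed from the o1 example above.

PRICES_USD_PER_MILLION = {"Claude 3 Opus": 75, "GPT-4.5": 150, "o1-pro": 600}
OUTPUT_TOKENS = 44_000_000  # assumed benchmark-wide output budget

for model, price in PRICES_USD_PER_MILLION.items():
    cost = OUTPUT_TOKENS / 1_000_000 * price
    print(f"{model}: ${cost:,.0f} for {OUTPUT_TOKENS:,} output tokens")
# -> Claude 3 Opus: $3,300 | GPT-4.5: $6,600 | o1-pro: $26,400
```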

"While the performance of models has improved over time and the cost of reaching a given level of performance has certainly dropped dramatically, you still need to pay more if you want to evaluate the biggest and best model at any given time," Deneen said. Many AI labs, including OpenAI, offer free or subsidized access to models to benchmarking organizations for testing purposes. But some experts say this can compromise the fairness of test results -- even without evidence of manipulation, the involvement of AI labs could itself compromise the integrity of assessment scores.
