Which AI Has the Best Programming Ability? Alibaba Tongyi Qianwen (Qwen) Launches CodeElo Benchmark; OpenAI o1-mini Takes First Place, Beating 90% of Human Programmers

January 4, 2025 - Alibaba's Tongyi Qianwen (Qwen) team has released CodeElo, a new benchmark that uses an Elo rating system comparable to human programmers' ratings to assess the competitive-programming ability of large language models (LLMs).

Project Background

Code generation and completion are among the key application scenarios of large language models, yet reliably assessing their real programming ability remains challenging.

Existing benchmarks, including LiveCodeBench and USACO, have limitations: they lack robust private test cases, do not support specialized judging systems, and often use inconsistent execution environments.

CodeElo: Leveraging CodeForces for a More Accurate LLM Evaluation System

To address these challenges, the Qwen research team introduced the CodeElo benchmark, which evaluates LLMs' competitive-programming ability using an Elo rating system comparable to that of human programmers.

CodeElo's problems come from the CodeForces platform, which is known for its rigorous programming competitions. By submitting solutions directly to CodeForces, CodeElo ensures accurate evaluation, avoids issues such as false positives, and supports problems that require special judging mechanisms. In addition, the Elo rating system mirrors human rankings, allowing effective comparison between LLMs and human contestants.

CodeElo's three core elements: comprehensiveness, robustness and standardization


CodeElo is based on three key elements:

  • Comprehensive problem selection: problems are categorized by contest division, difficulty level, and algorithm tags to provide a well-rounded assessment.
  • Robust evaluation methods: submitted code is tested directly on the CodeForces platform, using its special judging mechanisms to ensure accurate verdicts, eliminating the need to collect hidden test cases and providing reliable feedback.
  • Standardized rating calculation: the Elo rating system evaluates code correctness, accounts for problem difficulty, and penalizes errors, incentivizing high-quality solutions and providing a careful, effective tool for evaluating coding models (a minimal rating sketch follows this list).
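The article does not spell out CodeElo's exact rating formula, but the intuition behind an Elo-style rating can be illustrated with a short sketch. The snippet below is a minimal illustration, not CodeElo's actual implementation: the function names, the per-problem difficulty ratings, and the binary-search fitting step are all assumptions made for the example. It estimates the rating at which the expected number of solved problems matches what a model actually solved.

```python
def expected_score(rating: float, problem_rating: float) -> float:
    """Standard Elo expected score: estimated probability that a contestant
    with the given rating solves a problem of the given difficulty."""
    return 1.0 / (1.0 + 10 ** ((problem_rating - rating) / 400))


def estimate_rating(results, lo=0.0, hi=4000.0, iters=60):
    """Binary-search for the rating at which the expected number of solved
    problems equals the observed number.

    `results` is a list of (problem_rating, solved) pairs for one contest.
    """
    solved = sum(1 for _, ok in results if ok)
    for _ in range(iters):
        mid = (lo + hi) / 2
        expected = sum(expected_score(mid, pr) for pr, _ in results)
        if expected < solved:
            lo = mid  # expected too few solves -> the rating guess is too low
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical contest: the model solves only the four easier problems.
results = [(800, True), (1000, True), (1200, True), (1400, True),
           (1600, False), (1800, False), (2000, False), (2200, False)]
print(round(estimate_rating(results)))  # ~1500, between solved and unsolved tiers
```

In this toy setup the estimated rating lands between the difficulty of the hardest problem solved and the easiest one missed, which is the kind of human-comparable number the benchmark reports.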

Test Results

After testing 30 open-source LLMs and 3 proprietary LLMs, OpenAI's o1-mini model performed best, with an Elo score of 1578, outperforming 90% of human participants; among the open-source models, QwQ-32B-Preview topped the list with a score of 1261.


However, many models still struggled even with simple problems and typically ranked in the bottom 20% of human participants. The analysis showed that the models performed well in categories such as mathematics and implementation, but fell short in dynamic programming and tree algorithms.

In addition, the models perform better when coding in C++, consistent with the preferences of competitive programmers. These results highlight areas where LLMs still have room to improve.
