OpenAI Study: Current AI Models Still Can't Compare to Human Programmers

February 24, 2025 - Although OpenAI CEO Sam Altman insists that AI models will be able to outperform "low-level" software engineers by the end of this year, a new study by the company's own researchers suggests that even the most advanced AI models are still no match for human programmers.

In a new paper, the researchers note that even frontier models, that is, the most innovative and boundary-pushing AI systems, "still can't solve most" programming tasks. To measure this, the researchers developed a new benchmark called SWE-Lancer, based on more than 1,400 software engineering tasks from the freelance website Upwork. With this benchmark, OpenAI tested three large language models (LLMs): its own o1 reasoning model, its flagship GPT-4o, and Anthropic's Claude 3.5 Sonnet.

Specifically, the new benchmark evaluates how these LLMs handle two types of Upwork tasks: individual tasks, which involve resolving bugs and implementing fixes, and management tasks, which require the models to make higher-level decisions from a broader perspective. Notably, the models were denied internet access during testing, so they could not simply copy similar answers already available online.

The models took on tasks worth hundreds of thousands of dollars on Upwork, but they could only fix surface-level software problems and could not actually find the bugs in larger projects or identify their root causes. Such "half-baked" solutions will be familiar to anyone with experience working with AI, which is good at producing confident-sounding output that often falls apart under scrutiny.

While the paper notes that the three LLMs can often complete tasks "far faster than humans," they fail to grasp how widespread a bug is or to understand its context, which results in solutions that are "wrong or incomplete."

The researchers explain that Claude 3.5 Sonnet outperformed the two OpenAI models and "earned" more than o1 and GPT-4o in the tests, yet most of its answers were still wrong. The researchers noted that any model needs to be "more reliable" before it can be trusted with real programming tasks.

In short, the paper seems to suggest that while these cutting-edge models can handle narrowly scoped tasks quickly, they are still far less skilled at them than human engineers.

While these large language models have made rapid progress in recent years and will continue to advance, their current skill level in software engineering is still not sufficient to replace humans. Yet 1AI notes that this does not seem to have stopped some CEOs from firing human programmers in favor of these immature AI models.
