On December 21st, "12 Days of OpenAI"The event has drawn to a close with OpenAI's o3 The series of large models on the stage.Officials claim that in some scenarios, its reasoning ability is very close to that of general-purpose artificial intelligence (AGI).
name
Why the latest AI model skips o2 and is called o3?OpenAI CEO Sam Altman, speaking at a live event this morning, said it's to circumvent a trademark conflict with British telecom operator O2.
Invitation to Security Testing
o3 is the successor to the o1 inference model and contains both a full version and a lite version (o3-mini), the latter of which has been fine-tuned primarily for specific tasks.
OpenAI has not yet fully opened up the o3 and o3-mini models, but is inviting security researchers to sign up for a preview version of the o3-mini model, and then launch a preview version of the o3 model.
Now, interested parties can submit an application at https://openai.com/index/early-access-for-safety-testing/.
Altman has not announced a specific open date for the o3 model, revealing only that the o3-mini will be launched at the end of January 2025, with the o3 to follow.
o3 Model reasoning
One of the biggest differences between OpenAI o3 models and mainstream AI models is that fact-checking is carried out so that some common modeling pitfalls can be circumvented, but this process incurs a response delay of typically a few seconds to a few minutes, depending on the difficulty of the reasoning.
Another highlight of the o3 family of models is the use of a "private chain of thought" for "thinking", which allows one to pause before responding, consider the cues and interpret their reasoning, and ultimately summarize the most accurate answer. the most accurate answer.
One of the new features of o3 is the ability to adjust the inference time, which is categorized into three computation levels: low, medium, and high; the higher the computation level, the better the task execution performance of o3.
Performance and AGI
The full name of AGI is artificial general intelligence, directly translated as general artificial intelligence, which refers to AI that can perform any task like humans, and is officially defined by OpenAI as "a highly autonomous system that exceeds human beings in the most economically valuable work".
The OpenAI company is moving aggressively towards its AGI goals, which have particular implications in the investment space, in addition to solidifying its position in the AI space.
Under the terms of OpenAI's deal with Microsoft, a close partner and investor, the company is no longer obligated to provide its state-of-the-art technology (i.e., technology that meets OpenAI's AGI definition) to Microsoft once OpenAI reaches AGI.
And o3 is OpenAI is an important step toward that goal, in the ARC-AGI benchmarkingThe o3 scored 87.5% on the high compute setting and 75.7% on the low compute setting, tripling the performance of the o1.
Admittedly high-computing setups are very expensive, costing thousands of dollars per task, says ARC-AGI co-founder François Chollet.
Citing the outlet, 1AI reported that the o3 performed well in other benchmarks:
- In the SWE-Bench Verified Programming Task Benchmark, o3 is better than o1 22.8 percentage points higher;
- In the Codeforces Programming Skills Test.o3 has received 2727 ratings;
- In the 2024 U.S. Math Invitational.o3 Score 96.71 TP3T;
- In the GPQA Diamond Postgraduate Level Biology, Physics and Chemistry test.o3 Score 87.7%;
- In EpochAI's Frontier Math Benchmark.o3 Resolved 25.2% (no other model exceeds 2%), setting a new record.
These results are from OpenAI's internal evaluation and await further validation from benchmarking results from external customers and organizations.
Safety
The release of o3 marks an important step for OpenAI in the field of general-purpose AI. o3's capabilities are impressive, but its potential risks require attention. While o3's capabilities are impressive, its potential risks need to be taken seriously, and OpenAI is committed to working on model safety and collaborating with other organizations to build a better benchmarking system.