October 31, 2024 - On October 30 local time, OpenAI announced that, in order to measure the accuracy of language models, it is open-sourcing SimpleQA, a new benchmark that measures the ability of language models to answer short, fact-seeking questions.
One of the open challenges in AI is how to train models to generate factually correct answers. Current language models sometimes produce incorrect output or answers that are not backed by evidence, a problem known as "hallucination". Language models that generate more accurate responses with fewer hallucinations are more reliable and can be used in a wider range of applications.
OpenAI states that its goal with SimpleQA was to create a dataset with the following characteristics:
- High correctness: Reference answers to questions are verified by two independent AI trainers to ensure fairness in scoring.
- Diversity: SimpleQA covers a wide range of topics, from science and technology to TV shows and video games.
- Challenging for frontier models: Compared with earlier benchmarks such as TriviaQA (2017) or NQ (2019), SimpleQA is more challenging, especially for frontier models such as GPT-4o, which scores less than 40%.
- Efficient user experience: SimpleQA questions and answers are concise and clear, so the benchmark runs quickly and can be scored efficiently via the OpenAI API and other APIs. In addition, with 4,326 questions, SimpleQA should have low variance as an evaluation.
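The low-variance point can be made concrete with a rough back-of-the-envelope calculation. The sketch below assumes each question is scored as an independent correct/incorrect trial and uses the normal approximation to a binomial proportion; this is a simplification, not a method taken from OpenAI's write-up. The function names are hypothetical, and the 40% accuracy is only an illustrative value near the GPT-4o figure quoted above.

```python
import math

N_SIMPLEQA = 4326  # number of questions reported for SimpleQA


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate, treating every question
    as an independent Bernoulli trial (an assumption of this sketch)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def confidence_interval_95(accuracy: float, n_questions: int) -> tuple[float, float]:
    """Approximate 95% interval using the normal approximation (z = 1.96)."""
    margin = 1.96 * accuracy_standard_error(accuracy, n_questions)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


# Illustrative accuracy of 40%, not a measured result.
se = accuracy_standard_error(0.40, N_SIMPLEQA)
low, high = confidence_interval_95(0.40, N_SIMPLEQA)
print(f"standard error: {se * 100:.2f} percentage points")
print(f"approx. 95% interval: {low * 100:.1f}% to {high * 100:.1f}%")
```

Under these assumptions the standard error is well under one percentage point, which is why score differences of a few points between models on a benchmark of this size are unlikely to be sampling noise alone.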
SimpleQA is a simple but challenging benchmark for evaluating the factual accuracy of frontier models. Its main limitation is its scope: although SimpleQA is accurate, it only measures factuality in the constrained setting of short, fact-seeking queries that have a verifiable answer.
OpenAI says that whether the factuality a model exhibits in short answers correlates with its performance in long, multi-fact content remains an open research question. It is hoped that open-sourcing SimpleQA will further advance AI research and make models more trustworthy and reliable.
Relevant links:
- Open source repository: https://github.com/openai/simple-evals/
- Paper: https://cdn.openai.com/papers/simpleqa.pdf