Stanford University's Center for Research on Foundation Models (CRFM) released the Massive Multitask Language Understanding (MMLU) on HELM ranking on June 11. Among the top ten language models, two come from Chinese companies: Alibaba's Qwen2 Instruct (72B) and Zero One Everything's (01.AI) Yi Large (Preview).
The MMLU on HELM evaluation uses a test methodology proposed by Dan Hendrycks et al. to measure a text model's accuracy across multi-task learning. The benchmark spans 57 tasks covering elementary mathematics, US history, computer science, law, and other fields. To score highly, a model must possess extensive world knowledge and problem-solving ability. The rankings, as attached by IT Home, are as follows:
▲ Image source: official website of the Stanford Center for Research on Foundation Models
- 1. Claude 3 Opus (20240229): Anthropic (USA, Amazon investment)
- 2. GPT-4o (2024-05-13): OpenAI (USA)
- 3. Gemini 1.5 Pro: Google (USA)
- 4. GPT-4 (0613): OpenAI (USA)
- 5. Qwen2 Instruct (72B): Alibaba (China)
- 6. GPT-4 Turbo (2024-04-09): OpenAI (USA)
- 7. Gemini 1.5 Pro (0409 preview): Google (USA)
- 8. GPT-4 Turbo (1106 preview): OpenAI (USA)
- 9. Llama 3 (70B): Meta (USA)
- 10. Yi Large (Preview): Zero One Everything (China)
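The MMLU-style scoring described above can be sketched in a few lines: grade each multiple-choice answer per subject task, then average accuracy across tasks. This is an illustrative toy sketch, not the actual HELM implementation; the task names and answer data below are hypothetical.

```python
# Illustrative sketch (not the HELM code): MMLU-style evaluation grades
# multiple-choice answers per task, then macro-averages across tasks.
from collections import defaultdict

def mmlu_macro_accuracy(results):
    """results: list of (task_name, predicted_choice, correct_choice).
    Returns (per-task accuracy dict, macro average across tasks)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in results:
        total[task] += 1
        correct[task] += int(pred == gold)
    per_task = {t: correct[t] / total[t] for t in total}
    macro = sum(per_task.values()) / len(per_task)
    return per_task, macro

# Hypothetical toy answers for two of the 57 subject tasks:
results = [
    ("us_history", "A", "A"),
    ("us_history", "B", "C"),
    ("computer_science", "D", "D"),
    ("computer_science", "D", "D"),
]
per_task, macro = mmlu_macro_accuracy(results)
print(per_task)  # {'us_history': 0.5, 'computer_science': 1.0}
print(macro)     # 0.75
```

Macro-averaging (rather than pooling all questions) keeps small subject areas from being drowned out by large ones.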
Qwen2 is an open-source large language model series developed by Alibaba and released on June 6 this year. The series includes pre-trained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. It is trained on data in 27 languages beyond English and Chinese, and Qwen2-7B-Instruct and Qwen2-72B-Instruct support context lengths of up to 128K tokens.
Yi Large is a closed-source large model developed by Zero One Everything. The Yi model series is built on 6B and 34B pre-trained language models, later extended into chat models, 200K long-context models, depth-upgraded models, and vision-language models. The company claims that it "outperforms leading models such as GPT-4 and Claude 3 Opus in key benchmark scores."