Tsinghua University and Zhejiang University launch open-source alternatives to GPT-4V! Open-source visual models such as LLaVA and CogAgent surge in popularity

Recently, top universities such as Tsinghua University and Zhejiang University have released a series of high-performance open-source visual models as alternatives to GPT-4V. Among them, LLaVA, CogAgent, and BakLLaVA are three highly regarded open-source visual language models.

LLaVA is an end-to-end trained multimodal large model that combines a vision encoder with Vicuna for general-purpose visual and language understanding, with impressive chat capabilities. CogAgent is an open-source visual language model that improves on CogVLM, with 11 billion visual parameters and 7 billion language parameters.

In addition, BakLLaVA, a Mistral 7B base model enhanced with the LLaVA 1.5 architecture, already outperforms Llama 2 13B on several benchmarks. These three open-source vision models show great potential in the field of visual processing.

LLaVA demonstrated capabilities close to GPT-4 level in visual chat and reasoning question answering. In visual chat, LLaVA scored 85.1% relative to GPT-4; in reasoning question answering (Science QA), it reached a new SoTA of 92.53%, exceeding GPT-4. LLaVA answers questions in a comprehensive and logical manner and can generate responses in JSON format.

It can not only extract information from images and answer questions, but also convert image content into JSON format. LLaVA can also recognize CAPTCHAs, identify the varieties of objects in a picture, and so on, demonstrating strong multimodal capabilities. With performance close to GPT-4, LLaVA is far more cost-effective: training can be completed in about one day on just 8 A100 GPUs.
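To make the JSON-output capability concrete, here is a minimal sketch of querying LLaVA through the Hugging Face transformers library. The checkpoint name ("llava-hf/llava-1.5-7b-hf" is the community conversion), the image file, and the prompt wording are illustrative assumptions, not the authors' official demo code.

```python
# Minimal sketch: ask LLaVA 1.5 to describe an image as JSON.
# Assumes the community "llava-hf/llava-1.5-7b-hf" checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # any local image, for illustration
# LLaVA 1.5 uses a "USER: <image>\n... ASSISTANT:" chat template.
prompt = (
    "USER: <image>\nList every object in the picture as a JSON array of "
    '{"name": ..., "count": ...} entries. ASSISTANT:'
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern covers the other capabilities mentioned above (CAPTCHA reading, object identification); only the prompt changes.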

CogAgent, an improved open-source visual language model based on CogVLM, offers more features and performance advantages. It supports higher-resolution visual input and dialogue-based question answering, and is capable of handling ultra-high-resolution image input.


Paper address: https://arxiv.org/pdf/2312.08914.pdf

CogAgent also provides visual-agent capabilities: for any given task, it can return a plan, the next action, and specific operations with coordinates. It has also been enhanced with GUI-related question-answering capabilities, handling questions about screenshots of any GUI, whether from a web page, PC application, or mobile app. In addition, CogAgent's OCR-related abilities have been strengthened through improved pre-training and fine-tuning. These enhancements enable CogAgent to achieve state-of-the-art generalist performance on multiple benchmarks.
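As a sketch of how the agent capability can be invoked, the snippet below follows the usage pattern published with the CogAgent checkpoint, a trust_remote_code model whose custom build_conversation_input_ids helper prepares the multimodal inputs. The checkpoint and tokenizer names, the screenshot path, the query, and the "(with grounding)" suffix (which asks for coordinates) are assumptions taken from the repository's demos; treat this as illustrative rather than canonical.

```python
# Minimal sketch: ask CogAgent for a GUI action plan with coordinates.
# Assumes the "THUDM/cogagent-chat-hf" checkpoint and its remote-code helpers.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-chat-hf",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

screenshot = Image.open("screenshot.png").convert("RGB")
# "(with grounding)" requests bounding-box coordinates in the answer.
query = "What steps do I need to take to search for CogAgent?(with grounding)"

# Custom helper shipped with the checkpoint's remote code.
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[screenshot]
)
gen_inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to(model.device),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(model.device),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to(model.device),
    "images": [[inputs["images"][0].to(model.device).to(torch.bfloat16)]],
    # cross_images feeds CogAgent's high-resolution cross-attention branch.
    "cross_images": [[inputs["cross_images"][0].to(model.device).to(torch.bfloat16)]],
}
with torch.no_grad():
    out = model.generate(**gen_inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```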

BakLLaVA is a Mistral 7B base model enhanced with the LLaVA 1.5 architecture, aimed at better performance and commercial viability. BakLLaVA outperforms Llama 2 13B on several benchmarks and supports fine-tuning and inference. Because BakLLaVA-1 uses LLaVA's corpus during training, it cannot be used commercially; BakLLaVA-2 will use a larger dataset and a newer architecture that goes beyond the current LLaVA approach and permits commercial use.
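Since BakLLaVA keeps the LLaVA 1.5 architecture, it runs with the same tooling as LLaVA itself. A minimal sketch, assuming the community "llava-hf/bakLlava-v1-hf" conversion on Hugging Face:

```python
# Minimal sketch: run BakLLaVA inference via the image-to-text pipeline.
# The checkpoint name and image path are assumptions for illustration.
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/bakLlava-v1-hf")
result = pipe(
    "photo.jpg",
    prompt="USER: <image>\nWhat is shown in this image? ASSISTANT:",
    generate_kwargs={"max_new_tokens": 128},
)
print(result[0]["generated_text"])
```

Swapping in a different LLaVA-family checkpoint requires no code changes beyond the model name, which is the practical upside of BakLLaVA reusing the LLaVA 1.5 architecture.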
