-
Google Releases Multimodal Live Streaming API: Seeing, Hearing, and Speaking Open Up a New Experience in AI Audio and Video Interaction
Google released Gemini 2.0 yesterday along with a new Multimodal Live API to help developers build apps with real-time audio and video streaming. The API supports low-latency, bidirectional interaction: it accepts text, audio, and video input and returns audio and text output, which makes conversations feel more natural and fluid. Users can interrupt the model at any time, and can share camera input or screen recordings to ask questions about what is shown. The model's video comprehension feature extends the communication model...
-
Samsung's multimodal AI model Gauss 2 debuts to empower the Galaxy Intelligent Ecosystem
In a blog post today (October 21), Samsung announced the launch of its second-generation generative AI model, Samsung Gauss2, at a developer conference in South Korea. The multimodal language model can process multiple data types, such as text, code, and images, simultaneously, with a significant boost in performance and efficiency. Gauss2 comes in three sizes (Compact, Balanced, and Supreme) to suit different computing environments and application scenarios, which IT Home briefly summarizes as follows...
-
Mistral Releases Pixtral Large Multimodal AI Model: Tops Complex Math Reasoning, Diagram/Document Reasoning Over GPT-4o
Nov. 19 - Yesterday (Nov. 18), Mistral AI announced Pixtral Large, a new multimodal AI model with 124 billion parameters. It is built on Mistral Large 2 and designed primarily for processing text and images. Pixtral Large is now available under the Mistral Research License and Commercial License for research, education, and commercial use. Pixtral Large is a Mistral ...
-
The open source multimodal behemoth is here! Meta will launch the Llama 3 405B model on July 23
Meta is about to make a big move again. It is preparing to release an open source language model called Llama 3 405B, which is not only its largest model to date but also the largest open source language model in history. With 405 billion parameters, the model can move between images and text, breaking with earlier models that could only process text. Key points: Meta will release Llama 3 405B, a multimodal model with 405 billion parameters, on July 23. The decision to open source Llama 3 405B and its weights may completely…
-
Google launches multimodal VLOGGER AI: making static portraits move and "talk"
Google recently published a blog post on its GitHub page introducing the VLOGGER AI model. Users only need to supply a portrait photo and an audio clip, and the model animates the person in the photo so that they appear to speak the audio with matching facial expressions. VLOGGER AI is a multimodal diffusion model for virtual portraits. It is trained on the MENTOR dataset, which contains more than 800,000 portraits and more than 2,200 hours of video, allowing VLOGGER to generate different...
-
Back in the game! Gemini-Pro's multimodal capabilities are on par with GPT-4V
A recent Gemini-Pro evaluation report shows that the model has made significant progress in the multimodal field: it is comparable to GPT-4V overall and better in some respects. First, on the dedicated multimodal benchmark MME, Gemini-Pro surpassed GPT-4V with a score of 1933.4, showing an overall advantage in perception and cognition. Across the 37 visual understanding tasks, Gemini-Pro performed especially well on text translation, color/landmark/person recognition, and OCR, demonstrating strong basic perception abilities. …
-
Gemini: An AI assistant developed by Google for writing, planning, and learning
Gemini is a new generation of artificial intelligence system launched by Google DeepMind. It is capable of multimodal reasoning and supports seamless interaction across text, images, video, audio, and code. Gemini has surpassed the previous state of the art in many areas, such as language comprehension, reasoning, math, and programming, making it one of the most powerful AI systems to date. Gemini is available in three sizes to cover needs ranging from edge computing to the cloud, and can be used in applications including creative design, writing assistance, question answering, code generation, and more. Gemini Features Writing Assistant: Ge...