Google Research has published ScreenAI, marking another important development in the language- and voice-based control of computer interfaces. The model not only comprehends user interfaces and infographics, but also sets new performance benchmarks on multiple tasks such as answering infographic-based questions, summarizing content, and navigating user interfaces.
ScreenAI's core innovation lies in its textual representation of screenshots: the model identifies the type and location of UI elements on screen. Using synthetic training data generated with Google's LLM PaLM 2-S, it learns to answer questions about on-screen information, navigate screens, and summarize screen content.
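The article does not spell out the exact annotation syntax of this textual screen representation, so the snippet below is only a hypothetical sketch of what serializing detected UI elements into text might look like; the element types, labels, coordinates, and function names are invented for illustration.

```python
# Hypothetical sketch of a textual "screen schema" for one screenshot.
# Element types, labels, and bounding-box coordinates are invented;
# ScreenAI's actual annotation format may differ.

ui_elements = [
    {"type": "IMAGE",  "label": "product photo",  "bbox": (0, 120, 1080, 720)},
    {"type": "TEXT",   "label": "Wireless Mouse", "bbox": (40, 860, 600, 910)},
    {"type": "BUTTON", "label": "Add to cart",    "bbox": (40, 960, 400, 1040)},
]

def to_screen_schema(elements):
    """Serialize detected UI elements into a flat text string that an
    LLM (such as PaLM 2-S in the pipeline described above) could
    condition on when generating synthetic question-answer pairs."""
    parts = []
    for e in elements:
        x0, y0, x1, y1 = e["bbox"]
        parts.append(f'{e["type"]} "{e["label"]}" at ({x0},{y0},{x1},{y1})')
    return "; ".join(parts)

print(to_screen_schema(ui_elements))
# IMAGE "product photo" at (0,120,1080,720); TEXT "Wireless Mouse" at ...
```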
To achieve this, ScreenAI combines Google's earlier technical advances, such as the PaLI architecture and Pix2Struct's flexible patching mechanism, which divides images into variable grids based on their aspect ratio. ScreenAI processes image and text inputs through an image encoder and a multimodal encoder, and then uses an autoregressive decoder to generate text output.
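The precise grid-selection rule behind this flexible patching is not given in the article; the function below is a simplified, hypothetical sketch of aspect-ratio-preserving patching, with the patch budget and patch size chosen arbitrarily for illustration.

```python
import math

def variable_grid(width, height, max_patches=196, patch_size=16):
    """Simplified sketch of Pix2Struct-style flexible patching: pick a
    rows x cols grid that roughly preserves the image's aspect ratio
    while staying within a fixed patch budget, instead of resizing
    every screenshot to the same square shape."""
    aspect = width / height
    rows = max(1, int(math.sqrt(max_patches / aspect)))
    cols = max(1, int(rows * aspect))
    return rows, cols, (cols * patch_size, rows * patch_size)

# A tall phone screenshot keeps more vertical patches than a wide
# desktop screenshot, rather than both being squashed into a square.
print(variable_grid(1080, 2400))  # (20, 9, (144, 320))
print(variable_grid(2560, 1440))  # (10, 17, (272, 160))
```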
Experiments conducted by the researchers show that performance improves as model size increases, suggesting that further gains are possible by scaling up the model. On various benchmarks, ScreenAI outperforms models of similar size and often even larger ones. Additionally, using optical character recognition (OCR) to extract text content from screenshots has a slight positive impact on model performance.
However, despite ScreenAI's milestone in digital content understanding, the model cannot yet carry out actions on its own. The researchers noted that while some language models already run on smartphones, more powerful multimodal models that combine text, images, audio, and video are still lacking. They predict that with the development of models like ScreenAI, automated operation of smartphones and user interfaces using natural language alone will become considerably more advanced in the near future.
The researchers stressed that while their specialized model achieves leading results, further research is still needed on some tasks to narrow the gap with larger models such as GPT-4 and Gemini. To encourage further development, Google Research plans to release evaluation datasets for ScreenAI: ScreenQA already provides 86,000 question-answer pairs across 36,000 screenshots, and more complex variants, along with collections pairing screenshots with textual descriptions, are to follow soon.
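The exact schema of the released ScreenQA data is not described in this article; the record below is a purely hypothetical illustration of how a screenshot-grounded question-answer pair might be organized, with the field names and file path invented.

```python
# Hypothetical illustration of a single ScreenQA-style example; the
# actual field names and file layout of the released dataset may differ.
example = {
    "screenshot": "screenshots/settings_wifi_0042.png",  # invented path
    "question": "Which Wi-Fi network is currently connected?",
    "answer": "HomeNetwork-5G",
}

def format_prompt(ex):
    """Pair the question with its screenshot reference so a multimodal
    model can be queried about on-screen content."""
    return f'Image: {ex["screenshot"]}\nQ: {ex["question"]}'

print(format_prompt(example))
```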