Meta releases Sapiens visual model to enable AI to analyze and understand human actions in images/videos

Meta Reality Labs has recently released Sapiens, a new family of AI vision models covering four fundamental human-centric vision tasks: 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction.


The models range from 300 million to 2 billion parameters. They adopt a vision transformer architecture in which all tasks share the same encoder, while each task has its own decoder head.
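The shared-encoder, per-task-head design can be illustrated with a minimal numpy sketch. This is not Meta's implementation: the dimensions, token counts, and head output sizes below are made-up stand-ins chosen only to show how one encoder pass feeds several independent task heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions for illustration only; the real Sapiens
# models have 0.3B-2B parameters.
EMBED_DIM = 16   # shared feature dimension
NUM_TOKENS = 8   # patch tokens per image

# A single shared "encoder": here just one linear projection standing
# in for the full vision transformer.
W_encoder = rng.standard_normal((EMBED_DIM, EMBED_DIM))

def encode(tokens):
    """Map patch tokens into the shared feature space."""
    return tokens @ W_encoder

# One lightweight decoder head per task, each with its own weights,
# all consuming the same encoder output. Output sizes are illustrative.
HEAD_DIMS = {
    "pose": 17 * 2,      # e.g. 17 keypoints, (x, y) each
    "segmentation": 28,  # e.g. 28 body-part classes per token
    "depth": 1,          # scalar depth per token
    "normals": 3,        # (nx, ny, nz) per token
}
heads = {name: rng.standard_normal((EMBED_DIM, dim))
         for name, dim in HEAD_DIMS.items()}

tokens = rng.standard_normal((NUM_TOKENS, EMBED_DIM))
features = encode(tokens)  # the expensive encoder runs once
outputs = {name: features @ W for name, W in heads.items()}

for name, out in outputs.items():
    print(name, out.shape)
```

Because the encoder is shared, its features are computed once and reused; swapping tasks only swaps the small decoder head.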

  • 2D Pose Estimation: This task detects and localizes key points of a human body in a 2D image. These key points usually correspond to joints such as elbows, knees, and shoulders, and help in understanding a person’s posture and movements.
  • Body Part Segmentation: This task segments an image into different body parts, such as the head, torso, arms, and legs. Each pixel in the image is classified as belonging to a specific body part, which is useful for applications such as virtual try-on and medical imaging.
  • Depth Estimation: This task estimates the distance of each pixel in the image from the camera, effectively recovering 3D structure from a 2D image. This is crucial for applications such as augmented reality and autonomous driving, where understanding the layout of a space is important.
  • Surface Normal Prediction: This task predicts the orientation of surfaces in an image. Each pixel is assigned a normal vector indicating which direction the surface is facing, which is very valuable for 3D reconstruction and for understanding the geometry of objects in the scene.
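The last two tasks are geometrically linked: if a depth map is treated as a height field, per-pixel surface normals can be derived from its spatial gradients. The sketch below shows that standard relationship with numpy finite differences; it is a generic illustration of the two tasks' connection, not code from the Sapiens models, which predict normals directly.

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel surface normals from a depth map via finite differences.

    Treating depth as a height field z = f(x, y), the unnormalized
    normal at each pixel is (-dz/dx, -dz/dy, 1), then scaled to unit
    length.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(float))  # gradients along rows, cols
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth, dtype=float)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    return n

# A planar ramp whose depth increases by 1 per pixel along x:
# every pixel should get the same tilted normal.
depth = np.tile(np.arange(5.0), (4, 1))
n = normals_from_depth(depth)
print(n[0, 0])  # approximately [-0.707, 0., 0.707]
```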


Meta said the models natively support inference at 1K resolution and, because they are pre-trained on more than 300 million in-the-wild human images, are very easy to adapt to individual tasks.

Even when labeled data is scarce or entirely synthetic, the resulting models show excellent generalization to in-the-wild data.
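One reason task adaptation can stay cheap with a strong pre-trained encoder is that, if the encoder is kept frozen, fitting a linear task head reduces to simple regression on fixed features. The sketch below makes that point with synthetic numpy data; the feature matrix and labels are invented for illustration, and real Sapiens fine-tuning of course trains full decoder heads, not a closed-form linear fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for frozen pre-trained encoder features of 100 images
# (purely synthetic; real features would come from the shared ViT).
features = rng.standard_normal((100, 16))

# Synthetic per-image labels for a small 3-channel task, generated
# from a hidden linear map plus a little noise.
true_head = rng.standard_normal((16, 3))
labels = features @ true_head + 0.01 * rng.standard_normal((100, 3))

# With the encoder frozen, "fine-tuning" a linear head is ordinary
# least-squares regression, solvable in closed form.
head, *_ = np.linalg.lstsq(features, labels, rcond=None)

pred = features @ head
mse = np.mean((pred - labels) ** 2)
print(f"training MSE: {mse:.6f}")
```

Even this trivial setup recovers the task mapping almost exactly, which is the intuition behind adapting one strong shared backbone to many downstream tasks with little labeled data.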
