Wuhan University and China Mobile's Jiutian AI team jointly open-sourced the audio and video speaker recognition dataset VoxBlink2

Wuhan University, China Mobile's Jiutian AI team, and Duke Kunshan University have jointly open-sourced VoxBlink2, an audio-visual speaker recognition dataset of more than 110,000 hours mined from YouTube. The dataset contains 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 YouTube users, making it currently the largest publicly available audio-visual speaker recognition dataset. It was released to enrich the open-source speech corpus and to support the training of large voiceprint models.


The VoxBlink2 dataset was mined through the following steps:

Candidate preparation: Collect multilingual keyword lists, retrieve user videos, and select the first minute of video for processing.

Face extraction and detection: Extract video frames at a high frame rate and use MobileNet to detect faces, keeping only video tracks that contain a single speaker.

Face recognition: A pre-trained face recognizer verifies each frame to ensure that the audio and video clips come from the same person.

Active speaker detection: A multimodal active speaker detector takes lip-movement sequences and audio as input and outputs the speech segments, and an overlap check removes segments containing multiple speakers.
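The per-track filtering in the extraction and recognition steps above can be sketched as follows. The function names, the 0.7 cosine-similarity threshold, and the 0.8 pass ratio are illustrative assumptions, not values from the VoxBlink2 paper:

```python
# Sketch: keep a video track only if every frame shows exactly one face
# and most frames verify against a reference face embedding.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def keep_track(frames, ref_embedding, sim_thresh=0.7, pass_ratio=0.8):
    """frames: list of per-frame face-embedding lists (one inner list per frame).
    Reject the track if any frame has zero or multiple faces; otherwise
    require most frames to verify against the reference embedding."""
    if any(len(faces) != 1 for faces in frames):
        return False  # multi-face or faceless frame -> not a single-speaker track
    hits = sum(cosine(faces[0], ref_embedding) >= sim_thresh for faces in frames)
    return hits / len(frames) >= pass_ratio
```

In a real pipeline the embeddings would come from the face recognizer and the per-frame detections from MobileNet; here they are plain vectors so the filtering logic stands on its own.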

To further improve data accuracy, a bypass step using an in-set face recognizer was also introduced: through coarse face extraction, face verification, face sampling, and training, accuracy improved from 72% to 92%.
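The bypass step follows a self-refinement pattern: estimate a rough model from noisy samples, keep only the samples that verify against it, then re-estimate. A toy sketch of that pattern using nearest-centroid "recognition" on 2D points; the data, the distance radius, and the single refinement pass are made-up assumptions:

```python
# Toy self-refinement sketch: coarse model -> verification filter -> retrain.
def centroid(points):
    """Mean vector of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist(a, b):
    """Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def refine(points, radius):
    c = centroid(points)                                 # rough model from all (noisy) samples
    kept = [p for p in points if dist(p, c) <= radius]   # verification: drop outliers
    return centroid(kept)                                # retrain on confident samples
```

With four inlier points near the origin and one far outlier, the refined centroid moves much closer to the true cluster center, which is the effect the in-set recognizer bypass aims for at scale.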

VoxBlink2 also open-sources voiceprint models of different sizes, including 2D convolutional models based on ResNet, a temporal model based on ECAPA-TDNN, and a very large ResNet293 model with a Simple Attention Module. With post-processing, these models achieve an EER of 0.17% and a minDCF of 0.006% on the Vox1-O test set.
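For context on the reported metrics: the EER is the operating point where the false-accept rate equals the false-reject rate over verification trials. A minimal sketch of computing it from trial scores (the scores and labels below are toy values, not VoxBlink2 results):

```python
# Sketch of EER computation: sweep the decision threshold over the sorted
# scores and find where false-accept and false-reject rates meet.
def eer(scores, labels):
    """labels: 1 = target trial (same speaker), 0 = impostor trial."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_tar = sum(labels)
    n_imp = len(labels) - n_tar
    fa, fr = 0, n_tar            # threshold above all scores: reject everything
    best = 1.0
    for _, lab in pairs:
        if lab == 1:
            fr -= 1              # lowering threshold past a target fixes a false reject
        else:
            fa += 1              # ... past an impostor adds a false accept
        best = min(best, max(fa / n_imp, fr / n_tar))
    return best
```

The minDCF reported alongside it weights misses and false alarms by application-dependent costs and a target prior, then takes the minimum over thresholds; the sweep structure is the same.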

Dataset website: https://VoxBlink2.github.io

How to download the dataset: https://github.com/VoxBlink2/ScriptsForVoxBlink2

Meta files and models: https://drive.google.com/drive/folders/1lzumPsnl5yEaMP9g2bFbSKINLZ-QRJVP

Paper address: https://arxiv.org/abs/2407.11510
