Wuhan University, together with China Mobile's Jiutian AI team and Duke Kunshan University, has open-sourced VoxBlink2, an audio-visual speaker recognition dataset of more than 110,000 hours mined from YouTube. The dataset contains 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 YouTube users, making it currently the largest publicly available audio-visual speaker recognition dataset. Its release aims to enrich the open-source speech corpus landscape and support the training of large voiceprint models.
The VoxBlink2 dataset is mined through the following steps:
Candidate preparation: Collect multilingual keyword lists, retrieve user videos, and select the first minute of each video for processing.
Face extraction & detection: Extract video frames at a high frame rate and use MobileNet to detect faces, ensuring that the video track contains only a single speaker.
Face recognition: A pre-trained face recognizer checks each frame to ensure that the audio and video clips come from the same person.
Active speaker detection: Using lip-movement sequences together with the audio, a multimodal active speaker detector outputs the utterance segments, and overlapped-speech detection removes segments that contain multiple speakers (a simplified sketch of this filtering pipeline follows the list).
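The per-video filtering logic can be pictured roughly as below. This is only an illustrative sketch, not the authors' code: `detect_faces`, `face_similarity`, `run_asd`, the `Frame`/`Segment` containers, and the similarity threshold are hypothetical stand-ins for the MobileNet face detector, the pre-trained face recognizer, and the multimodal active speaker detector described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Frame:
    t: float        # timestamp in seconds
    image: object   # decoded video frame

@dataclass
class Segment:
    start: float    # utterance start, seconds
    end: float      # utterance end, seconds

def filter_video(frames: Sequence[Frame],
                 audio: object,
                 detect_faces: Callable,      # stand-in for the MobileNet face detector
                 face_similarity: Callable,   # stand-in for the pre-trained face recognizer
                 run_asd: Callable,           # stand-in for the multimodal active speaker detector
                 owner_template: object,      # reference face of the channel owner
                 sim_threshold: float = 0.6) -> List[Segment]:
    """Keep only segments where exactly one face appears in every frame, that face
    matches the channel owner, and the owner is actively speaking."""
    kept = []
    for seg in run_asd(frames, audio):                        # candidate utterance segments
        seg_frames = [f for f in frames if seg.start <= f.t <= seg.end]
        faces = [detect_faces(f.image) for f in seg_frames]
        if any(len(fs) != 1 for fs in faces):                 # single-speaker constraint
            continue
        sims = [face_similarity(fs[0], owner_template) for fs in faces]
        if min(sims) < sim_threshold:                         # same-identity constraint
            continue
        kept.append(seg)
    return kept
```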
To further improve data accuracy, a bypass step that builds an in-set face recognizer was also introduced: through coarse face extraction, face verification, face sampling, and training, labeling accuracy improved from 72% to 92%.
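One way to read this bypass step is as restricting verification to the in-set identities and re-checking every candidate clip against them. The sketch below is an assumption about how such a check could look (one cosine prototype per user); the actual VoxBlink2 recognizer is a trained model, and all names and the threshold here are illustrative.

```python
import numpy as np

def fit_inset_recognizer(embeddings: np.ndarray, user_ids: np.ndarray) -> dict:
    """Build one length-normalized prototype per user from sampled face embeddings."""
    prototypes = {}
    for uid in np.unique(user_ids):
        proto = embeddings[user_ids == uid].mean(axis=0)
        prototypes[uid] = proto / np.linalg.norm(proto)
    return prototypes

def verify_clip(clip_embedding: np.ndarray, claimed_uid,
                prototypes: dict, threshold: float = 0.5) -> bool:
    """Accept a clip only if its face embedding is close enough to the claimed user."""
    emb = clip_embedding / np.linalg.norm(clip_embedding)
    return float(emb @ prototypes[claimed_uid]) >= threshold

# Toy usage with random vectors standing in for real face embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(10, 128))
uids = np.array(["user_a"] * 5 + ["user_b"] * 5)
protos = fit_inset_recognizer(embs, uids)
print(verify_clip(embs[0], "user_a", protos))
```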
VoxBlink2 also open-sources voiceprint models of different sizes, including 2D convolutional models based on ResNet, a temporal model based on ECAPA-TDNN, and a very large ResNet293 model equipped with a Simple Attention Module. With post-processing, these models achieve an EER of 0.17% and a minDCF of 0.006% on the Vox1-O test set.
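For context, EER and minDCF are the standard speaker-verification metrics behind those numbers. A minimal, generic implementation (not the authors' evaluation code), run here on synthetic scores rather than real Vox1-O trial pairs, might look like this:

```python
import numpy as np

def eer_and_mindcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """scores: similarity per trial; labels: 1 = same speaker, 0 = different speaker."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    # false negative rate: target trials rejected; false positive rate: non-target trials accepted
    fnr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    eer_idx = np.argmin(np.abs(fnr - fpr))
    eer = (fnr[eer_idx] + fpr[eer_idx]) / 2
    # normalized detection cost function, minimized over thresholds
    dcf = c_miss * p_target * fnr + c_fa * (1 - p_target) * fpr
    dcf_norm = dcf / min(c_miss * p_target, c_fa * (1 - p_target))
    return eer, dcf_norm.min()

# Toy usage: well-separated synthetic target and non-target score distributions.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 1000), rng.normal(-1.0, 0.5, 1000)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(eer_and_mindcf(scores, labels))
```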
Dataset website: https://VoxBlink2.github.io
How to download the dataset: https://github.com/VoxBlink2/ScriptsForVoxBlink2
Meta files and models: https://drive.google.com/drive/folders/1lzumPsnl5yEaMP9g2bFbSKINLZ-QRJVP
Paper address: https://arxiv.org/abs/2407.11510