This work was supported in part by the National Key R&D Program of China (Grant No. 2021YFC3300403), the National Natural Science Foundation (Grant No. 62072382), the National Science Foundation (OAC-2007661), and the Yango Charitable Foundation.
SingingFace contains over 600 Chinese and English singing videos of 6 human subjects, totaling 40 hours at 30 FPS. Each video has a stable camera position and appropriate lighting conditions. We built the dataset by recording the singing videos ourselves. Specifically, we first collected the singing audio set, then recorded the face region of each subject singing the song while the music played. Finally, we automatically aligned each video to its corresponding music audio using SyncNet to ensure audio-visual synchronization. The following are some sample videos.
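The alignment step can be illustrated with a small sketch. It is not the pipeline used to build SingingFace. It only shows how an audio-visual offset measured in video frames (the kind of value a SyncNet-style estimator reports) could be converted to an audio shift at the dataset's 30 FPS. The function names, the 16 kHz sample rate, and the sign convention (a positive offset delays the audio) are all assumptions made for illustration.

```python
from typing import List, Tuple


def offset_to_audio_shift(offset_frames: int,
                          fps: float = 30.0,
                          sample_rate: int = 16000) -> Tuple[float, int]:
    """Convert an AV offset in video frames to (seconds, audio samples).

    fps=30.0 matches the dataset; sample_rate=16000 is an assumed value.
    """
    seconds = offset_frames / fps
    samples = round(seconds * sample_rate)
    return seconds, samples


def shift_audio(samples: List[float], shift: int) -> List[float]:
    """Shift a mono track by `shift` samples, padding with silence.

    Assumed convention: shift > 0 delays the audio, shift < 0 advances it.
    The output always has the same length as the input.
    """
    n = len(samples)
    shift = max(-n, min(n, shift))  # clamp so slicing stays well defined
    if shift > 0:
        return [0.0] * shift + samples[:n - shift]
    if shift < 0:
        return samples[-shift:] + [0.0] * (-shift)
    return list(samples)


if __name__ == "__main__":
    # Hypothetical offset of 3 frames, as an estimator might report.
    seconds, n_samples = offset_to_audio_shift(3)
    print(f"shift audio by {seconds:.3f} s ({n_samples} samples)")
```

In practice the shift would be applied to the real audio track (for example with an audio library or FFmpeg) rather than to a Python list; the list version only keeps the sketch dependency-free.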
The dataset files are structured as follows:
For the download link of the full dataset, please contact zengming@xmu.edu.cn.