SingingFace

This work was supported in part by National Key R&D Program of China (Grant No. 2021YFC3300403), National Natural Science Foundation (Grant No. 62072382), National Science Foundation (OAC-2007661), and Yango Charitable Foundation.

SingingFace collects over 600 Chinese and English singing videos of 6 human subjects, totaling 40 hours and recorded at 30 FPS. Each video has a stable camera position and suitable lighting. We built the dataset by recording the singing videos ourselves. Specifically, we first collect a set of singing audio; we then record the face region of each subject singing while the song is played back. Finally, we automatically align each video with its music audio using SyncNet to ensure audio-visual synchronization. The following are some sample videos.
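Applying an audio-visual offset such as the one SyncNet estimates requires converting a shift measured in video frames into audio samples. The sketch below is not the authors' alignment code. It assumes a mono waveform, an offset given in whole video frames, and a hypothetical sign convention in which a positive offset means the audio lags the video; `frames_to_samples` and `shift_audio` are illustrative names.

```python
import numpy as np

FPS = 30  # frame rate stated for SingingFace


def frames_to_samples(n_frames: int, sample_rate: int, fps: int = FPS) -> int:
    """Convert a number of video frames to the equivalent number of audio samples."""
    return round(n_frames * sample_rate / fps)


def shift_audio(audio: np.ndarray, offset_frames: int, sample_rate: int,
                fps: int = FPS) -> np.ndarray:
    """Shift a mono waveform by a frame offset, keeping its original length.

    Hypothetical convention: a positive offset means the audio lags the video,
    so samples are dropped from the start and the end is zero-padded; a
    negative offset pads the start and drops samples from the end.
    """
    shift = min(frames_to_samples(abs(offset_frames), sample_rate, fps), len(audio))
    if shift == 0:
        return audio.copy()
    pad = np.zeros(shift, dtype=audio.dtype)
    if offset_frames > 0:
        return np.concatenate([audio[shift:], pad])
    return np.concatenate([pad, audio[:len(audio) - shift]])
```

At 30 FPS and 16 kHz one frame spans about 533 samples, so a 2-frame offset corresponds to `frames_to_samples(2, 16000)`, i.e. 1067 samples after rounding.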

Rendered 3D singing face video.

Details of the dataset files.

The dataset files are structured as follows:

Video0
- info.json: a JSON file containing the video's metadata, including the frame rate (fps) and the number of frames.
- coeff.npy: a NumPy array file containing the 3DMM coefficients of each frame, reconstructed using Deep3DFaceRecon_pytorch.
- fitted_pose.npy: a NumPy array file containing the head pose parameters, refitted using the refitting code from face3d.
- all.wav: the original music audio file.
- vocals.wav: the human vocal track, separated using spleeter.
- accompaniment.wav: the background music (accompaniment) track, separated using spleeter.
- openface.csv: the output of OpenFace, which includes the eye-blink Action Unit intensity (AU45_r).
Video…
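A minimal Python sketch for loading one video folder with the layout listed above. It assumes the `.npy` files load directly with `np.load` and that the OpenFace CSV has an `AU45_r` column (OpenFace headers may carry leading spaces, so they are stripped). The keys inside `info.json` and the array shapes are not specified here, so they are returned unchanged. `load_video_folder`, `load_blink_au` and `frame_counts` are hypothetical helper names, and the audio files are only located, not decoded, to avoid assuming a particular WAV sample format.

```python
import csv
import json
from pathlib import Path

import numpy as np


def load_blink_au(csv_path: Path) -> np.ndarray:
    """Read the AU45_r (eye-blink intensity) column from an OpenFace CSV."""
    with open(csv_path, newline="") as f:
        # Strip header whitespace: OpenFace columns may look like " AU45_r".
        rows = [{k.strip(): v for k, v in row.items()} for row in csv.DictReader(f)]
    return np.array([float(r["AU45_r"]) for r in rows], dtype=np.float32)


def load_video_folder(folder: str) -> dict:
    """Load the annotations of one video folder (e.g. Video0) without reshaping them."""
    root = Path(folder)
    with open(root / "info.json") as f:
        info = json.load(f)
    return {
        "info": info,
        "coeff": np.load(root / "coeff.npy"),        # 3DMM coefficients (Deep3DFaceRecon_pytorch)
        "pose": np.load(root / "fitted_pose.npy"),   # refitted head pose parameters (face3d)
        "blink": load_blink_au(root / "openface.csv"),  # AU45_r per CSV row
        # Paths only; decode them with whichever WAV reader suits the file format.
        "audio": {name: root / f"{name}.wav" for name in ("all", "vocals", "accompaniment")},
    }


def frame_counts(sample: dict) -> dict:
    """Length of each per-frame source, assuming axis 0 indexes frames (an assumption)."""
    return {key: len(sample[key]) for key in ("coeff", "pose", "blink")}
```

Comparing `frame_counts(sample)` with the frame count recorded in `info.json` is a quick consistency check before training.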

For the download link to the full dataset, please contact zengming@xmu.edu.cn.