Fast Diffusion Model for Singing Voice Beautifying

This page presents audio demos for DiffBeautifier: Fast Diffusion Model for High-Fidelity Singing Voice Beautifying.

Overview

Singing voice beautifying (SVB) is a novel task with wide practical use. It aims to correct the pitch of a singing voice and improve its expressiveness without changing the timbre or the content. The major challenge of SVB is that paired data of professional and amateur recordings of the same songs is hard to obtain, a problem this work addresses for the first time. In this paper, we propose DiffBeautifier, an efficient diffusion model for high-fidelity singing voice beautifying. Since no paired data is available, we adopt a diffusion model as the backbone and combine it with modified conditions to generate mel-spectrograms. We also reduce the number of sampling steps t by using generator-based methods. For automatic pitch correction, we establish a mapping from MIDI and the spectral envelope to pitch. To make amateur singing more expressive, we propose an expression enhancer that operates in the latent space and converts the amateur vocal tone to a professional one. Furthermore, we produced a 40-hour singing dataset that contains original song vocals and extremely amateurish samples to promote the development of SVB. DiffBeautifier achieves a state-of-the-art beautification effect on both English and Chinese songs. Our extensive ablation studies demonstrate that the expression enhancer and the generator-based methods in DiffBeautifier are effective.
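To make the pitch-correction target concrete, the sketch below shows the standard equal-temperament conversion from a MIDI note number to frequency, plus the deviation of a sung pitch from that note in cents. It is only an illustration: the function names and the hard "snap to the score note" step are assumptions made here, and they are not the learned MIDI and spectral-envelope to pitch mapping used by DiffBeautifier.

```python
import math

A4_MIDI = 69    # MIDI note number of A4
A4_HZ = 440.0   # standard concert-pitch tuning reference


def midi_to_hz(note: float) -> float:
    """Equal-temperament frequency of a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def cents_off(f0_hz: float, note: float) -> float:
    """Signed deviation of a sung pitch from the score note, in cents."""
    return 1200.0 * math.log2(f0_hz / midi_to_hz(note))


def snap_contour(f0_contour, notes):
    """Naive baseline (hypothetical): replace every voiced frame with the
    score note frequency and keep unvoiced frames (f0 == 0) untouched."""
    if len(f0_contour) != len(notes):
        raise ValueError("f0_contour and notes must have the same length")
    return [midi_to_hz(n) if f0 > 0 else 0.0
            for f0, n in zip(f0_contour, notes)]
```

Hard snapping of this kind removes vibrato and note transitions, which is one reason a learned pitch predictor conditioned on more than the score is attractive for this task.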

Model Architecture

Figure 1. The overall architecture of DiffBeautifier.

Singing Audio Samples

Four systems are compared in total: 1) GTMel, in an amateur (A) and a professional (P) version (counted as two systems), where we first convert the ground-truth audio into mel-spectrograms and then convert the mel-spectrograms back to audio with the vocoder. 2) Pitch Predictor, where we first use the MIDI of the original singer and the spectral envelope of the amateur singing to predict the pitch curve. The predicted pitch curve, the spectral envelope of the amateur singing voice, and the aperiodicity parameter of the amateur singing voice are then used to synthesize the audio through the WORLD vocoder. 3) DiffBeautifier, the model proposed in this paper. All four systems have a slight electrical sound because of our Griffin-Lim vocoder. Please pay more attention to the pitch and expressiveness of the songs.

Chinese

1.在我心中曾经有一个梦

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

2.再没有恨，也没有了痛

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

3.你总说毕业遥遥无期转眼就各奔东西

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

4.东边牧马，西边放羊

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

5.生命已被牵引潮落潮涨

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

6.明天你是否还惦记曾经最爱哭的你

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

7.野辣辣的情歌就唱到了天亮

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

8.用我们的歌换你真心笑容

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

English

9.Because when the sun shines, we’ll shine together. Told you I’ll be here forever

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

10.Baby cause in the dark, you can’t see shiny cars

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

11.Together we’ll mend your heart

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

12.I said: No one has to know what we do

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

13.Wildest dreams

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

14.That we can baby, we can change and feel alright

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

15.Standin’ in a nice dress Starin’ at the sunset, babe

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav

16.Said I’ll always be a friend, took an oath. I’ma stick it out till the end

	GT Amateur	GT Professional	Pitch Predictor	DiffBeautifier
wav