Audio Samples from

"Word-level Text Markup for Prosody Control in Speech Synthesis"

Yuliya Korotkova*‡, Ilya Kalinovskiy*†, Tatiana Vakhrusheva*‡

*Just AI, ‡Higher School of Economics, Russia, †National Research Tomsk Polytechnic University




Abstract

Modern Text-to-Speech (TTS) technologies generate speech that is very close to natural speech, but synthesized voices still lack variation in intonation, which is also hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling or linguistic expertise. We propose a method of encoding prosodic knowledge from the textual and acoustic modalities, obtained with models pretrained on self-supervised tasks, into a latent quantized space with interpretable features. The prosodic markup is constructed from these features and used as an additional input to the TTS model to address the one-to-many problem; the markup itself is predicted from text. The method also allows prosody control at inference time and scales to new data and other languages.



1. Comparison of different TTS architectures

These examples, taken from MOS tests, show the influence of the proposed markup on different TTS architectures. We compared FastSpeech2 and Tacotron2 in two variants:

Baseline: the prosodic markup is excluded from the input to the TTS model.
Prosody: the prosodic markup is included in the input to the TTS model.



English



Baseline Prosody

1: There were three men at my cabin door, besides the four within.
FastSpeech2
Tacotron2

2: It was they who formed the chief part of the small select group of spectator.
FastSpeech2
Tacotron2

3: Still I was in great trouble from the riotous and insolent behaviour of the boat's crew.
FastSpeech2
Tacotron2

4: And a truly widowed heart, my dear girl, does not easily match itself again.
FastSpeech2
Tacotron2

5: But less preoccupied by that pretty face than d'Artagnan, he had fancied he saw a second head, a man's head, inside the carriage.
FastSpeech2
Tacotron2



Russian



Baseline Prosody

1: Как ты смог столько выучить, если у тебя было всего два дня?
FastSpeech2
Tacotron2

2: Послушай, Володя, тебе ни разу не приходило в голову, что никогда, понимаешь, никогда двое людей не поймут вполне друг друга?
FastSpeech2
Tacotron2

3: Некоторые лошади были потеряны.
FastSpeech2
Tacotron2

4: К примеру, как-то понадобилось привезти туда бульдозер.
FastSpeech2
Tacotron2

5: Анна твоя сестра или подруга?
FastSpeech2
Tacotron2

6: Пока общество коченеет, частная жизнь делается все злее.
FastSpeech2
Tacotron2



Portuguese



Baseline Prosody

1: Temos o instinto de tudo.
FastSpeech2
Tacotron2

2: No entanto, essas palestras nunca explicam o que são as famosas memórias erradas.
FastSpeech2
Tacotron2

3: Tudo que ele empreendera — a expedicao, a salvacao do Endurance e duas tentativas de caminhar ate terra firme — fracassara miseravelmente.
FastSpeech2
Tacotron2

4: Ou talvez você tenha sido intenso demais — responde Theo, inclinando-se para a frente.
FastSpeech2
Tacotron2

5: Essa garrafa, porém, não estava rotulada como veneno.
FastSpeech2
Tacotron2


2. Controllability

These examples show how identical text can be pronounced differently when the prosodic markup is modified. The examples are synthesized with different speakers.
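The transcripts in this section attach a single digit to some words, for example `beautiful0` or `today4`. These digits appear to be the word-level prosodic tags. The following is a minimal Python sketch for splitting such a transcript into (word, tag) pairs. It assumes that the inline digit notation is only how this page renders the markup, not necessarily the format the TTS models consume; the names `TOKEN` and `parse_markup` are hypothetical and introduced here for illustration.

```python
import re
from typing import List, Optional, Tuple

# A word (Unicode letters, with inner apostrophes or hyphens) optionally
# followed by one digit. Assumption: a trailing digit is the word's prosodic
# tag, as the transcripts on this page suggest.
TOKEN = re.compile(r"([^\W\d_]+(?:['\-][^\W\d_]+)*)(\d)?")


def parse_markup(text: str) -> List[Tuple[str, Optional[int]]]:
    """Split a tagged transcript into (word, tag) pairs.

    Words without an attached digit get tag None; punctuation is dropped.
    """
    return [(word, int(tag) if tag else None) for word, tag in TOKEN.findall(text)]


if __name__ == "__main__":
    for word, tag in parse_markup("How beautiful0 you are today4!"):
        print(word, tag)
```

Untagged words are returned with `None` rather than a default value, because the page does not say which tag they receive in these examples.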



English



FastSpeech2 Tacotron2
 
1a: How beautiful0 you are today4!
 
1b: How beautiful you are9 today7!
 
2a: Do you know what0 time7 it is9?
 
2b: Do you0 know0 what time it is3?
 
3a: Where is the washroom0, do you know1?
 
3b: Where5 is the washroom3, do you know5?



Russian



FastSpeech2 Tacotron2
 
1a: Не подскажете3, сколько сейчас времени1?
 
1b: Не подскажете6, сколько7 сейчас времени4?
 
2a: А где7 находится уборная, вы случайно не0 знаете0?
 
2b: А где находится уборная, вы случайно7 не знаете1?
 
3a: Как чудесно0 ты сегодня выглядишь7!
 
3b: Как чудесно ты сегодня3 выглядишь2!



Portuguese



FastSpeech2 Tacotron2
 
1a: Como você está linda0 hoje1!
 
1b: Como você4 está4 linda hoje2!
 
2a: Você1 sabe2 que horas são0?
 
2b: Você sabe que7 horas5 são4?
 
3a: Onde fica o banheiro9, você sabe3?
 
3b: Onde8 fica5 o banheiro, você sabe7?


These examples show how different prosodic tags influence the intonation of a single word. The target word is highlighted in red in the text and enclosed between red dashed lines in the spectrograms. All other words are marked with neutral intonation.
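The setup described here (one target word swept over every tag while the remaining words stay neutral) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the tag range -1 to 9 is taken from the values listed on this page, while `sweep_target` and its `neutral` parameter are hypothetical, since the page does not state which tag value denotes neutral intonation.

```python
from typing import List, Tuple

# Tag values listed on this page for the target word: -1, 0, ..., 9.
TAGS = range(-1, 10)


def sweep_target(words: List[str], target: int, neutral: int) -> List[List[Tuple[str, int]]]:
    """Return one tagged copy of the sentence per tag value.

    The word at index `target` receives each tag in TAGS in turn; every other
    word receives `neutral`. Assumption: the caller knows which tag value
    means neutral intonation, because the page does not specify it.
    """
    if not 0 <= target < len(words):
        raise IndexError("target must index a word in the sentence")
    return [
        [(word, tag if i == target else neutral) for i, word in enumerate(words)]
        for tag in TAGS
    ]


if __name__ == "__main__":
    sentence = "We had been wandering".split()
    # neutral=0 is a placeholder value chosen only for this demonstration.
    for variant in sweep_target(sentence, target=3, neutral=0):
        print(variant)
```

Each returned variant would then be passed to the synthesizer in whatever markup format the model expects, which this sketch does not cover.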



English



We had been wandering, indeed, in the leafless shrubbery an hour in the morning.
FastSpeech2 Tacotron2
Prosodic tag on the target word: -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9



Russian



Правда, утром мы еще побродили часок по дорожкам облетевшего сада.
FastSpeech2 Tacotron2
Prosodic tag on the target word: -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9



Portuguese



Na verdade, pela manhã, tínhamos andado durante uma hora entre os arbustos desfolhados.
FastSpeech2 Tacotron2
Prosodic tag on the target word: -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9