
Overview

Controllable text-to-speech (TTS) aims to achieve flexible and accurate control when synthesizing speech across various domains. Recent work that uses natural language descriptions to control speech attributes has gained much attention. However, achieving accurate timbre control while using text to manipulate speech style poses significant challenges. In light of this, we propose VS-TTS, a multi-modal prompt speech synthesis system that allows user-friendly control over voice styles while maintaining speaker identity. Specifically, 1) we present a baseline model for the VS-TTS task and provide detailed descriptions of dataset preprocessing. 2) We employ a BERT-based text prompt encoder to extract a fixed-length hidden representation correlated with speaking style. For the speaker prompt, we leverage a multi-stream transformer encoder to learn diverse speaker attributes from multiple views and thus improve speaker similarity. 3) To improve style expressiveness and alleviate the one-to-many problem, we introduce a diffusion-based Variation Enhance Network that provides finer-grained variability information as a supplement for acoustic features not covered by natural language prompts. Our extensive evaluations on an audiobook dataset (LibriTTS) and multi-corpus emotional datasets demonstrate that VS-TTS outperforms baseline models in terms of style controllability and speaker similarity.
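The overview says the text prompt encoder produces a fixed-length representation from a prompt of any length, but it does not say how the sequence is reduced. As a hedged illustration only, the sketch below uses masked mean pooling, one common choice; it is an assumption, not the authors' method. The helper name `masked_mean_pool` and the toy two-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the vectors of real tokens (mask == 1) into one fixed-length vector.

    Assumption: mean pooling is used here only as an example of turning a
    variable-length sequence into a fixed-length vector.
    """
    if len(token_vectors) != len(mask):
        raise ValueError("token_vectors and mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy token vectors standing in for encoder outputs (hypothetical values).
short_prompt = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last position is padding
short_mask = [1, 1, 0]
long_prompt = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]
long_mask = [1, 1, 1, 1]

short_vec = masked_mean_pool(short_prompt, short_mask)
long_vec = masked_mean_pool(long_prompt, long_mask)
```

Both prompts, although they differ in length, map to vectors of the same dimensionality, which is the property a downstream acoustic model needs from a style embedding.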

Model Architecture

Figure 1. The overall architecture of VS-TTS.

 

Note: Because PromptTTS 2 controls speech style through text descriptions only, its synthesized speech has a randomly assigned speaker voice, so we do not list its results here.

1. LibriTTS seen case

Attribute Control

1.1 Volume

Volume-1: A speaker is speaking softly: His eyes, which are hazel, are remarkably bright; he has a sight keen as a hawk’s.

Volume-2: The speaker speaks with normal energy: His eyes, which are hazel, are remarkably bright; he has a sight keen as a hawk’s.

Volume-3: A speaker with a vibrant voice: His eyes, which are hazel, are remarkably bright; he has a sight keen as a hawk’s.

Volume Audio Prompt VS-TTS InstructTTS PromptStyle
1
2
3

1.2 Speed

Speed-1: The speaker spoke at a slow pace: Spargo, much astonished at this reception, passed through an ante room into a handsomely furnished apartment full of books and pictures.

Speed-2: The speaker spoke at a normal pace: Spargo, much astonished at this reception, passed through an ante room into a handsomely furnished apartment full of books and pictures.

Speed-3: The speaker spoke at a fast pace: Spargo, much astonished at this reception, passed through an ante room into a handsomely furnished apartment full of books and pictures.

Speed Audio Prompt VS-TTS InstructTTS PromptStyle
1
2
3

1.3 Pitch

Pitch-1: The speaker says with a low-key voice: The King of the Golden River had hardly made the extraordinary exit related in the last chapter, before Hans and Schwartz came roaring into the house very savagely drunk.

Pitch-2: The speaker says with a normal-key voice: The King of the Golden River had hardly made the extraordinary exit related in the last chapter, before Hans and Schwartz came roaring into the house very savagely drunk.

Pitch-3: The speaker says with a high-key voice: The King of the Golden River had hardly made the extraordinary exit related in the last chapter, before Hans and Schwartz came roaring into the house very savagely drunk.

Pitch Audio Prompt VS-TTS InstructTTS PromptStyle
1
2
3
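The attribute-control samples above follow one pattern: each attribute has three style descriptions, and all three are paired with the same content text. A minimal Python sketch of that grid is shown below. The names `LEVEL_DESCRIPTIONS` and `build_prompts` are hypothetical and are not part of VS-TTS; the descriptions and the content text are copied from this page.

```python
from typing import Dict, Tuple

# Style descriptions copied from the samples on this page, ordered by level (1 to 3).
LEVEL_DESCRIPTIONS: Dict[str, Tuple[str, str, str]] = {
    "Volume": (
        "A speaker is speaking softly",
        "The speaker speaks with normal energy",
        "A speaker with a vibrant voice",
    ),
    "Speed": (
        "The speaker spoke at a slow pace",
        "The speaker spoke at a normal pace",
        "The speaker spoke at a fast pace",
    ),
    "Pitch": (
        "The speaker says with a low-key voice",
        "The speaker says with a normal-key voice",
        "The speaker says with a high-key voice",
    ),
}


def build_prompts(attribute: str, content_text: str) -> Dict[str, Tuple[str, str]]:
    """Return {"<attribute>-<level>": (style description, content text)} for levels 1 to 3."""
    descriptions = LEVEL_DESCRIPTIONS[attribute]
    return {
        f"{attribute}-{level}": (description, content_text)
        for level, description in enumerate(descriptions, start=1)
    }


speed_prompts = build_prompts(
    "Speed",
    "Spargo, much astonished at this reception, passed through an ante room "
    "into a handsomely furnished apartment full of books and pictures.",
)
```

Holding the content text fixed while varying only the description makes it easier to attribute audible differences to the style prompt rather than to the sentence being read.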

 

2. LibriSpeech test-clean (unseen case)

2.1 Attribute Control

Volume

Volume-1: A speaker is speaking softly: It was a long ride over the circuitous route by which the steep incline was avoided and it was necessary for the party to make an early start.

Volume-2: The speaker speaks with normal energy: It was a long ride over the circuitous route by which the steep incline was avoided and it was necessary for the party to make an early start.

Volume-3: A speaker with a vibrant voice: It was a long ride over the circuitous route by which the steep incline was avoided and it was necessary for the party to make an early start.

Model Audio Prompt Volume-1 Volume-2 Volume-3
VS-TTS

Speed

Speed-1: The speaker spoke at a slow pace: The judge refused to admit his evidence, on the ground that the witness had destroyed beforehand all the confidence of the Court in what he was about to say.

Speed-2: The speaker spoke at a normal pace: The judge refused to admit his evidence, on the ground that the witness had destroyed beforehand all the confidence of the Court in what he was about to say.

Speed-3: The speaker spoke at a fast pace: The judge refused to admit his evidence, on the ground that the witness had destroyed beforehand all the confidence of the Court in what he was about to say.

Model Audio Prompt Speed-1 Speed-2 Speed-3
VS-TTS

Pitch

Pitch-1: The speaker says with a low-key voice: But upon the question of labour mr Grammont was fierce, even for an American business man, and one night at a dinner party he discovered his daughter displaying what he considered an improper familiarity with socialist ideas.

Pitch-2: The speaker says with a normal-key voice: But upon the question of labour mr Grammont was fierce, even for an American business man, and one night at a dinner party he discovered his daughter displaying what he considered an improper familiarity with socialist ideas.

Pitch-3: The speaker says with a high-key voice: But upon the question of labour mr Grammont was fierce, even for an American business man, and one night at a dinner party he discovered his daughter displaying what he considered an improper familiarity with socialist ideas.

Model Audio Prompt Pitch-1 Pitch-2 Pitch-3
VS-TTS

2.2 Attribute Control with unseen speaker

sample1

With a low voice, the speaker engages in quick speech while maintaining usual vitality.

  1. mrs Harker began to blush, and taking a paper from her pockets, she said:
  2. The King of the Golden River had hardly made the extraordinary exit related in the last chapter, before Hans and Schwartz came roaring into the house very savagely drunk.
  3. Dykvelt, whose adroitness and intimate knowledge of English politics made his assistance, at such a conjuncture, peculiarly valuable, was one of the Ambassadors; and with him was joined Nicholas Witsen, a Burgomaster of Amsterdam, who seems to have been selected for the purpose of proving to all Europe that the long feud between the House of Orange and the chief city of Holland was at an end.
Audio Prompt:
VS-TTS

 

sample2

The speaker says quickly but in a quiet manner.

  1. But it isn’t good manners to tell your company what you are going to give them to eat, so I won’t tell you what she said we could have to drink.
  2. His own possessions, safety, life, he would have hazarded for Lucie and her child, without a moment’s demur; but the great trust he held was not his own, and as to that business charge he was a strict man of business.
  3. “But, Pencroft,” answered Spilett, “you are describing a picture of the Creator.”
Audio Prompt:
VS-TTS

sample3

Speaking slowly, the speaker had a high-pitched voice and a quiet, low-energy aura.

  1. But it isn’t good manners to tell your company what you are going to give them to eat, so I won’t tell you what she said we could have to drink.
  2. His own possessions, safety, life, he would have hazarded for Lucie and her child, without a moment’s demur; but the great trust he held was not his own, and as to that business charge he was a strict man of business.
  3. Hugh told him the name; and then made him look with the telescope all along the receding line to the trees on the opposite hill.
Audio Prompt:
VS-TTS

3. Emotional Speech Dataset test

sample1

Text prompt: A amazed speaker’s energetic high pitch electrifies and enlivens the listeners.

Text: To catch that bulrush root with my paw

Audio Prompt VS-TTS InstructTTS PromptStyle

sample2

Text prompt: With speaker’s distinctive deep voice, speaker sadly converses at a moderate speed and standard energy levels.

Text: But the ships are very slow now and we don’t get so many sailors any more.

Audio Prompt VS-TTS InstructTTS PromptStyle

sample3

Text prompt: With normal pitch, the furious speaker delivers an animated speech at a regular pace.

Text: Don’t ask me to carry an oily rag like that.

Audio Prompt VS-TTS InstructTTS PromptStyle

sample4

Text prompt: The afraid speaker swiftly expressed

Text: Dogs are sitting by the door.

Audio Prompt VS-TTS InstructTTS PromptStyle

sample5

Text prompt: The speaker gleefully talks at a normal pace.

Text: Zero four three a silver shilling is journey.

Audio Prompt VS-TTS InstructTTS PromptStyle

4. Non-reading-style speech dataset test

Speed

Speed-1: The speaker spoke at a slow pace: The judge refused to admit his evidence, on the ground that the witness had destroyed beforehand all the confidence of the Court in what he was about to say.

Speed-2: The speaker spoke at a normal pace: The judge refused to admit his evidence, on the ground that the witness had destroyed beforehand all the confidence of the Court in what he was about to say.

Speed-3: The speaker spoke at a fast pace: The judge refused to admit his evidence, on the ground that the witness had destroyed beforehand all the confidence of the Court in what he was about to say.

LRS3 Speaker Audio Prompt Speed-1 Speed-2 Speed-3
o1Z4F4e2Bw4_00003
k2hQL9Zrokk_00002

Pitch

Pitch-1: The speaker says with a low-key voice: But upon the question of labour mr Grammont was fierce, even for an American business man, and one night at a dinner party he discovered his daughter displaying what he considered an improper familiarity with socialist ideas.

Pitch-2: The speaker says with a normal-key voice: But upon the question of labour mr Grammont was fierce, even for an American business man, and one night at a dinner party he discovered his daughter displaying what he considered an improper familiarity with socialist ideas.

Pitch-3: The speaker says with a high-key voice: But upon the question of labour mr Grammont was fierce, even for an American business man, and one night at a dinner party he discovered his daughter displaying what he considered an improper familiarity with socialist ideas.

LRS3 Speaker Audio Prompt Pitch-1 Pitch-2 Pitch-3
GSf6nijSSdA_00004
UMhLBPPtlrY_00004

Volume

Volume-1: A speaker is speaking softly: It was a long ride over the circuitous route by which the steep incline was avoided and it was necessary for the party to make an early start.

Volume-2: The speaker speaks with normal energy: It was a long ride over the circuitous route by which the steep incline was avoided and it was necessary for the party to make an early start.

Volume-3: A speaker with a vibrant voice: It was a long ride over the circuitous route by which the steep incline was avoided and it was necessary for the party to make an early start.

LRS3 Speaker Audio Prompt Volume-1 Volume-2 Volume-3
jcp5vvxtEaU_00001
y6MC4iXhT6I_00001

5. More samples

sample1

Text prompt: Rapid speech with a high key characterizes speaker’s conversation style.

Text: She tied the unhappy dog up again, but do you think Nana ceased to bark? Bring master and missus home from the party!

Audio Prompt VS-TTS

sample2

Text prompt: A amazed speaker voice, low in pitch, speaks quickly while exuding normal energy.

Text: Come on my jack in the boxes!

Audio Prompt VS-TTS

sample3

Text prompt: Speaking quickly and with a deep pitch, speaker sustains a normal level of vitality.

Text: I ran back to the entrance compartment.

Audio Prompt VS-TTS

sample4

Text prompt: In speaker’s conversation, the speaker utilizes a standard pitch, conversing at a moderate pace, and exuding low energy.

Text: It was not the first time that I had helped him and been well paid for my help.

Audio Prompt VS-TTS

sample5

Text prompt: speaker’s speech maintains a usual pitch as speaker mournfully talks at a regular speed with a touch of standard energy.

Text: They’d never know I’d regular ran away.

Audio Prompt VS-TTS

sample6

Text prompt: A speaker talks quickly in a hushed manner.

Text: In the healthy marriage, this sympathetic response will soon give way to anger, which in turn may have the effect of a dash of cold water in the face of the oversensitive one, helping him or her to buck up and behave like an adult.

Audio Prompt VS-TTS

sample7

Text prompt: Rapid-speaking speaker in a quiet manner.

Text: In the daytime she sat down once more beneath the windows of the castle, and began to card with her golden carding comb; and then all happened as it had happened before.

Audio Prompt VS-TTS