Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control

Audio Samples

Speech Quality and Speaker Consistency


Same-speaker from ESD dataset


Emotion style | GT (reference) | Baseline | Proposed
Neutral
Angry
Sad
Happy

Cross-speaker from VCTK dataset


Emotion style | Seen speaker | Unseen speaker (zero-shot)
Neutral
Angry
Sad
Happy

Cross-speaker Emotion Intensity Controllability


Cross-speaker (seen) from VCTK dataset


Emotion style | alpha = 0.1 (weak) | alpha = 0.5 (medium) | alpha = 0.9 (strong)
Angry
Sad
Happy

Cross-speaker (unseen) from VCTK dataset


Emotion style | alpha = 0.1 (weak) | alpha = 0.5 (medium) | alpha = 0.9 (strong)
Angry
Sad
Happy