Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control
Audio Samples
Speech Quality and Speaker Consistency
Same-speaker from ESD dataset
| emotion style | GT (reference) | baseline | proposed |
|---|---|---|---|
| Neutral | | | |
| Angry | | | |
| Sad | | | |
| Happy | | | |
Cross-speaker from VCTK dataset
| emotion style | seen speaker | unseen speaker (zero-shot) |
|---|---|---|
| Neutral | | |
| Angry | | |
| Sad | | |
| Happy | | |
Cross-speaker Emotion Intensity Controllability
Cross-speaker (seen) from VCTK dataset
| emotion style | alpha = 0.1 (weak) | alpha = 0.5 (medium) | alpha = 0.9 (strong) |
|---|---|---|---|
| Angry | | | |
| Sad | | | |
| Happy | | | |
Cross-speaker (unseen) from VCTK dataset
| emotion style | alpha = 0.1 (weak) | alpha = 0.5 (medium) | alpha = 0.9 (strong) |
|---|---|---|---|
| Angry | | | |
| Sad | | | |
| Happy | | | |