The bottleneck is the annotations: there's no easy way to annotate "emotions" on...

tsumnia · on Feb 14, 2024

Oh yeah, the annotations are lacking compared to images. Again from the academic side, I think one solution could be to recruit theater majors who are just learning about 'verbing their lines' and set up a collaboration between CS and Theater to produce a proof-of-concept dataset (since an acting class won't have more than 20-30 students in it). You'd need significantly more annotations, but you'd now have some labels to ascribe to texts with context, since it's a dialogue involving one or more individuals.

taneq · on Feb 15, 2024

I wonder how theatre students will feel about helping to train an AI to produce theatrical TTS? Artists seem pretty mad about their work being used to automate artwork.

isaacfung · on Feb 15, 2024

There is a lot of video content with audio. We can train a facial expression classification model to detect the speaker's emotion (we can also use a multimodal model to take the language context into consideration).

Another potential source of data is voice-acting scripts for animation. I always thought the storyboards of films/animations could be great annotated training data, but it seems there are no open datasets, probably because of copyright issues.

biomcgary · on Feb 14, 2024

Just run an LLM in sentiment analysis mode to annotate.

rhdunn · on Feb 15, 2024

That doesn't factor in line delivery. You can have the words say/mean one thing (e.g. "I'm fine.") and the delivery say/mean another (defensive, distraught, etc.).

It also does not account for where stresses, emphasis, pauses, etc. are placed to enhance the delivery of a given text.
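To make that concrete, here is a rough sketch of what a delivery-aware annotation would need to carry beyond a text-only sentiment label. The `DeliveryAnnotation` schema and all of its field names are hypothetical, made up for illustration, and not taken from any existing dataset format:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeliveryAnnotation:
    """Hypothetical record for one spoken line (illustrative fields only)."""
    text: str
    text_sentiment: str    # what the words alone suggest
    delivery_emotion: str  # what the performance actually conveys
    # Indices into text.split() of words the speaker stresses.
    stressed_words: List[int] = field(default_factory=list)
    # Word index -> pause length in seconds after that word.
    pauses_after: Dict[int, float] = field(default_factory=dict)


def text_only_label(annotation: DeliveryAnnotation) -> str:
    """Stand-in for a text-only sentiment pass: it never sees the audio."""
    return annotation.text_sentiment


# Same words, two different performances.
defensive = DeliveryAnnotation("I'm fine.", "neutral", "defensive",
                               stressed_words=[1])
distraught = DeliveryAnnotation("I'm fine.", "neutral", "distraught",
                                pauses_after={0: 0.6})

# A text-only annotator cannot tell these two lines apart.
same_text_label = text_only_label(defensive) == text_only_label(distraught)
same_delivery = defensive.delivery_emotion == distraught.delivery_emotion
```

Here `same_text_label` comes out true while `same_delivery` comes out false: everything that distinguishes the two readings lives in fields that text alone cannot fill in.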

How do you get sentiment analysis to properly annotate an audiobook that has a dramatic reading, or something akin to the narration of the Game of Thrones or Harry Potter books, where the narrators switch characters, accents, and mannerisms to portray the written content?