It’s not Just What You Do but also When You Do It: Novel Perspectives for Informing Interactive Public Speaking Training

Most emerging public speaking training systems, while very promising, rely on temporally aggregated
features, which do not take into account the structure of the speech. In this paper, we take a different perspective,
testing whether some well-known socio-cognitive theories, such as first impressions or the primacy and recency
effects, apply in the distinct context of public speaking perception. We investigated the impact of the temporal
location of speech slices (i.e., at the beginning, middle or end) on the perception of confidence and persuasiveness
of speakers giving online movie reviews (the Persuasive Opinion Multimedia dataset). Results show
that, when multiple modalities are considered, the middle part of the speech is usually the most informative. Additional
findings suggest the value of leveraging local interpretability (by computing SHAP values) to provide
feedback directly, both at a specific time (which part of the speech?) and for a specific behaviour modality or feature
(which behaviour?). This is a first step towards the design of more explainable and pedagogical interactive
training systems. Such systems could be more efficient by focusing on improving the speaker’s most important
behaviour during the most important moments of their performance, and by situating feedback at specific
places within the total speech.
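
To make the contrast between temporally aggregated and slice-level features concrete, the following is a minimal sketch and not the pipeline used in this work. It assumes a single behavioural feature sampled at regular intervals; the function name `slice_means`, the three-way equal split, and the toy values are hypothetical choices made only for illustration.

```python
from statistics import mean


def slice_means(values, n_slices=3):
    """Split a feature time series into contiguous slices and average each one.

    With n_slices=3 the slices stand for the beginning, middle and end of
    the speech. Equal-length slicing is an assumption of this sketch.
    """
    n = len(values)
    if n < n_slices:
        raise ValueError("need at least one sample per slice")
    # Integer boundaries that cover the whole series without overlap.
    bounds = [i * n // n_slices for i in range(n_slices + 1)]
    return [mean(values[bounds[i]:bounds[i + 1]]) for i in range(n_slices)]


# Toy, made-up feature values (e.g. a normalised vocal feature per time step).
feature = [1, 1, 1, 5, 5, 5, 3, 3, 3]

aggregate = mean(feature)                        # one value for the whole speech
beginning, middle, end = slice_means(feature)    # one value per speech part
```

In this toy series the single aggregate value hides the fact that the feature peaks in the middle slice, which is the kind of temporal information a slice-level representation keeps.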