Authors : Leila Zouari (LTCI), Gérard Chollet (LTCI)

Article : Articles in peer-reviewed international or national journals

This article investigates speech transcription for the voice-driven animation of an Embodied
Conversational Agent (ECA). The idea is to detect certain spoken expressions and keywords in
order to automatically animate the face and body of an avatar.
Extensibility, speed and precision are the main constraints of this interactive application. Once
the set of words relevant to the application had been defined, a fast large-vocabulary speech
recognition system was developed and its keyword detection was evaluated.
To speed up the recognition system without degrading its recognition performance, the acoustic
models were shortened by an original process. It consists of reducing the number of shared central
states of the context-dependent models, since these central states are considered stationary. The
shared states at the borders of the models remain unchanged. All the models are then retrained.
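
The shortening step can be illustrated with a minimal sketch. The abstract does not give the model topology, so the five-state triphone below is an assumption, and the names `TriphoneModel`, `shorten` and `relative_reduction` are hypothetical. The sketch only shows how central states might be dropped while the border states are kept. The retraining step is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriphoneModel:
    """Hypothetical container: a triphone and its ordered shared (tied) states."""
    name: str
    states: List[str]


def shorten(model: TriphoneModel, n_remove: int = 1) -> TriphoneModel:
    """Return a copy with n_remove states dropped from the centre of the model.

    The first and last states (the borders) are always kept.
    """
    n = len(model.states)
    if n_remove < 0 or n - n_remove < 2:
        raise ValueError("must keep at least the two border states")
    # Index of the first removed state, chosen so the removed run is centred.
    start = (n - n_remove) // 2
    kept = model.states[:start] + model.states[start + n_remove:]
    return TriphoneModel(model.name, kept)


def relative_reduction(before: TriphoneModel, after: TriphoneModel) -> float:
    """Fraction by which the model length (number of states) has decreased."""
    return (len(before.states) - len(after.states)) / len(before.states)


# Assumed topology for illustration only: five shared states per triphone.
original = TriphoneModel("a-b+c", ["s1", "s2", "s3", "s4", "s5"])
shortened = shorten(original, n_remove=1)
print(shortened.states)
print(relative_reduction(original, shortened))
```

Under this assumed topology, removing one central state out of five shortens the model by one fifth. The actual topology and the number of states removed in the article may differ.
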
The system is evaluated on one hour of the ESTER database (a French broadcast news database). The
experiments show that reducing the number of central states of the triphones is advantageous: the
length of the models is reduced by 20% with no loss of accuracy.