
Article: Articles in international or national peer-reviewed journals

Emotion recognition plays a critical role in the development of adaptive, human-aware intelligent systems. In this work, we propose an end-to-end audiovisual emotion recognition framework that integrates speech signals and facial expressions using lightweight deep learning architectures.
To develop the end-to-end architecture, we first benchmark several pretrained convolutional neural networks, employing confidence interval estimation to statistically evaluate the trade-off between recognition accuracy and model complexity. EfficientNetV2-B0 is identified as the most effective backbone for facial emotion recognition and is subsequently adopted as the feature extractor in our audiovisual framework.
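The abstract does not specify which interval estimator is used, but the accuracy comparison it describes can be sketched with a standard Wilson score interval over held-out test predictions. The model names and counts below are purely illustrative, not results from the paper:

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score 95% confidence interval for a classification accuracy
    estimated from `correct` successes out of `total` test samples."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - margin, centre + margin

# Hypothetical benchmark results: (backbone, correct predictions, test-set size)
results = [("EfficientNetV2-B0", 412, 500), ("LargerBackbone", 418, 500)]
for name, correct, total in results:
    lo, hi = wilson_interval(correct, total)
    print(f"{name}: acc={correct / total:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

When the intervals of a small and a large backbone overlap, the accuracy gain of the larger model is not statistically significant, which is the kind of accuracy-versus-complexity trade-off the benchmark is meant to expose.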
Achieving a lightweight and efficient audiovisual emotion recognition system requires balancing accuracy, robustness, and model size. We address this by proposing three progressively refined architectures that combine model-based and late fusion techniques. The baseline model employs a transformer-based architecture for audiovisual fusion. To handle the potential absence of a modality, we introduce a variant that strengthens the modeling of modality-specific characteristics. This variant is further improved by integrating self-attention mechanisms within each modality, enabling the system to capture both cross-modal correlations and intra-modal dynamics.
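The combination of intra-modal self-attention and cross-modal fusion described above can be illustrated with a minimal NumPy sketch of scaled dot-product attention. This is an assumption-laden simplification (random features, single head, no learned projections), not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: queries q attend to keys k, values v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 64                               # illustrative feature dimension
audio = rng.normal(size=(10, d))     # 10 audio frames (hypothetical)
video = rng.normal(size=(8, d))      # 8 video frames (hypothetical)

# Intra-modal self-attention: each modality models its own dynamics.
audio_ctx = attention(audio, audio, audio)
video_ctx = attention(video, video, video)

# Cross-modal attention: audio queries attend to visual keys/values,
# capturing correlations between the two modality streams.
fused = attention(audio_ctx, video_ctx, video_ctx)
print(fused.shape)  # one fused representation per audio frame
```

Because each modality is first encoded by its own self-attention stage, the fusion step can degrade gracefully when one stream is missing: the remaining modality-specific representation is still usable on its own, which mirrors the motivation for the modality-absence variant.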
Extensive experiments conducted on the RAVDESS dataset demonstrate that our proposed architectures outperform existing state-of-the-art methods on this benchmark. Furthermore, the models exhibit strong performance with a low memory footprint, making them well-suited for resource-constrained devices.