CAPSULE TRANSFORMER NETWORK FOR DYNAMIC HAND GESTURE RECOGNITION USING MULTIMODAL DATA

October 2023

Engineering & Digital Tools

International conference paper with proceedings

Authors: Alexandre Lebas (CRISTAL), Rim Slama (LINEACT), Hazem Wannous (CRISTAL)
Conference: 2023 IEEE International Conference on Image Processing, October 7, 2023

In recent years, deep learning techniques have achieved remarkable success in video analysis, especially in action and gesture recognition. Although convolutional neural networks (CNNs) remain the most widely used models, their local feature learning mechanism makes it difficult for them to capture global contextual information across the spatial and temporal domains or across modalities. This paper introduces a Capsule Transformer Network, which is composed of a frame capsule module for extracting hand features and a gesture transformer module for modeling temporal features and recognizing the dynamic gesture. The capsule module provides spatial attention to enhance the spatial information of the hand image, while the transformer module provides temporal attention over the gesture sequence. We propose to use multimodal data, including RGB, depth and IR data, which improves the accuracy of our approach because it better captures the 3D structure of the hand and helps distinguish between similar hand gestures. Tested on two datasets, Briareo and SHREC17, the proposed approach outperforms or equals previous methods.
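
The two-stage idea in the abstract (per-frame capsule features, then temporal attention over the frame sequence) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract gives no layer sizes, routing scheme, or fusion strategy, so everything below is an assumption. The `squash` function is the standard capsule nonlinearity, `temporal_self_attention` is plain single-head scaled dot-product attention with identity projections, and the `frame_features` values are made-up numbers standing in for features of one modality.

```python
import math

def squash(v):
    """Standard capsule 'squash': keeps the direction of v, maps its length into [0, 1)."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        return [0.0 for _ in v]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(frames):
    """Single-head scaled dot-product self-attention over time.

    Assumption for brevity: identity projections (Q = K = V = frames);
    a real transformer layer would learn separate projection matrices.
    """
    d = len(frames[0])
    attended = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)]
        )
    return attended

# Hypothetical per-frame feature vectors (4 frames, 3 dimensions each).
frame_features = [
    [0.5, 1.0, -0.5],
    [2.0, 0.0, 1.0],
    [0.0, -1.5, 0.5],
    [1.0, 1.0, 1.0],
]

# Stage 1 (stand-in for the frame capsule module): squash each frame vector.
capsules = [squash(f) for f in frame_features]

# Stage 2 (stand-in for the gesture transformer module): attend across frames,
# then average over time to get one embedding for the whole gesture.
attended = temporal_self_attention(capsules)
gesture_embedding = [sum(col) / len(attended) for col in zip(*attended)]
```

In a full model, a classifier would map `gesture_embedding` to gesture labels, and one such pipeline per modality (RGB, depth, IR) would be combined; how the paper performs that combination is not stated in the abstract.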