Authors: Alexandre Lebas, Rim Slama, Hazem Wannous

Conference: Paper with proceedings at an international conference

In recent years, deep learning techniques have achieved remarkable
success in video analysis, particularly in
action and gesture recognition. Even though convolutional
neural networks (CNNs) remain the most widely used models,
they have difficulty capturing global contextual
information across the spatial and temporal domains or across modalities
because of their local feature learning mechanism. This
paper introduces a Capsule Transformer Network, which
is composed of a frame capsule module for extracting hand features
and a gesture transformer module for modeling the temporal
features and recognizing the dynamic gesture. The capsule
module provides spatial attention that enhances
the spatial information of the hand image, while the transformer
module provides temporal attention across the gesture
sequence. We propose to use multimodal data, including
RGB, depth and IR data, which improves the accuracy of our
approach as it better captures the 3D structure of the hand and
can distinguish between similar hand gestures. Tested on
two datasets, Briareo and SHREC17, the proposed approach
outperforms or matches previous methods.
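
The two attention ideas named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes the standard capsule "squash" non-linearity from the original capsule network formulation, a plain scaled dot-product self-attention over per-frame feature vectors with no learned projections, and a hypothetical late fusion that simply averages per-modality class probabilities (the abstract does not say how RGB, depth and IR are combined). All function names and toy values are illustrative.

```python
import math

def squash(s, eps=1e-9):
    """Capsule squashing non-linearity: keeps the direction of s and
    maps its length into [0, 1), so length can act as a confidence."""
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq + eps)
    return [scale * x for x in s]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(frames):
    """Scaled dot-product self-attention over a sequence of frame vectors.
    Assumption: queries, keys and values are the inputs themselves;
    a real transformer would apply learned linear projections first."""
    d = len(frames[0])
    attended = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)]
        )
    return attended

def fuse_modalities(prob_lists):
    """Hypothetical late fusion: average class probabilities across modalities."""
    n = len(prob_lists)
    return [sum(p[c] for p in prob_lists) / n for c in range(len(prob_lists[0]))]

# Toy data: three frames, each described by a 2-D capsule-like vector.
raw_frames = [[3.0, 4.0], [0.3, 0.4], [0.0, 1.0]]
frame_features = [squash(v) for v in raw_frames]
attended = temporal_self_attention(frame_features)

# Toy per-modality class probabilities (two gesture classes), e.g. RGB, depth, IR.
fused = fuse_modalities([[0.7, 0.3], [0.5, 0.5], [0.6, 0.4]])
```

Each attended frame is a weighted average of all frames in the sequence, which is how a temporal attention layer lets every time step draw on global context rather than only a local window.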