Graph Transformer Mixture-of-Experts (GTMoE) for 3D Hand Gesture Recognition
Conference: Paper published in the proceedings of an international conference
Mixture-of-Experts (MoE) architectures have gained popularity for achieving high performance on a wide range of challenging tasks in large language modeling and computer vision, especially with the rise of sparse MoE Transformers such as Mixtral, built on the Mistral-7B architecture. In this work, we propose the Graph Transformer Mixture-of-Experts (GTMoE) deep learning architecture, which enhances the Transformer model with Mixture-of-Experts layers and Graph Convolutional Networks (GCNs) for graph-based hand gesture recognition from 3D hand skeleton data, a challenging task. The main challenge is how to integrate MoE with graphs for 3D hand gesture recognition; in this context, GTMoE aims to address the efficient use and integration of MoE architectures with GCNs for this task. 3D hand gesture recognition has recently become one of the most active research areas in human-computer interaction and pattern recognition. For this task, the proposed GTMoE model decouples the spatial and temporal graph learning of 3D hand gestures by integrating Mixture-of-Experts layers into a Transformer model together with Graph Convolutional Networks. The principal idea is to combine MoE layers with a Spatial Graph Convolutional Network (SGCN) that preprocesses the initial spatial features of intra-frame interactions to extract powerful features from the different hand joints, and then to recognize hand gestures with the MoE Transformer encoder. Finally, we evaluate the performance of GTMoE on the benchmarks of the SHREC'17 Track dataset. The experiments show the efficiency of several variations of the proposed GTMoE model, which matches or outperforms the state of the art.
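To make the decoupled spatial/temporal design concrete, below is a minimal PyTorch sketch of the pipeline described in the abstract: a spatial graph convolution aggregates intra-frame joint features, and a Transformer block whose feed-forward layer is a gated mixture-of-experts models the temporal dynamics before classification. The layer sizes, the soft gating, the identity adjacency placeholder, and the pooling choices are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the GTMoE pipeline, under assumed hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialGraphConv(nn.Module):
    """Intra-frame graph convolution over hand joints (SGCN step)."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)        # (J, J) normalized joint adjacency
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                       # x: (B, T, J, C)
        x = torch.einsum("ij,btjc->btic", self.adj, x)   # aggregate neighboring joints
        return F.relu(self.proj(x))


class MoEFeedForward(nn.Module):
    """Softly gated mixture-of-experts replacing the Transformer FFN."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (B, T, D)
        scores = F.softmax(self.gate(x), dim=-1)             # (B, T, E) gating weights
        out = torch.stack([e(x) for e in self.experts], -1)  # (B, T, D, E)
        return (out * scores.unsqueeze(2)).sum(-1)           # weighted expert mixture


class GTMoEBlock(nn.Module):
    """Temporal self-attention followed by the MoE feed-forward."""
    def __init__(self, dim, heads=4, hidden=256, num_experts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.moe = MoEFeedForward(dim, hidden, num_experts)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, T, D)
        h = self.n1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.moe(self.n2(x))


class GTMoE(nn.Module):
    """SGCN spatial features per frame -> MoE Transformer over time -> gesture class."""
    def __init__(self, num_joints=22, in_dim=3, dim=128, num_classes=14):
        super().__init__()
        adj = torch.eye(num_joints)              # placeholder: real hand-skeleton adjacency
        self.sgcn = SpatialGraphConv(in_dim, dim, adj)
        self.block = GTMoEBlock(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, T, J, 3) 3D joint coordinates
        x = self.sgcn(x).mean(dim=2)             # pool joints -> (B, T, D)
        x = self.block(x)
        return self.head(x.mean(dim=1))          # pool time -> class logits


if __name__ == "__main__":
    model = GTMoE()
    logits = model(torch.randn(2, 32, 22, 3))    # batch of 2 sequences, 32 frames, 22 joints
    print(logits.shape)                          # torch.Size([2, 14])
```

The 22 joints and 14 gesture classes match the SHREC'17 Track setting; in a full model, the identity adjacency would be replaced by the hand-skeleton graph and the single block stacked several times.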