EPT-MoE: Toward Efficient Parallel Transformers with Mixture-of-Experts for 3D Hand Gesture Recognition
Conference: Paper with proceedings at an international conference
The Mixture-of-Experts (MoE) is a widely known deep neural architecture in which an ensemble of specialized sub-models (a
group of experts) improves overall performance at a constant computational cost. With the rise of Mixture-of-Experts
Transformers such as Mixtral-8x7B, MoE architectures have gained popularity in Large Language Modeling (LLM) and Computer Vision.
In this paper, we propose the Efficient Parallel Transformers of Mixture-of-Experts (EPT-MoE), coupled with Spatial Feed Forward
Neural Networks (SFFN), to strengthen parallel Transformer models with Mixture-of-Experts layers for graph learning in 3D
skeleton-based hand gesture recognition. 3D hand gesture recognition is an active field of research in human-computer
interaction, VR/AR and pattern recognition. To this end, the proposed EPT-MoE model decouples the spatial and temporal graph
learning of 3D hand gestures by integrating Mixture-of-Experts layers into parallel Transformer models. The main idea is to first
apply Mixture-of-Experts layers to the initial spatial features of intra-frame interactions, extracting powerful features from the
different hand joints, and then to recognize 3D hand gestures with parallel Transformer encoders that also contain Mixture-of-Experts layers.
Finally, we conduct extensive experiments on the benchmarks of the SHREC’17 Track dataset to evaluate the performance of EPT-MoE model variations. EPT-MoE greatly improves overall performance and training stability while reducing the computational cost.
The experimental results show the efficiency of several variants of the proposed model (EPT-MoE), which matches or outperforms the
state of the art.
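
To make the idea of a Mixture-of-Experts layer with a constant computational cost more concrete, the following is a minimal sketch of sparse top-k routing applied to the 3D coordinates of one hand joint. It is an illustration only, not the EPT-MoE implementation: the abstract does not specify the routing scheme, so softmax gating with top-k selection is an assumption, and the class `MoELayer`, its parameters (`num_experts`, `top_k`, `seed`) and the linear experts are hypothetical names chosen for this example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_matrix(rows, cols, rng):
    """Random weights; stands in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MoELayer:
    """Hypothetical sparse MoE layer: a gate picks top_k of num_experts."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.gate = make_matrix(num_experts, dim, rng)
        # Each expert is a single linear map here (assumption for brevity).
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return {expert_index: weight} for the top_k experts, weights sum to 1."""
        probs = softmax(matvec(self.gate, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        total = sum(probs[i] for i in chosen)
        return {i: probs[i] / total for i in chosen}

    def forward(self, x):
        """Weighted sum of the outputs of the selected experts only."""
        out = [0.0] * len(x)
        for i, w in self.route(x).items():
            y = matvec(self.experts[i], x)
            out = [o + w * v for o, v in zip(out, y)]
        return out


# Usage: one hand joint given as (x, y, z) coordinates (made-up values).
layer = MoELayer(dim=3, num_experts=4, top_k=2)
joint = [0.1, -0.2, 0.3]
weights = layer.route(joint)
output = layer.forward(joint)
```

In this sketch only `top_k` experts are evaluated per input, so the per-input cost of the expert computation stays the same when `num_experts` grows; only the small gating step scales with the number of experts.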