Structured pruning for efficient systolic-array-accelerated cascade speech-to-text translation
Conference: communication with proceedings at an international conference
We present in this paper a simple method for pruning tiles of weights, producing structured sparse matrices, that does not require fine-tuning or retraining. The method is applied here to the feed-forward layers of transformers. In a first experiment, we assess the impact of such pruning on the performance of speech recognition, machine translation, and the cascaded speech-to-text translation, on the MuST-C database, for the English-to-French direction.
Depending on the size of the pruned tiles (from 4×4 to 32×32), we observe that pruning rates of 15 to 40% for speech recognition and 40 to 70% for machine translation are feasible while limiting the performance degradation to 10%. Applying this pruning method to the systolic-array-accelerated version of the cascade speech-to-text translation system yields speedups of up to 74× compared to the non-accelerated system. Energy consumption also benefits from structured pruning, with a maximum reduction of 35%.
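
The abstract does not specify how tiles are selected for removal. Purely as a hedged illustration, the sketch below shows one plausible form of tile pruning without retraining: weight tiles are ranked by L1 magnitude and the lowest-scoring fraction is zeroed. The function name `prune_tiles`, the default 16×16 tile, and the L1 scoring rule are assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np

def prune_tiles(weights: np.ndarray, tile: int = 16, rate: float = 0.4) -> np.ndarray:
    """Zero the fraction `rate` of (tile x tile) blocks with the smallest
    L1 norm; no fine-tuning or retraining is applied afterwards.
    (Illustrative sketch; the scoring rule is an assumption.)"""
    rows, cols = weights.shape
    assert rows % tile == 0 and cols % tile == 0, "matrix must be tile-aligned"
    # View the matrix as a grid of tiles and score each tile by its L1 norm.
    grid = weights.reshape(rows // tile, tile, cols // tile, tile)
    scores = np.abs(grid).sum(axis=(1, 3))          # one score per tile
    # Keep tiles whose score exceeds the `rate`-quantile; the rest are pruned.
    keep = scores > np.quantile(scores, rate)
    # Expand the tile-level mask back to element granularity and apply it.
    mask = np.repeat(np.repeat(keep, tile, axis=0), tile, axis=1)
    return weights * mask

# Example: prune ~40% of the 16x16 tiles of a feed-forward weight matrix.
W = np.random.randn(512, 2048).astype(np.float32)
W_sparse = prune_tiles(W, tile=16, rate=0.4)
```

Because whole tiles become zero, the resulting sparsity pattern maps naturally onto tile-granular computation such as a systolic array, which is what enables the speedups reported above.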