Consecutive Decoding for Speech-to-text Translation

Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

AAAI, 2021


Abstract
Speech-to-text translation (ST), which directly translates source-language speech into target-language text, has attracted intensive attention recently. However, combining speech recognition and machine translation in a single model places a heavy burden on the direct cross-modal, cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral framework for speech-to-text translation. We evaluate our method on three mainstream datasets: the Augmented LibriSpeech English-French dataset, the TED English-German dataset, and the TED English-Chinese dataset. Experiments show that COSTT outperforms previous state-of-the-art methods. Our code is available at https://github.com/dqqcasia/neurst.

Please cite as:

@inproceedings{dong2021consecutive,
  title={Consecutive Decoding for Speech-to-text Translation},
  author={Dong, Qianqian and Wang, Mingxuan and Zhou, Hao and Xu, Shuang and Xu, Bo and Li, Lei},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2021}
}