This paper discusses automatic phonetic transcription for use in Hungarian speech recognition. It first introduces the basic technologies of automatic speech recognition (ASR) for readers unfamiliar with the field, and then discusses the role of (automatic) phonetic transcription in ASR. We then introduce our method for transcribing Hungarian texts automatically. The technique extends the traditional linear transcription approach; its output is called 'optioned' because it encodes pronunciation options as parallel arcs. We report our experience with the method, which shows promising improvements in recogniser training efficiency. These gains result from applying deeper linguistic (phonological) knowledge. The training technique not only enhances the quality of the acoustic models but also effectively reduces the amount of manual work required.
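To make the idea of 'optioned' output concrete, the following minimal sketch represents a transcription as a sequence of slots, each holding one or more parallel arcs, and expands it into the linear transcriptions it encodes. The data layout, the `expand` helper, the ASCII phone symbols, and the example word (népdal, with optional voicing of p before d) are assumptions made for illustration only; they are not the format or the rule set of the system described above.

```python
from itertools import product

# Hypothetical optioned transcription of the word "népdal".
# Each slot lists its parallel arcs; an arc is a tuple of phone symbols.
# Symbols are ad hoc ASCII labels, not an official phone set.
optioned = [
    [("n",)],
    [("e:",)],
    [("p",), ("b",)],  # two parallel arcs: unassimilated vs. voiced variant
    [("d",)],
    [("O",)],
    [("l",)],
]


def expand(slots):
    """Enumerate every linear transcription encoded by the parallel arcs."""
    return [
        " ".join(phone for arc in path for phone in arc)
        for path in product(*slots)
    ]


variants = expand(optioned)
for v in variants:
    print(v)
```

A linear transcription would have to commit to one of these variants in advance, whereas the optioned form keeps both available so that training can select the arc that best matches the audio.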
Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Neural Network Language Models (NNLMs) can remedy the high perplexity of the task; however, their high complexity makes them very difficult to apply in the first (single) pass of an online system. Recent studies have shown that a considerable part of the knowledge of NNLMs can be transferred to traditional n-grams through data augmentation based on neural text generation. Data augmentation with NNLMs works well for isolating languages; however, we show that it causes a vocabulary explosion in a morphologically rich language. Therefore, we propose a new, morphology-aware neural text augmentation method, in which we retokenize the generated text into statistically derived subwords. We compare the performance of word-based and subword-based data augmentation techniques with recurrent and Transformer language models and show that subword-based methods can significantly improve the Word Error Rate (WER) while greatly reducing vocabulary size and memory requirements. By combining subword-based modeling and neural language model-based data augmentation, we achieved an 11% relative WER reduction while preserving real-time operation of our conversational telephone speech recognition system. Finally, we also demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in terms of overall WER but also in the recognition of Out-of-Vocabulary (OOV) words.
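The effect of retokenization on vocabulary size can be illustrated with a small sketch. It assumes a hand-written subword inventory and a greedy longest-match segmenter, both hypothetical stand-ins: the method described above derives its subwords statistically, and the `segment` and `retokenize` helpers, the `+` continuation marker, and the toy text are not taken from it.

```python
def segment(word, inventory):
    """Greedy longest-match split of a word into known subwords."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in inventory:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known subword starts here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


def retokenize(text, inventory):
    """Replace each word by its subwords; '+' marks non-initial pieces."""
    tokens = []
    for word in text.split():
        pieces = segment(word, inventory)
        tokens.append(pieces[0])
        tokens.extend("+" + p for p in pieces[1:])
    return tokens


# Hypothetical inventory: three stems and three case suffixes.
inventory = {"ház", "út", "lány", "ban", "ból", "nak"}
# Toy stand-in for generated text: every stem with every suffix.
text = "házban házból háznak útban útból útnak lányban lányból lánynak"

word_vocab = set(text.split())
subword_vocab = set(retokenize(text, inventory))
print(len(word_vocab), len(subword_vocab))
```

In this toy case the word vocabulary grows with the product of stems and suffixes, while the subword vocabulary grows only with their sum, which is the intuition behind retokenizing generated text before n-gram estimation.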