N-Best Search Methods Applied to Speech Recognition

Diploma Thesis (Diplomarbeit), Department of Telecommunications, Norwegian Institut of Technology, 1994

Heiko Purnhagen

Universitetet i Trondheim, Norges Tekniske Høgskole, Institutt for Teleteknikk
Universität Hannover, Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung

Introduction

The research on automatic recognition of human speech by machines started more than four decades ago. Systems capable of recognising up to several thousand words spoken as isolated utterances are already commercially available. The speaker independent recognition of continuous speech is much more complex and represents a main field of todays research in speech recognition. Most of the successful systems are based on Hidden Markov Models (HMMs). These statistical models have shown to be powerful and versatile and allow to model the speech process on different levels using the same concept. Thus, also language models like a grammar can be easily included in the recogniser.

When developing a speech recognition system, it is normally not sufficient to provide only models of the words that are to be recognised. To obtain reasonable performance, it is common to include additional knowledge sources in the recognition process. These can be models of the natural language or also knowledge about the task the recognition system will be used for. If all knowledge sources are simultaneously included, the search for the most likely word string (hypothesis) for a spoken utterance might become very complex. Recently, techniques that allow to apply the different knowledge sources sequentially and thus reduce computation have been proposed. These techniques are based on the N-best search paradigm. The Viterbi algorithm, which is normally used in HMM based recognition systems, is only able to find the best hypothesis. Now, an N-best algorithm generating the list of the N most likely hypotheses is required. These N-best hypotheses then can be rescored according to additional knowledge sources. In this way, also complex knowledge sources can easily be included in the recognition process.

Several different N-best algorithms for generating the list of the best N hypotheses have been proposed recently. Some of these algorithms generate an exact list of the N most likely hypotheses while others use different approximations to reduce computation. Besides for the N-best search paradigm, N-best algorithms can also be used for other new applications.

The parameters specifying an HMM based recogniser are found in training procedures that optimise these parameters in order to maximise an object function (e.g.\ likelihood of a training utterance). An important exception is the lexicon that is required in a sub-word based recogniser to build word models from the models of basic recognition units. This lexicon contains transcriptions in terms of basic recognition units for each word in the vocabulary and is normally generated by using a pronunciation dictionary or by an experienced phonetician. Different techniques for the automatic generation of entries in such a lexicon have been proposed. They require training utterances of the word and can also take into account the word's spelling. A modified version of one of the proposed N-best algorithms can be used to find the lexicon entries (transcriptions) that are optimal with respect to their likelihood for the training utterances.

The work done in this thesis can be divided into two main parts:

In the first part, different N-best algorithms were studied. They will be described and compared in this report. The tree-trellis algorithm, an exact and efficient N-best algorithm, was implemented as a part of this thesis work. This implementation is based on an existing continuous speech recogniser which is part of the HMM Toolkit (HTK). HTK is a collection of programs that allows to build and test HMM based recognisers in an efficient and flexible way. The implementation, which required several adaptations of the original tree-trellis algorithm, and the optimisations that were done in order to obtain maximum performance, will be described. This new N-best recogniser was tested on the 1000 word DARPA resource management task using different HMMs and different grammars. The test results and the computation and memory requirements will be reported.

In the second part, the automatic generation of lexica for a phoneme based recogniser was studied. The implementation of the tree-trellis algorithm was modified and extended to perform a search for the most likely hypotheses given a set of utterances of the same word. The modifications and the optimisation of this program will be described. The search for a new lexicon entry was in several cases too complex for the available hardware, although memory requirements were minimised. Therefore, different approximative techniques to find a new lexicon entry using a search with reduced complexity were developed and examined. Finally, several lexica were generated using these different techniques. The performance of these lexica and their effect on the results of recognition tests will be presented.

This thesis report is organised as follows: In Chapter 2, a short overview of the fundamentals of speech recognition is given. The concept of Hidden Markov Models is reviewed and the HMM Toolkit (HTK) is described. In Chapter 3, different N-best algorithms are presented and their features are compared. An implementation of the tree-trellis algorithm based on HTK's Viterbi recogniser is described and the results of recognition tests are reported. In Chapter 4, concepts for automatic lexicon generation are presented and the implementation of a modified tree-trellis algorithm for multiple utterances is described. Details and problems of the lexicon generation process are discussed and finally, the performance of different automatically generated lexica is examined. In Chapter 5, this report is summarised and conclusions are drawn. Appendix A contains the user manuals for the different programs that were written as a part of this thesis work.

The diploma thesis is available as PDF file.

Related publications

T. Svendsen, F. K. Soong, and H. Purnhagen,
“Optimizing baseforms for HMM-based speech recognition,”
Proc. EUROSPEECH, Sept. 1995. Abstract

T. Holter,
Maximum likelihood modelling of pronunciation in automatic speech recognition,
Ph.D. dissertation, The Norwegian University of Science and Technology, Trondheim, NO, Dec. 1997. ISBN 82-471-0174-2.