Speech Recognition and Understanding
We have pioneered research in automatic speech recognition and
understanding, and have accumulated experience and knowledge in the area
over the past two to three decades. The following figure illustrates the
technical paths that we have helped the research community walk through.
This ensemble of techniques and technologies represents the foundation of
most, if not all, automatic speech recognition systems in use today.
Figure: Development of fundamental techniques for automatic speech recognition
We continue to conduct research to lead the field by extending the
technology along the following directions:
- Robust speech recognition
A major challenge in deploying an automatic speech recognition system is
maintaining satisfactory recognition accuracy under all operating
conditions. It is well known that current technology suffers serious
performance degradation when operated in a mismatched condition (i.e., a
condition for which the recognizer was not designed). We will focus on
feature transformation and model adaptation techniques that respond rapidly
to changes in the operating condition or mode in order to achieve robust
performance.
- Discrete sequence representation
One major factor that influences the performance of an automatic speech
recognition system is the embedded knowledge of the language itself, in
terms of the grammar, the word sequence structure, and the associated
semantics. The grammar, traditionally called the language model under the
current technology framework, is often expressed as a finite-state automaton
for its computational advantages. Natural language, however, is obviously
not a finite-state machine. Mathematical representation of a discrete
sequence with arbitrary inter-symbol relationships is a hard problem that
has intrigued researchers for some time. We will explore new ideas in
non-Markovian processes as candidates for language representation.
- Semantic and emotional state detection
In many applications, the goal of the automatic speech recognition and
understanding system is to identify the intent or intended action of the
talker, rather than the exact word sequence. For example, in CRM (Customer
Relationship Management) systems, the use of an automatic speech recognition
and understanding system ranges from routing a customer's call to the right
person to resolve an issue, to recognizing the emotional state of a customer
by detecting relevant keywords or prosodic information. We plan to
investigate latent semantic indexing (LSI) for semantic decoding as a
supplement to language modeling, as well as a means for emotional state
detection by creating a mathematical association between an emotional state
and a set of relevant words as organized by the LSI scheme.
- Natural dialog with referential semantics
The ability to invoke pre-existing references in semantic expressions is a
major factor that contributes to the naturalness of human speech
communication. Our conversations would become unwieldy and unnatural if we
had to define every notion when it arises in the exchange. We have been able
to demonstrate that incorporating deep referential semantics in the dialog
management design helps substantially in creating a natural language
interface for the task of personal calendar management. We will extend the
use of referential semantics to the speech decoding process, as a way to
reduce recognition errors by exploiting the implied semantic constraints,
and to other tasks such as school course enrollment.
- A natural language speech server for multi-channel multi-modal communications
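
To make the finite-state language model mentioned under discrete sequence
representation more concrete, the following is a minimal sketch of a bigram
(first-order Markov) model, in which each word acts as a state and each
observed word pair as a weighted transition. The toy corpus, the function
names train_bigram and log_prob, and the add-one smoothing are illustrative
assumptions only; they are not the models or data used in our systems.

```python
import math
from collections import defaultdict


def train_bigram(sentences):
    """Count word-pair transitions; each word acts as a state of the automaton."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts, vocab


def log_prob(sentence, counts, vocab):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        row = counts.get(prev, {})
        numerator = row.get(cur, 0) + 1
        denominator = sum(row.values()) + len(vocab)
        total += math.log(numerator / denominator)
    return total


# Hypothetical toy corpus, loosely modeled on a calendar task.
corpus = [
    "schedule a meeting on monday",
    "cancel the meeting on friday",
    "schedule a call on friday",
]
counts, vocab = train_bigram(corpus)

# A sentence not in the corpus, but built entirely from observed transitions.
seen = log_prob("schedule a meeting on friday", counts, vocab)
# The same words in reverse order: none of these transitions were observed.
scrambled = log_prob("friday on meeting a schedule", counts, vocab)
```

Under these assumptions the model prefers the well-formed sentence over the
scrambled one, yet it conditions on only one preceding word. Relationships
between symbols that are farther apart are invisible to it, which is the
limitation that motivates looking beyond Markovian representations.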