Project: DETONATION - Discriminative training of speaker-normalized models for automatic speech recognition
Person in Charge: doc. Dr. Ing. Jan Černocký
Host institution: Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology
Country of Origin: India
Country of scientific activity: India
Project duration: 24 months
Scientific panel: Engineering and Information Science
Abstract:
The proposed project deals with automatic speech recognition. It builds on the applicant's experience with speaker normalization in speech recognition and on the experience of the Speech@FIT group with acoustic modeling and discriminative training in speech recognition. It proposes an investigation of discriminative training of speaker-normalized models, allowing more accurate speech recognition systems to be built that are better adapted to the target user. Particular attention will be devoted to the application of discriminatively trained speaker adaptation in the recently proposed subspace acoustic modeling of speech.
Related links:
Host institution:
Brno University of Technology (BUT) https://www.vutbr.cz/en/
BUT is the second largest technical university in the Czech Republic. It comprises 8 faculties with
more than 20,000 students and 2,000 staff members.
The Faculty of Information Technology (FIT: http://www.fit.vutbr.cz/.cs)
provides education in Bachelor and Master study programs in Computer Science and Engineering and a Doctoral study program in Information Technology. Research activities include multimodal interaction, speech recognition, natural language processing, human-computer interaction, knowledge representation and reasoning, semantic web technologies, information extraction, knowledge mining, and technology-enhanced learning. FIT is involved in international cooperation with more than 30 research and education centres in Europe, the USA, India and China. It has strong support from leading industrial companies involved in IT development (Siemens, IBM, Microsoft, ScanSoft, etc.).
There are five research groups at the Department of Computer Graphics and Multimedia of FIT (http://www.fit.vutbr.cz/units/UPGM/index.php.en), led by five senior researchers and staffed by about 15 post-docs and more than 50 postgraduate students. There are very few teams in the world that combine cutting-edge research and development in speech recognition, video processing, and semantic technologies. The group has achieved excellent results in various internationally recognized research competitions and challenges in these fields, and it has participated in many European as well as national projects. The most relevant ones include: AMIDA - Augmented Multiparty Interaction with Distance Access, DIRAC - Detection and Identification of Rare Audiovisual Cues, WeKnowIt - Emerging, Collective Intelligence for personal, organisational and social use, TA2 - Together Anywhere, Together Anytime, Caretaker - Content Analysis and Retrieval Technologies to Apply Knowledge Extraction to massive Recording, KiWi - Knowledge in Wiki, M-Eco - Medical Ecosystem, and Mobio - Mobile Biometry. The group cooperates with a wide range of industrial partners.
The team has a significant track record in developing advanced speech processing solutions. In strong competition with IBM, BBN and other key players in the field, it has achieved excellent results in various tracks of recent evaluation campaigns organized by the Multimodal Information Group of NIST (http://nist.gov/itl/iad/mig/). The group organizes scientific workshops attended by top researchers in their respective domains (see, e.g., http://speech.fit.vutbr.cz/en/workshops/bosaris-2010) and develops software that is widely used by the research community as well as integrated into various commercial solutions (http://speech.fit.vutbr.cz/software). In the area of video processing, the group is regularly among the top teams participating in the TrecVid competitions. It also has a long list of commercial applications employing the advanced image and video analysis tools developed by the group members.
Person in charge:
Jan Cernocky (Dr. 1998, Universite Paris XI) is an associate professor and the Head of the Department of Computer Graphics and Multimedia, FIT BUT. He has been involved in several European projects: SPEECHDAT-E (4th FP, technical coordination), SpeeCon and Multimodal Meeting Manager (M4, both 5th FP), and Augmented Multi-party Interaction (AMI, 6th FP), leading the efforts of FIT in speech recognition, keyword spotting, and multimodal data recordings and annotations. He has authored more than 40 papers in journals and at conferences. He has served as a reviewer for conferences and journals, including IEEE Transactions on Speech and Audio Processing. He is a member of the scientific board of FIT, the scientific board of the Text-Speech-Dialogue conference, the editorial board of the Radioengineering journal, and the board of the Czechoslovak section of IEEE. In 2011, he served as co-chair of a major signal processing conference, IEEE ICASSP 2011.
The objective of the project is to propose new techniques for discriminative training of speaker-normalized models for speech recognition, lay the mathematical foundations of these techniques, implement them as algorithms, and carefully test them on standard data. This will lead to more accurate speech recognition systems that are better adapted to the target user and at the same time use fewer parameters than current systems, which allows for more efficient implementations and straightforward practical use.
Brief summary of activities implemented within the project over the relevant period since the beginning of the project:
The main objective of the project is to develop new methodologies for speaker adaptation in the context of automatic speech recognition. In particular, we proposed to develop class-discriminative algorithms for speaker adaptation.
The developments are summarized below.
1. We have developed a new approach to speaker adaptation based on linear transformation of acoustic feature vectors. We refer to the method as Regional Feature-space Maximum Likelihood Linear Regression (R-FMLLR). R-FMLLR is motivated by the observation that the effect of inter-speaker variability on speech can vary significantly depending on the speech unit being uttered. It therefore aims to compensate for class-specific speech variability by means of class-specific linear transformations. We have developed a new feature-transformation model and an estimation procedure for the region-specific transforms that is both theoretically and computationally attractive; a schematic sketch of the transformation model is given at the end of this point.
On a large-vocabulary continuous speech recognition (LVCSR) conversational telephone speech (CTS) task, we show experimentally that the proposed method performs significantly better than a single global transformation.
The work will be submitted shortly to IEEE Signal Processing Letters under the title:
“Acoustic region-specific feature space transformation for speaker adaptation using quantized Gaussian posteriors,” by Shakti P. Rath, Lukas Burget, Martin Karafiat, Ondrej Glembek and Jan Cernocky.
A draft of this paper, containing the mathematical development, algorithms, implementation and experimental results in detail, is attached to this report. Some of the experimental results are outlined in the subsequent section.
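The following is a minimal Python/NumPy sketch of the transformation model only: frames are hard-assigned to acoustic regions via quantized GMM posteriors, and each region applies its own affine transform. The diagonal-covariance GMM and all variable names are illustrative assumptions; the actual estimation of the region-specific transforms follows the attached paper and is not reproduced here.

import numpy as np

def assign_regions(features, gmm_means, gmm_vars, gmm_weights):
    # Hard-assign each frame to the Gaussian (acoustic region) with the
    # highest posterior probability, i.e. a "quantized" posterior.
    n_frames = features.shape[0]
    log_post = np.zeros((n_frames, len(gmm_weights)))
    for k, (mu, var, w) in enumerate(zip(gmm_means, gmm_vars, gmm_weights)):
        diff = features - mu
        log_post[:, k] = (np.log(w)
                          - 0.5 * np.sum(np.log(2.0 * np.pi * var))
                          - 0.5 * np.sum(diff * diff / var, axis=1))
    return np.argmax(log_post, axis=1)

def apply_regional_transforms(features, regions, A, b):
    # Apply the affine transform (A[r], b[r]) of the region selected for each frame.
    adapted = np.empty_like(features)
    for r in np.unique(regions):
        idx = (regions == r)
        adapted[idx] = features[idx] @ A[r].T + b[r]
    return adapted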
2. We have developed a new way to estimate the speaker-adaptive linear transforms. We propose to impose a constraint on the structure of the linear transform, motivated by the fact that any linear transform can be factorized using the QR decomposition (illustrated in the short sketch below). In this framework, we show that a closed-form solution for the linear transform exists. The proposed method is therefore attractive, as it offers computational savings over F-MLLR, which relies on numerical (iterative) estimation algorithms.
In our initial experiments on the LVCSR CTS task, the method has shown promising results. We are now evaluating its performance more rigorously.
These developments will be submitted to the Interspeech 2012 conference (Portland, USA).
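As a small illustration of the factorization underlying the structural constraint, the following Python/NumPy snippet decomposes a stand-in F-MLLR-style transform into an orthogonal factor Q and an upper-triangular factor R. The matrix here is random and purely illustrative, and the closed-form estimation itself is not shown.

import numpy as np

dim = 39                                   # typical MFCC + deltas feature dimension
A = np.random.randn(dim, dim)              # stand-in for a trained F-MLLR transform
Q, R = np.linalg.qr(A)                     # factorize A = Q @ R

assert np.allclose(Q @ R, A)               # exact factorization
assert np.allclose(Q.T @ Q, np.eye(dim))   # Q is orthogonal
assert np.allclose(R, np.triu(R))          # R is upper triangular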
3. We have developed a scheme for multiple-parameter VTLN based on acoustic-class-specific (equivalently, acoustic-region-specific) frequency warping. We have devised a computationally efficient way to apply acoustic-class-specific frequency warping and have evaluated its performance on speech recognition tasks. Acoustic classes are defined by dividing the entire acoustic space, consisting of all Gaussian components in the HMM set, into relatively homogeneous classes (regions). A separate warp factor is estimated for each class, which effects acoustic-class-specific frequency scaling. We also use a regression class tree to ensure robust estimation of the class-specific warp factors when many classes are defined; a schematic sketch is given below.
The work will be submitted shortly to IEEE Signal Processing Letters under the title:
“A Computationally Efficient Approach for Acoustic-class specific VTLN-warping using Regression Class Tree”. The experimental results are briefly summarized in the following section.
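A minimal Python/NumPy sketch of the warping step follows, assuming a simple piecewise-linear warping function and a hypothetical per-class warp-factor table with back-off to a parent (e.g. global) factor. The efficient application procedure and the actual regression class tree from the paper are not reproduced here.

import numpy as np

def piecewise_linear_warp(freqs, alpha, f_nyquist=4000.0, f_cut_ratio=0.85):
    # Scale frequencies by alpha below a cut-off, then interpolate linearly
    # so that the Nyquist frequency maps onto itself.
    f_cut = f_cut_ratio * f_nyquist
    return np.where(
        freqs <= f_cut,
        alpha * freqs,
        alpha * f_cut + (f_nyquist - alpha * f_cut) * (freqs - f_cut) / (f_nyquist - f_cut),
    )

def class_warp_factor(class_id, class_alphas, class_counts, parent_alpha, min_count=100):
    # Back off to the parent (e.g. global) warp factor when a class has too few frames.
    if class_counts.get(class_id, 0) >= min_count:
        return class_alphas[class_id]
    return parent_alpha

# Example: warp mel-filter centre frequencies for one acoustic class of one speaker.
centres = np.linspace(100.0, 3900.0, 23)
alpha = class_warp_factor(class_id=7,
                          class_alphas={7: 0.94},
                          class_counts={7: 2500},
                          parent_alpha=1.0)
warped_centres = piecewise_linear_warp(centres, alpha)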
4. Apart from speaker adaptation, we are also investigating the use of the speaker modeling capabilities of the developed linear transforms (i.e., points 1 and 2 above) for the task of speaker identification. Based on the experimental results, we plan to submit the research outcomes to the Odyssey 2012 conference in Singapore.
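To indicate one straightforward way in which such transforms could serve as speaker features (analogous to MLLR-supervector approaches), the following Python/NumPy sketch vectorizes a speaker's transform parameters into a supervector and scores speakers by cosine similarity. This is an illustrative baseline assumption, not the system under development.

import numpy as np

def transform_supervector(A_list, b_list):
    # Stack all region transforms (A_r, b_r) of one speaker into a single vector.
    return np.concatenate([np.concatenate([A.ravel(), b.ravel()])
                           for A, b in zip(A_list, b_list)])

def cosine_score(v1, v2):
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def identify(test_vec, enrolled):
    # enrolled: dictionary mapping speaker id to enrollment supervector
    return max(enrolled, key=lambda spk: cosine_score(test_vec, enrolled[spk]))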