Abstract

Automatic speech recognition (ASR) techniques have improved substantially over the past few years with the rise of new deep learning architectures. Recent sequence-to-sequence models have achieved high accuracy by utilizing the attention mechanism, which learns the relative importance of relationships between elements in a sequence. Despite being highly accurate, commercial ASR models have a weakness when it comes to accessibility: current commercial deep learning ASR models struggle to transcribe speech from individuals with atypical vocal features, such as those with dysarthria or heavy accents, as well as deaf and hard-of-hearing individuals.

Current methodologies for processing vocal data revolve around convolutional feature extraction layers, which dull the sequential nature of the data. Alternatively, reservoir computing has gained popularity for its ability to translate input data into evolving network states, preserving the overall feature complexity of the input. Echo state networks (ESNs), a type of reservoir computing mechanism employing a fixed, randomly initialized recurrent neural network, have shown promise in a number of time series classification tasks. This work explores the integration of ESNs into deep learning ASR models.
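The reservoir idea above can be sketched in a few lines: a fixed random recurrent network maps each input frame to a high-dimensional state that carries a fading memory of the sequence. The sketch below is illustrative only; the dimensions, leak rate, and spectral radius are assumptions, not the configuration used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 13 input features per frame, 200 reservoir units.
n_inputs, n_reservoir = 13, 200
spectral_radius, leak = 0.9, 0.3  # assumed hyperparameters

# Input and recurrent weights are random and remain untrained.
W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_inputs))
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
# Rescale so the largest eigenvalue magnitude is below 1 (echo state property).
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs):
    """Map an input sequence (T, n_inputs) to reservoir states (T, n_reservoir)."""
    x = np.zeros(n_reservoir)
    states = []
    for u in inputs:
        # Leaky-integrated state update: old state blended with new activation.
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)

states = run_reservoir(rng.normal(size=(50, n_inputs)))
```

Only a readout trained on these states (or, as in this work, downstream attention layers) is learned; the reservoir itself stays fixed, which is what makes ESNs cheap sequential feature extractors.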

The Listen, Attend and Spell (LAS) and Transformer models served as baseline architectures. A novel approach using an echo state network as a feature extractor was explored and evaluated on both architectures. The models were trained on 960 hours of LibriSpeech audio data and tuned on various atypical speech data, including the Torgo dysarthric speech dataset and the University of Memphis SPAL dataset. The ESN-based Echo, Listen, Attend, and Spell model produced more accurate transcriptions on the LibriSpeech test set than the ESN-based Transformer. The baseline Transformer model achieved a 43.4% word error rate on the Torgo test set after full network tuning. A prototype ASR system was developed that utilizes both the developed model and commercial smart assistant language models; the system runs on a Raspberry Pi 4 using the Assistant Relay framework.
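Word error rate, the metric reported above, is the standard ASR measure: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, normalized by reference length. A minimal sketch of the standard computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

A 43.4% WER thus means roughly 43 word-level errors per 100 reference words, which is typical of the difficulty dysarthric speech poses even for tuned models.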

Library of Congress Subject Headings

Automatic speech recognition--Technological innovations; Machine learning; Neural networks (Computer science); Pattern recognition systems; Articulation disorders--Data processing; Speech disorders--Data processing; English language--Pronunciation by foreign speakers--Data processing

Publication Date

5-2021

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)

Advisor

Cory Merkel

Advisor/Committee Member

Alexander Loui

Advisor/Committee Member

Michael Stinson

Campus

RIT – Main Campus

Plan Codes

CMPE-MS
