Abstract

As the world moves towards a more globalized scenario, it has brought along with it the extinction of several languages. It has been estimated that over the next century, over half of the world's languages will be extinct, and an alarming 43% of the world's languages are at different levels of endangerment or extinction already. The survival of many of these languages depends on the pressure imposed on the dwindling speakers of these languages. Often there is a strong correlation between endangered languages and the number and quality of recordings and documentations of each. But why do we care about preserving these less prevalent languages? The behavior of cultures is often expressed in the form of speech via one's native language. The memories, ideas, major events, practices, cultures and lessons learnt, both individual as well as the community's, are all communicated to the outside world via language. So, language preservation is crucial to understanding the behavior of these communities.

Deep learning models have been shown to dramatically improve speech recognition accuracy but require large amounts of labelled data. Unfortunately, resource constrained languages typically fall short of the necessary data for successful training. To help alleviate the problem, data augmentation techniques fabricate many new samples from each sample. The aim of this master's thesis is to examine the effect of different augmentation techniques on speech recognition of resource constrained languages. The augmentation methods being experimented with are noise augmentation, pitch augmentation, speed augmentation as well as voice transformation augmentation using Generative Adversarial Networks (GANs). This thesis also examines the effectiveness of GANs in voice transformation and its limitations. The information gained from this study will further augment the collection of data, specifically, in understanding the conditions required for the data to be collected in, so that GANs can effectively perform voice transformation. Training of the original data on the Deep Speech model resulted in 95.03% WER. Training the Seneca data on a Deep Speech model that was pretrained on an English dataset, reduced the WER to 70.43%. On adding 15 augmented samples per sample, the WER reduced to 68.33%. Finally, adding 25 augmented samples per sample, the WER reduced to 48.23%. Experiments to find the best augmentation method among noise addition, pitch variation, speed variation augmentation and GAN augmentation revealed that GAN augmentation performed the best, with a WER reduction to 60.03%.

Publication Date

11-2019

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)

Advisor

Raymond Ptucha

Advisor/Committee Member

Cecilia Ovesdotter Alm

Advisor/Committee Member

Andres Kwasinski

Campus

RIT – Main Campus

Share

COinS