Author

Jo A. Bill

Abstract

This research evaluates pattern recognition techniques on a subclass of big data in which the dimensionality of the input space, p, is much larger than the number of observations, n. Seven gene-expression microarray cancer datasets, each with a ratio κ = n/p less than one, were chosen for evaluation. The statistical and computational challenges inherent in this type of high-dimensional, low-sample-size (HDLSS) data were explored. The capability and performance of a diverse set of machine learning algorithms are presented and compared. The sparsity and collinearity of the data, together with the complexity of the algorithms studied, demanded rigorous and careful tuning of the hyperparameters and regularization parameters. This in turn required investigating several extensions of cross-validation, with the aim of achieving the best predictive performance.

For the techniques evaluated in this thesis, regularization or kernelization, and often both, produced lower classification error rates than randomized ensembles on all datasets used in this research. However, no single technique evaluated for classifying HDLSS microarray cancer data emerged as universally best with respect to the generalization error.¹

From the empirical analysis performed in this thesis, the following practices emerged as instrumental in consistently achieving lower error rates when estimating the generalization error on this HDLSS microarray cancer data:

• Thoroughly investigate and understand the data.

• Stratify during all sampling due to the uneven classes and extreme sparsity of this data.

• Perform 3 to 5 replicates of stratified cross-validation, implementing an adaptive K-fold, to determine the optimal tuning parameters.

• To estimate the generalization error in HDLSS data, replication is paramount. Replicate R=500 or R=1000 times with training and test sets of 2/3 and 1/3, respectively, to get the best generalization error estimate.

• Whenever possible, obtain an independent validation dataset.

• Seed the random sampling so that every technique is evaluated on the same data splits, giving a fair and unbiased comparison among techniques.

• Define a methodology or standard set of process protocols to apply to machine learning research. This would prove very beneficial in ensuring reproducibility and would enable better comparisons among techniques.
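The stratification, replication, and seeding recommendations above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the thesis: the function names (`stratified_split`, `replicated_test_error`, `make_toy_data`), the synthetic data, and the nearest-centroid classifier are all assumptions introduced here as stand-ins for the real datasets and for the regularized and kernel methods the thesis evaluates. Only the 2/3 and 1/3 split, the replication count R, and the use of a seed come from the list above.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_split(labels, test_fraction, rng):
    """Split indices into train/test, sampling each class separately."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)
        # Aim for the requested fraction, but never empty the training side.
        n_test = max(1, round(len(idx) * test_fraction))
        n_test = min(n_test, len(idx) - 1)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def nearest_centroid_fit(X, y):
    """Stand-in classifier (an assumption): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def nearest_centroid_predict(centroids, row):
    """Predict the class whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def replicated_test_error(X, y, fit, predict,
                          replicates=500, test_fraction=1 / 3, seed=2015):
    """Average test error over R seeded, stratified 2/3 - 1/3 splits."""
    rng = random.Random(seed)  # same seed -> same splits for every technique
    errors = []
    for _ in range(replicates):
        train_idx, test_idx = stratified_split(y, test_fraction, rng)
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return mean(errors), errors


def make_toy_data(n_per_class=(20, 10), p=200, shift=1.0, seed=0):
    """Synthetic HDLSS-like data (n < p, uneven classes); purely illustrative."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in enumerate(n_per_class):
        for _ in range(n):
            # Only the first 10 features carry signal; the rest are noise.
            X.append([rng.gauss(shift * label if j < 10 else 0.0, 1.0)
                      for j in range(p)])
            y.append(label)
    return X, y
```

Calling `replicated_test_error(X, y, nearest_centroid_fit, nearest_centroid_predict, replicates=500)` on data from `make_toy_data()` returns the mean test error and the per-replicate errors. Passing the same `seed` when evaluating a second technique reuses exactly the same sequence of splits, which is what makes the comparison between techniques fair.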

_____

¹ A predominant portion of this research was published in the Serdica Journal of Computing (Volume 8, Number 2, 2014) as proceedings from the 2014 Flint International Statistical Conference at Kettering University, Michigan, USA.

Library of Congress Subject Headings

Cancer--Data processing; Machine learning; Pattern recognition systems

Publication Date

7-2015

Document Type

Thesis

Student Type

Graduate

Degree Name

Applied Statistics (MS)

Department, Program, or Center

The John D. Hromi Center for Quality and Applied Statistics (KGCOE)

Advisor

Ernest Fokoue

Advisor/Committee Member

Steven LaLonde

Advisor/Committee Member

Daniel Lawrence

Comments

Physical copy available from RIT's Wallace Library at RC267 .B45 2015

Campus

RIT – Main Campus

Plan Codes

APPSTAT-MS
