Abstract

In this thesis we explore different Spectral Regression Estimators in order to solve the prob- lem in regression where we have multiple columns that are linearly dependent: We explore two scenarios

• Scenario 1: p << n where there exists at least two columns; xj and xk that are nearly linearly dependent which indicates co-linearity and X⊤X becomes near singular.

• Scenario 2: n << p since there are more predictors than observations so some columns must be a linear combination of another column which indicates linear dependence.

The scenarios give us an ill conditioned matrix of X⊤X (when solving the normal equa- tion) due to collinearity issues and the matrix becomes singular and makes the least squares estimate unstable and impossible to compute. In the paper, we explore different methods (variable selection, regularization, compression and dimensionality reduction) that solves the above issue. For variable selection techniques, we use Stepwise Selection Regression as well as the method of Best Subset Selection regression. Two approaches for Stepwise Se- lection regression are assessed in the paper: Forward Selection and Backward Elimination. Performance assessment of our regression models will be made based on criterion based procedures like AIC,BIC,R2,R2 adjusted and the Mallow’s CP statistic. In chapter three of this paper we introduce the concepts of General Regularization, Ridge Regression as well

as subsequent shrinkage methods such as the Lasso, Bayesian Lasso and the Elastic net. Chapter five will look at Compression and Dimensionality reduction procedures which are outlined via SVD (Singular Value Decomposition) and Eigenvector Decomposition. Hard thresholding is subsequently introduced via SPCA (Sparse Principle Component Analysis) and a novel approach using RPCA (Robust Principle Component Analysis). Furthermore, RPCA also shows how it can aid with data and image compression. The basis of this study is concluded with an empirical exploration of all the methods outlined above using several performance indicators on simulated data and real data sets. Assessment of the data sets is done via cross-validation. We determine the optimal values of the settings and then evalu- ate the predictive and explanatory performance.

Publication Date

5-3-2019

Document Type

Thesis

Student Type

Graduate

Degree Name

Applied Statistics (MS)

Department, Program, or Center

School of Mathematical Sciences (COS)

Advisor

Ernest Fokoue

Advisor/Committee Member

Robert Parody

Advisor/Committee Member

Joseph Voelkel

Campus

RIT – Main Campus

Share

COinS