Investigation of methods for machine learning associations between genetic variations and phenotype
The relationship between genetics and phenotype is complex and remains poorly understood, and many factors shape how genetic variations relate to differences in phenotype. An improved understanding of the genetic underpinnings of various phenotypes can help us make important advances in testing for, preventing, treating, and curing a number of diseases and disorders.
The recent popularization of direct-to-consumer sequencing services, coupled with consumers releasing their genetic information for public use, has led to an unprecedented level of access to genetic information. Crowd-sourcing the problem of developing robust genome-wide association techniques for ever larger amounts of data is a promising trend.
This thesis explores candidate methods for mining one such public genetic data repository, openSNP, for correlated genotypes and phenotypes. Particular care is given to data clean-up and the steps required to preprocess public data for machine learning. The preprocessing methods are detailed so that they can be applied to other existing genetic data repositories, such as the Personal Genome Project, as well as to repositories that may become available in the future. Following data clean-up, a number of machine learning techniques are investigated, applied, and assessed for their utility in a big-data problem of this kind. No single machine learning approach was found to be sufficient: the combination of imbalanced phenotype response classes and an underdetermined system made for a difficult machine learning challenge. Additional techniques must be explored or developed to make such genome-wide association studies possible and meaningful.
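The two obstacles named in the abstract can be illustrated with a small sketch. It is not taken from the thesis: the eye-color labels, the 90/10 split, and the SNP count are hypothetical values chosen only for illustration, and "underdetermined" is read here in the usual sense of having more features than samples. The sketch shows how a classifier that always predicts the most common phenotype can look accurate while learning nothing.

```python
from collections import Counter


def majority_baseline(labels):
    """Score a classifier that always predicts the most common label.

    Returns (majority_label, accuracy, balanced_accuracy).
    """
    counts = Counter(labels)
    majority, majority_count = counts.most_common(1)[0]
    # Plain accuracy rewards the baseline for the size of the majority class.
    accuracy = majority_count / len(labels)
    # Balanced accuracy is the mean per-class recall: recall is 1 for the
    # majority class and 0 for every other class.
    balanced_accuracy = 1 / len(counts)
    return majority, accuracy, balanced_accuracy


# Hypothetical toy data, not drawn from openSNP.
labels = ["brown"] * 90 + ["blue"] * 10
n_samples = len(labels)
n_snps = 5000  # hypothetical number of SNP features per individual

majority, acc, bal_acc = majority_baseline(labels)
# More features than samples: many models can fit the training data exactly.
is_underdetermined = n_snps > n_samples

print(f"always predict {majority!r}: accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
print(f"{n_snps} features for {n_samples} samples, underdetermined: {is_underdetermined}")
```

Under these assumptions, plain accuracy alone would overstate the usefulness of a model, which is one reason imbalance-aware metrics are commonly reported alongside it.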
Library of Congress Subject Headings
Machine learning; Phenotype--Data processing; Human genetics--Variation--Data processing
Department, Program, or Center
Thomas H. Gosnell School of Life Sciences (COS)
Gary R. Skuse
Rajendra K. Raj
Hartung, Amanda M., "Investigation of methods for machine learning associations between genetic variations and phenotype" (2016). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus
Physical copy available from RIT's Wallace Library at Q325.5 .H37 2016