Researchers Look at Issue of Data Redundancy in Machine Learning
Work by a group of researchers at the University of Kentucky’s Sanders-Brown Center on Aging was recently published in Genes. The article looks at the use of data mining and machine learning in research.
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) contains extensive patient measurements (magnetic resonance imaging (MRI), biometrics, RNA expression, etc.) from Alzheimer’s disease cases and controls, and these data have recently been used by machine learning algorithms to evaluate Alzheimer’s disease onset and progression. While using a variety of biomarkers is essential to Alzheimer’s disease research, highly correlated input features can significantly decrease the generalizability and performance of machine learning models. Redundant features also unnecessarily increase the computational time and resources needed to train predictive models.
Justin Miller, Ph.D., assistant professor in the UK College of Medicine, directed this work in collaboration with Mark Ebbert, Ph.D., also an assistant professor in the UK College of Medicine, and staff scientists Erik Huckvale and Matthew Hodgman. Together, they used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset and to determine how much this issue might affect large-scale analyses using these data. Miller says they found that more than 90% of the biomarkers, gene expression data, and MRI data included in the ADNI dataset are very highly correlated with at least one other datatype, which could pose unforeseen challenges for using machine learning to identify patterns across the diverse data available in that dataset.
In this publication, Miller and his colleagues provide mappings of the highly correlated features so that future studies can consider this feature correlation and improve machine learning accuracy and efficiency in Alzheimer’s disease research.
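To make the idea of a correlated-feature mapping concrete, here is a minimal sketch in Python. It is not the method used in the publication: the Pearson correlation measure, the 0.9 threshold, the function names `correlated_feature_map` and `drop_redundant`, and the four toy feature names are all assumptions chosen for illustration. It also builds a full correlation matrix, which is only practical for a small number of features, not the hundreds of thousands described above.

```python
import numpy as np

def correlated_feature_map(X, names, threshold=0.9):
    """Map each feature to the other features whose |Pearson r| >= threshold.

    X is a (samples x features) array; threshold is an illustrative assumption.
    """
    corr = np.corrcoef(X, rowvar=False)   # columns are treated as features
    np.fill_diagonal(corr, 0.0)           # ignore each feature's self-correlation
    mapping = {}
    for i, name in enumerate(names):
        partners = [names[j] for j in np.flatnonzero(np.abs(corr[i]) >= threshold)]
        if partners:
            mapping[name] = partners
    return mapping

def drop_redundant(mapping, names):
    """Greedy pass: keep a feature, then skip everything it is mapped to."""
    kept, dropped = [], set()
    for name in names:
        if name in dropped:
            continue
        kept.append(name)
        dropped.update(mapping.get(name, []))
    return kept

# Hypothetical toy data: two features are near copies of the first one.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([
    a,                                        # mri_a
    2.0 * a + 0.01 * rng.normal(size=200),    # mri_b, nearly a rescaled mri_a
    rng.normal(size=200),                     # expr_c, independent
    -a + 0.01 * rng.normal(size=200),         # bio_d, nearly a negated mri_a
])
names = ["mri_a", "mri_b", "expr_c", "bio_d"]

mapping = correlated_feature_map(X, names)
kept = drop_redundant(mapping, names)
print(mapping)
print(kept)
```

In this toy data, the two features built from `mri_a` are flagged as redundant with it, so a model could be trained on the kept features alone. Which feature to keep from each correlated group is a design choice; this sketch simply keeps the first one it encounters.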
“Feature correlation has always been an issue in large datasets, but it was previously unknown the extent to which this issue permeated the Alzheimer’s Disease Neuroimaging dataset,” said Miller. “This research will help improve data mining accuracy and efficiency in the ADNI dataset. Machine learning is a promising avenue of research to identify patterns that can one day improve patient care. This research lays the groundwork for those future analyses.”
This work was supported by the BrightFocus Foundation under Award Number A2020118F. Research reported in this publication was also supported by the National Institute on Aging of the National Institutes of Health under Award Numbers R01AG046171, RF1AG051550 and 3U01AG024904-09S4. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.