Dec. 19, 2024

UK researcher using AI to unravel proteins, advance medical research

Qing Shao at desk

How can artificial intelligence (AI) revolutionize the fight against some of the world’s most devastating diseases, such as cancer and Alzheimer’s?

At the University of Kentucky, one researcher is harnessing the power of AI to uncover answers that could transform treatment, improve outcomes and give hope to millions.

Qing Shao, Ph.D., an assistant professor in the Department of Chemical and Materials Engineering in the Stanley and Karen Pigman College of Engineering, has been awarded more than $1.3 million from the National Library of Medicine (NLM), part of the National Institutes of Health (NIH), to develop large protein language models for biomedical applications. This marks the first NIH award at UK focusing on AI for protein research.

Shao’s project, “Structure-Function-Aware Large Protein Language Models for Enhanced Biomedical Applications,” aims to develop AI models that can predict critical information about proteins, which could help in understanding diseases and discovering new treatments.

More specifically, the project has two parts: building biochemical and biophysical knowledge of proteins into the language models, and developing efficient training approaches that let the models learn from small amounts of data. Together, these could yield more accurate predictions about how proteins behave in the body.

“This research teaches AI the knowledge of proteins so it can design proteins with the desired functions or predict protein properties better,” Shao said.

Large protein language models have become foundational to biomedical research, but their application faces two prominent roadblocks. The first is the absence of critical knowledge of protein structure and function within the existing models. The second is the difficulty in adapting these models for specific biomedical tasks without losing their general ability to work across different types of problems.

Shao noted that understanding the 3D shapes of proteins is a critical missing piece, as their structure plays a key role in determining their functions.

“The current large protein language models inherit from the language models developed for ‘languages’ like English or German,” Shao explained. “They treat protein as another ‘language’ like English but do not consider proteins too much as molecules that possess 3D shapes and can move.”

The solution Shao and his team have proposed is to create AI models that combine protein 3D structures with protein sequences to generate better representations of proteins. Computer-based simulations of molecular interactions will be used to create accurate 3D models of proteins, and those structures will then be combined with sequence data to improve predictions about how proteins work.
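To make the idea of combining sequence and structure concrete, here is a minimal illustrative sketch in Python. It is not the method used by Shao's team, whose techniques the article does not detail. It assumes a toy setup in which each amino acid is described by a one-hot sequence feature plus one structure feature: the number of other residues within 8 angstroms. The function names, the cutoff and the coordinates are all hypothetical.

```python
import math

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Sequence feature: a 20-dimensional one-hot vector for one amino acid."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def contact_counts(coords, cutoff=8.0):
    """Structure feature: for each residue, how many other residues lie
    within `cutoff` angstroms (the cutoff is an assumed, illustrative value)."""
    counts = []
    for i, a in enumerate(coords):
        near = sum(
            1
            for j, b in enumerate(coords)
            if i != j and math.dist(a, b) <= cutoff
        )
        counts.append(float(near))
    return counts


def combined_representation(sequence, coords, cutoff=8.0):
    """Join sequence and structure features into one vector per residue."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coordinates must have the same length")
    contacts = contact_counts(coords, cutoff)
    return [one_hot(res) + [c] for res, c in zip(sequence, contacts)]


# Toy example: three residues with made-up 3D coordinates (in angstroms).
sequence = "ACD"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
features = combined_representation(sequence, coords)
```

Each residue ends up with a 21-number description: 20 values that say which amino acid it is and one value that reflects how crowded its 3D neighborhood is. A real model would learn far richer representations, but the principle of feeding both kinds of information to the model is the same.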

“Our approach was initiated by considering proteins as molecules,” Shao said. “We are developing methods to implement these pieces of missing knowledge into language models.”

The key to the project’s success, however, will be developing techniques to represent biochemical and biophysical knowledge of proteins and incorporate that knowledge into the AI models. The resulting deep learning tools could prove invaluable to researchers working on treatments for devastating diseases.

“Biomedical researchers would use these models to design the proteins as a drug for diseases such as cancers and Alzheimer’s Diseases,” Shao stated. “The chemical engineers would use this model to design enzymes that convert biomass to useful chemicals.”

In particular, Shao’s models could hold much promise for designing protein-based drugs to fight cancer, and could lead to new methods of treating drug-resistant microorganisms. “It would help researchers discover proteins that cause bacterial resistance and cancer spread,” Shao said. “It would also help researchers design proteins or peptides as new medicines to battle against bacterial resistance and cancers.”

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM014510. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.