Francoeur, Paul Glidden
(2024)
Advancing Machine Learning for Small Molecule Property Prediction.
Doctoral Dissertation, University of Pittsburgh.
(Unpublished)
Abstract
Recently, machine learning (ML) models have rapidly become the state of the art for a variety of molecular property prediction tasks. Their speed, achieved without sacrificing accuracy, makes them especially attractive in screening contexts, where a large number of candidate molecules needs to be reduced to a number feasible for experimental testing. However, the black-box nature and rapid advancement of ML models have resulted in a proliferation of input representations and model architectures, which makes selecting the "best" model architecture and input representation for a given task difficult. Additionally, while ML models thrive on large training datasets, the number of labeled structures for properties such as receptor-ligand binding affinity is small.
This work sets out to help address these two problems with ML models for molecular property prediction. First, a wide variety of molecular input representations and ML model architectures were trained to predict calculated molecular properties. Characterizing both the performance of these models and how well they utilize the training data yields suggestions on how best to select an ML approach for more realistic property prediction tasks, given the compute resources and training data available. Next, to address the lack of labeled structural data, a new dataset, CrossDocked2020, was created to expand the binding pose classification data available in the PDBbind dataset. By docking ligands into non-cognate, but similar, receptors, we expanded the ~200,000 poses available from the PDBbind General set into ~22.5 million poses in CrossDocked2020. Various data imputation techniques were then explored to see whether they could improve the binding affinity regression of a convolutional neural network (CNN) on CrossDocked2020. Using an ensemble of CNN models to impute the missing binding affinity labels of complexes in CrossDocked2020 yielded a small but significant improvement in model performance. Lastly, to further support that the knowledge from this work is applicable in the real world, the CNN developed in this work was used to identify a small molecule to disrupt the actin-profilin1 protein-protein binding complex.
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Francoeur, Paul Glidden
Date: 5 March 2024
Date Type: Publication
Defense Date: 21 August 2023
Approval Date: 5 March 2024
Submission Date: 27 October 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 162
Institution: University of Pittsburgh
Schools and Programs: School of Medicine > Computational Biology
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Machine Learning, Drug Discovery, Computational Biology, Neural Network, Docking, Transformer, CrossDocking, CNN, Convolutional Neural Networks, Molecular Fingerprints
Date Deposited: 05 Mar 2024 18:15
Last Modified: 05 Mar 2024 18:15
URI: http://d-scholarship.pitt.edu/id/eprint/45455