Advancing Machine Learning for Small Molecule Property Prediction

Francoeur, Paul Glidden (2024) Advancing Machine Learning for Small Molecule Property Prediction. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



Recently, machine learning (ML) models have rapidly become the state of the art for various molecular property prediction tasks. Because ML models are fast without sacrificing accuracy, they are especially attractive in screening contexts, where a large number of candidate molecules needs to be reduced to a number feasible for experimental testing. However, the black-box nature and rapid advancement of ML models have resulted in a proliferation of input representations and model architectures. This makes it difficult to select the "best" model architecture and input representation for a given task. Additionally, while ML models thrive on large training datasets, the number of labeled structures for properties such as receptor-ligand binding affinity is small.

This work sets out to help address these two problems with ML models for molecular property prediction. First, a wide variety of molecular input representations and ML model architectures were trained to predict calculated molecular properties. Characterizing both the performance of these models and how well they utilize the training data yields suggestions on how best to select an ML approach for more realistic property prediction tasks, given the amount of compute resources and training data available. Next, to address the lack of labeled structural data, a new dataset, CrossDocked2020, was created to expand the binding pose classification data available from the PDBbind dataset. By docking ligands into non-cognate but similar receptors, we were able to expand the ~200,000 poses available from the PDBbind General set into ~22.5 million poses in CrossDocked2020. Various data imputation techniques were then explored to see if they could improve the binding affinity regression of a convolutional neural network (CNN) on CrossDocked2020. Using an ensemble of CNN models to impute the missing binding affinity labels of complexes in CrossDocked2020 yielded a small but significant improvement in model performance. Lastly, to further support that the knowledge from this work is applicable in the real world, the CNN developed in this work was used to identify a small molecule to disrupt the actin-profilin1 protein-protein binding complex.


Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators:
Francoeur, Paul Glidden (Email: paf46@pitt.edu; Pitt Username: paf46; ORCID: 0000-0002-1440-567X)
ETD Committee:
Thesis Advisor: Koes, David Ryan (Email: dkoes@pitt.edu; Pitt Username: dkoes; ORCID: 0000-0002-6892-6614)
Committee Chair: Bahar, Ivet (Email: bahar@laufercenter.org; ORCID: 0000-0001-9959-4176)
Committee Member: Wang, Junmei (Email: juw79@pitt.edu; Pitt Username: juw79; ORCID: 0000-0002-9607-8229)
Committee Member: Isayev, Olexandr (Email: olexandr@cmu.edu; ORCID: 0000-0001-7581-8497)
Committee Member: Walters,
Date: 5 March 2024
Date Type: Publication
Defense Date: 21 August 2023
Approval Date: 5 March 2024
Submission Date: 27 October 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 162
Institution: University of Pittsburgh
Schools and Programs: School of Medicine > Computational Biology
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Machine Learning, Drug Discovery, Computational Biology, Neural Network, Docking, Transformer, CrossDocking, CNN, Convolutional Neural Networks, Molecular Fingerprints
Date Deposited: 05 Mar 2024 18:15
Last Modified: 05 Mar 2024 18:15

