Advancing Machine Learning for Small Molecule Property Prediction

Francoeur, Paul Glidden (2024) Advancing Machine Learning for Small Molecule Property Prediction. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Abstract

Recently, machine learning (ML) models have rapidly become the state of the art at various molecular property prediction tasks. The speed of ML models, achieved without sacrificing accuracy, makes them especially attractive in screening contexts, where a large number of potential molecules needs to be reduced to a number feasible for experimental testing. However, the black box nature and rapid advancement of ML models have resulted in a proliferation of input representations and model architectures. This makes selection of the "best" model architecture and input representation for a given task difficult. Additionally, while ML models thrive on having large datasets for training, the number of labeled structures for properties like receptor-ligand binding affinity is small.

This work sets out to help address these two problems with ML models for molecular property prediction. First, a wide variety of molecular input representations and ML model architectures were trained to predict calculated molecular properties. The characterization of both the performance of these models and how well they utilize the training data yields suggestions on how best to select an ML approach for more realistic property prediction tasks, given the amount of compute resources and training data available. Next, in order to address the lack of labeled structural data, a new dataset, CrossDocked2020, was created by expanding the PDBbind dataset to increase the available binding pose classification data. By docking ligands into non-cognate, but similar, receptors we were able to expand the ~200,000 poses available from the PDBbind General set into ~22.5 million poses in CrossDocked2020. Various data imputation techniques were then explored to see if they could improve the binding affinity regression of a convolutional neural network (CNN) on CrossDocked2020. Using an ensemble of CNN models to impute the missing binding affinity labels of complexes in CrossDocked2020 yielded a small but significant improvement in model performance. Lastly, in order to give further support that the knowledge from this work is applicable in the real world, the CNN developed in this work was utilized to identify a small molecule to disrupt the actin-profilin1 protein-protein binding complex.
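
As an illustration of the ensemble imputation idea mentioned above, the following is a minimal sketch, not the dissertation's implementation. It assumes that a missing affinity label is replaced by the mean of the ensemble members' predictions; the function name `impute_missing_labels`, the use of `None` to mark a missing label, and all numeric values are hypothetical.

```python
from statistics import mean


def impute_missing_labels(labels, ensemble_predictions):
    """Fill missing labels with the mean prediction of an ensemble.

    labels: list of floats, with None marking a complex that has no
        measured binding affinity (hypothetical convention).
    ensemble_predictions: one list of predictions per ensemble member,
        each aligned with `labels`.
    """
    imputed = []
    for i, label in enumerate(labels):
        if label is None:
            # Assumption: average the members' predictions for this complex.
            imputed.append(mean(preds[i] for preds in ensemble_predictions))
        else:
            # Measured labels are kept unchanged.
            imputed.append(label)
    return imputed


# Toy data: four complexes, two of which lack a measured affinity.
labels = [6.5, None, 7.25, None]
ensemble_predictions = [
    [6.0, 5.0, 7.0, 4.0],  # hypothetical model 1
    [7.0, 6.0, 7.5, 5.0],  # hypothetical model 2
    [6.5, 5.5, 7.0, 4.5],  # hypothetical model 3
]
filled = impute_missing_labels(labels, ensemble_predictions)
```

In this sketch the imputed labels would then be used alongside the measured ones when retraining the regression model; how the dissertation weights or filters imputed labels is not stated in the abstract.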

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Francoeur, Paul Glidden	paf46@pitt.edu	paf46	0000-0002-1440-567X

ETD Committee:

Title	Member	Email Address	Pitt Username	ORCID
Thesis Advisor	Koes, David Ryan	dkoes@pitt.edu	dkoes	0000-0002-6892-6614
Committee Chair	Bahar, Ivet	bahar@laufercenter.org		0000-0001-9959-4176
Committee Member	Wang, Junmei	juw79@pitt.edu		0000-0002-9607-8229
Committee Member	Isayev, Olexandr	olexandr@cmu.edu		0000-0001-7581-8497
Committee Member	Walters, Patrick	pwalters@relaytx.com

Date:

5 March 2024

Date Type:

Publication

Defense Date:

21 August 2023

Approval Date:

5 March 2024

Submission Date:

27 October 2023

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

162

Institution:

University of Pittsburgh

Schools and Programs:

School of Medicine > Computational Biology

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

Machine Learning, Drug Discovery, Computational Biology, Neural Network, Docking, Transformer, CrossDocking, CNN, Convolutional Neural Networks, Molecular Fingerprints

Date Deposited:

05 Mar 2024 18:15

Last Modified:

05 Mar 2024 18:15

URI:

http://d-scholarship.pitt.edu/id/eprint/45455
