A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration.

Day, RS and McDade, KK (2013) A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration. BMC bioinformatics, 14. 223 - ?.

Abstract

In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices.

We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.

The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
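The abstract lists the ingredients (samples, two platforms, an association measure, a mixture model, a decision framework) without showing how they fit together. The following Python sketch is one possible illustration, not the authors' implementation. Everything specific in it is an assumption made for the example: Pearson correlation as the association measure, a Fisher z-transform, a two-component Gaussian mixture with the null component centered at zero, the gain and loss values, and the simulated mRNA/protein data. All function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation, used here as the (assumed) association measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def normal_pdf(z, mu, sd):
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_mixture(z, n_iter=200):
    """EM for the assumed mixture pi*N(mu1, sd1) + (1 - pi)*N(0, sd0).

    Returns, for each ID pair, the posterior probability of belonging to the
    non-null ("correctly mapped") component.
    """
    n = len(z)
    upper = sorted(z)[n // 2:]
    pi, mu1 = 0.5, sum(upper) / len(upper)
    sd0 = sd1 = max(math.sqrt(sum(v * v for v in z) / n), 1e-3)
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior membership in the non-null component.
        post = []
        for v in z:
            f1 = pi * normal_pdf(v, mu1, sd1)
            f0 = (1 - pi) * normal_pdf(v, 0.0, sd0)
            post.append(f1 / (f1 + f0 + 1e-300))
        # M-step: update weight, non-null mean, and both standard deviations.
        w = min(max(sum(post), 1e-9), n - 1e-9)
        pi = w / n
        mu1 = sum(p * v for p, v in zip(post, z)) / w
        sd1 = max(math.sqrt(sum(p * (v - mu1) ** 2 for p, v in zip(post, z)) / w), 1e-3)
        sd0 = max(math.sqrt(sum((1 - p) * v * v for p, v in zip(post, z)) / (n - w)), 1e-3)
    return post


def expected_utility(post, gain=1.0, loss=1.0):
    """Keep a pair only when p*gain - (1 - p)*loss is positive; sum over kept pairs."""
    return sum(p * gain - (1 - p) * loss for p in post if p * gain > (1 - p) * loss)


def simulate_z(n_pairs=200, n_samples=40, frac_correct=0.6, seed=0):
    """Simulated data only: correct pairs share signal, incorrect pairs do not."""
    rng = random.Random(seed)
    z, truth = [], []
    for _ in range(n_pairs):
        correct = rng.random() < frac_correct
        mrna = [rng.gauss(0, 1) for _ in range(n_samples)]
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        protein = [m + e for m, e in zip(mrna, noise)] if correct else noise
        z.append(math.atanh(pearson(mrna, protein)))  # Fisher z-transform
        truth.append(correct)
    return z, truth


if __name__ == "__main__":
    z, truth = simulate_z()
    post = fit_mixture(z)
    print("expected utility of this mapping:", round(expected_utility(post), 2))
```

Under these assumptions, two competing IDM or IDF methods could be compared by computing the expected utility of the ID pairs (or retained features) each one produces on the same samples; the method with the higher value would be preferred for the chosen gain and loss.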

Details

Item Type: Article
Status: Published

Creators/Authors:
Creators	Email	Pitt Username	ORCID
Day, RS	day01@pitt.edu	DAY01
McDade, KK	kkm5@pitt.edu	KKM5

Date: 1 January 2013
Date Type: Publication
Journal or Publication Title: BMC bioinformatics
Volume: 14
Page Range: 223 - ?
DOI or Unique Handle: 10.1186/1471-2105-14-223
Schools and Programs: School of Public Health > Biostatistics; School of Medicine > Biomedical Informatics
Refereed: Yes
Date Deposited: 07 Oct 2016 15:58
Last Modified: 05 Feb 2019 03:55
URI: http://d-scholarship.pitt.edu/id/eprint/29702
