A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets

Seo, Songwon (2006) A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. Master's Thesis, University of Pittsburgh. (Unpublished)

Abstract

Most real-world data sets contain outliers: values that are unusually large or small compared with the rest of the data. Outliers can distort analyses that rest on distributional assumptions, such as ANOVA and regression, or they can carry useful information when the unusual response itself is the object of study. In both cases, outlier detection is an important part of data analysis. Several outlier labeling methods have been developed. Some, like the SD method, are sensitive to extreme values; others, like Tukey's method, are resistant to them. Although these methods are quite powerful with large normal data sets, applying them to non-normal data or small samples can be problematic without knowledge of how they behave in those circumstances. This is because each labeling method uses a different measure to detect outliers, and the expected percentage of outliers changes differently with the sample size and distribution type of the data. Public health data are often skewed, usually to the right, and lognormal distributions can often be applied to such data, for instance, surgical procedure times, blood pressure, and assessments of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and shows, through simulations and application to real data sets, how the percentage of outliers flagged by each method changes with the skewness and sample size of lognormal distributions. These results may help establish guidelines for choosing outlier detection methods for skewed data, which are often seen in the public health field.
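As a rough illustration of the two labeling methods named in the abstract, the sketch below implements the SD method (mean ± k·SD) and Tukey's fences (Q1 − k·IQR, Q3 + k·IQR), then estimates the average percentage of points each one flags in simulated lognormal samples. It is not the thesis's own code. The function names, the multipliers `k=2.0` and `k=1.5`, the quartile rule used by `statistics.quantiles`, and the simulation settings (`sigma`, `n`, `reps`, `seed`) are all assumptions chosen for illustration.

```python
import random
import statistics


def sd_bounds(values, k=2.0):
    """SD method: label values outside mean +/- k * SD (k is an assumed choice)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return mean - k * sd, mean + k * sd


def tukey_fences(values, k=1.5):
    """Tukey's method: label values outside Q1 - k*IQR and Q3 + k*IQR."""
    # statistics.quantiles uses its default 'exclusive' rule here;
    # other quartile definitions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def percent_flagged(values, bounds):
    """Percentage of values falling outside the (low, high) bounds."""
    low, high = bounds
    flagged = sum(1 for v in values if v < low or v > high)
    return 100.0 * flagged / len(values)


def simulate(sigma, n, reps=200, seed=0):
    """Average percent flagged by each method over lognormal(0, sigma) samples.

    Larger sigma gives a more right-skewed distribution.
    Returns (sd_method_percent, tukey_percent).
    """
    rng = random.Random(seed)
    sd_total = 0.0
    tukey_total = 0.0
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        sd_total += percent_flagged(sample, sd_bounds(sample))
        tukey_total += percent_flagged(sample, tukey_fences(sample))
    return sd_total / reps, tukey_total / reps


if __name__ == "__main__":
    # Hypothetical settings: vary skewness (sigma) and sample size (n).
    for sigma in (0.2, 0.6, 1.0):
        for n in (20, 100):
            sd_pct, tukey_pct = simulate(sigma, n)
            print(f"sigma={sigma} n={n} SD: {sd_pct:.2f}% Tukey: {tukey_pct:.2f}%")
```

Running the script prints one line per combination of `sigma` and `n`, which gives a quick feel for how the flagged percentage moves with skewness and sample size under these assumed settings. The mean and SD are themselves pulled by extreme values while quartiles are not, which is the sensitivity contrast the abstract describes.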

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Seo, Songwon	songwon@gmail.com

ETD Committee:

Title	Member	Email Address	Pitt Username
Committee Chair	Marsh, Gary M.	gmarsh@pitt.edu	GMARSH
Committee Member	Cassidy, Laura	lcs3@pitt.edu	LCS3
Committee Member	Sharma, Ravi K.	rks1946@pitt.edu	RKS1946

Date:

9 August 2006

Date Type:

Completion

Defense Date:

26 April 2006

Approval Date:

9 August 2006

Submission Date:

25 May 2006

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Institution:

University of Pittsburgh

Schools and Programs:

School of Public Health > Biostatistics

Degree:

MS - Master of Science

Thesis Type:

Master's Thesis

Refereed:

Yes

Uncontrolled Keywords:

boxplot; lognormal; outlier; skewed distribution

Other ID:

http://etd.library.pitt.edu/ETD/available/etd-05252006-081925/, etd-05252006-081925

Date Deposited:

10 Nov 2011 19:45

Last Modified:

15 Nov 2016 13:43

URI:

http://d-scholarship.pitt.edu/id/eprint/7948
