A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets

Seo, Songwon (2006) A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. Master's Thesis, University of Pittsburgh. (Unpublished)

Abstract

Most real-world data sets contain outliers, values that are unusually large or small compared with the rest of the data. Outliers can distort analyses that rely on distributional assumptions, such as ANOVA and regression, or they can carry useful information when the unusual response is itself the object of study. In both cases, outlier detection is an important part of data analysis. Several outlier labeling methods have been developed. Some, like the SD method, are sensitive to extreme values; others, like Tukey's method, are resistant to them. Although these methods are quite powerful with large normal data, applying them to non-normal data or small samples can be problematic without knowing how they behave in those circumstances. This is because each labeling method uses a different measure to detect outliers, and the expected percentage of outliers changes differently with the sample size and distribution type of the data. Public health data are often skewed, usually to the right, and lognormal distributions can often be applied to such data, for instance surgical procedure times, blood pressure, and assessments of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and shows, through simulations and application to real data sets, how the percentage of outliers flagged by each method changes with the skewness and sample size of lognormal distributions. These results may help establish guidelines for choosing outlier detection methods for skewed data, which are often seen in the public health field.
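
As a rough illustration of the contrast the abstract draws, the sketch below applies the SD method and Tukey's method to simulated lognormal samples and reports the percentage of points each one labels as outliers. It is not the thesis's simulation design: the function names, the cutoffs (2 SD, 1.5 times the interquartile range), the lognormal shape values, the sample sizes, and the number of replicates are all assumptions chosen for illustration, and quartiles are computed with the default method of Python's statistics.quantiles, which may differ from the quartile definition used in the thesis.

```python
import random
import statistics


def sd_outliers(data, k=2.0):
    """SD method: label values farther than k sample SDs from the mean.

    k=2.0 is an assumed cutoff; 3.0 is another common choice.
    """
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    lo, hi = mean - k * sd, mean + k * sd
    return [x for x in data if x < lo or x > hi]


def tukey_outliers(data, k=1.5):
    """Tukey's method: label values beyond k * IQR outside the quartiles.

    k=1.5 is the conventional fence multiplier, assumed here.
    """
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]


def percent_flagged(method, sigma, n, reps=200, seed=0):
    """Average percentage of points labeled as outliers by `method`
    over `reps` lognormal samples of size n (log-scale mean 0, SD sigma).
    Larger sigma means stronger right skew.
    """
    rng = random.Random(seed)
    flagged = 0
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        flagged += len(method(sample))
    return 100.0 * flagged / (reps * n)


if __name__ == "__main__":
    # Hypothetical grid of skewness (sigma) and sample size (n).
    for sigma in (0.2, 0.6, 1.0):
        for n in (20, 100, 500):
            sd_pct = percent_flagged(sd_outliers, sigma, n)
            tk_pct = percent_flagged(tukey_outliers, sigma, n)
            print(f"sigma={sigma} n={n} SD: {sd_pct:.2f}%  Tukey: {tk_pct:.2f}%")
```

On the small data set [1, 2, ..., 9, 100], both tukey_outliers and sd_outliers with k=2 label 100 as an outlier, while sd_outliers with k=3 labels nothing, because the extreme value inflates the mean and standard deviation that the SD method relies on.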


Details

Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Seo, Songwon (email: songwon@gmail.com)
ETD Committee:
  Committee Chair: Marsh, Gary M. (email: gmarsh@pitt.edu; Pitt username: GMARSH)
  Committee Member: Cassidy, Laura (email: lcs3@pitt.edu; Pitt username: LCS3)
  Committee Member: Sharma, Ravi K. (email: rks1946@pitt.edu; Pitt username: RKS1946)
Date: 9 August 2006
Date Type: Completion
Defense Date: 26 April 2006
Approval Date: 9 August 2006
Submission Date: 25 May 2006
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Institution: University of Pittsburgh
Schools and Programs: Graduate School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: boxplot; lognormal; outlier; skewed distribution
Other ID: http://etd.library.pitt.edu/ETD/available/etd-05252006-081925/, etd-05252006-081925
Date Deposited: 10 Nov 2011 19:45
Last Modified: 15 Nov 2016 13:43
URI: http://d-scholarship.pitt.edu/id/eprint/7948
