Seo, Songwon (2006) A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. Master's Thesis, University of Pittsburgh. (Unpublished)
Abstract
Most real-world data sets contain outliers that have unusually large or small values when compared with others in the data set. Outliers may adversely affect data analyses that rest on distribution assumptions, such as ANOVA and regression, or may provide useful information about the data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in both cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without knowledge of their characteristics in these circumstances. This is because each labeling method uses different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data. Many kinds of public health data are skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance, surgical procedure times, blood pressure, and assessment of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets. These results may help establish guidelines for the choice of outlier detection methods in skewed data, which are often seen in the public health field.
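As a rough illustration of the comparison the abstract describes, the sketch below labels outliers with an SD rule and with Tukey's fences, then estimates the percent of points each rule labels in simulated lognormal samples. It is not the thesis's code: the cutoffs (2 standard deviations, 1.5 times the interquartile range), the function names, the lognormal parameters, and the number of replications are all assumptions chosen for illustration.

```python
import random
import statistics


def sd_outliers(data, k=2.0):
    """Label values more than k sample SDs from the mean (k=2 is an assumed cutoff)."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * sd]


def tukey_outliers(data, k=1.5):
    """Label values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed cutoff)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def percent_labeled(method, n, sigma, reps=200, seed=0):
    """Average percent of points labeled over `reps` lognormal samples of size n.

    sigma is the SD of the underlying normal; larger sigma means stronger right skew.
    """
    rng = random.Random(seed)
    labeled = 0
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        labeled += len(method(sample))
    return 100.0 * labeled / (n * reps)


if __name__ == "__main__":
    # Hypothetical grid of skewness levels and sample sizes.
    for sigma in (0.2, 0.6, 1.0):
        for n in (20, 100, 500):
            sd_pct = percent_labeled(sd_outliers, n, sigma)
            tukey_pct = percent_labeled(tukey_outliers, n, sigma)
            print(f"sigma={sigma} n={n}: SD {sd_pct:.2f}%  Tukey {tukey_pct:.2f}%")
```

The SD rule depends on the mean and standard deviation, both of which are pulled toward extreme values, while Tukey's fences depend only on quartiles. Running the grid gives a feel for how the labeled percentage shifts with skewness and sample size under these assumed settings.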
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Seo, Songwon
Date: 9 August 2006
Date Type: Completion
Defense Date: 26 April 2006
Approval Date: 9 August 2006
Submission Date: 25 May 2006
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: boxplot; lognormal; outlier; skewed distribution
Other ID: http://etd.library.pitt.edu/ETD/available/etd-05252006-081925/, etd-05252006-081925
Date Deposited: 10 Nov 2011 19:45
Last Modified: 15 Nov 2016 13:43
URI: http://d-scholarship.pitt.edu/id/eprint/7948