Kreinacke, Marcos and Bekhuis, Tanja and Spallek, Heiko and Song, Mei and O'Donnell, Jean A
(2011)
Preprocessing messages posted by dentists to an Internet mailing list: a report of methods developed for a study of clinical content.
Technical Report.
UNSPECIFIED, Pittsburgh, PA, USA.
(Unpublished)
Abstract
Objectives: Mining social media artifacts requires substantial processing before content analyses. In this report, we describe our procedures for preprocessing 14,576 e-mail messages sent to a mailing list of several hundred dental professionals. Our goal was to transform the messages into a format useful for natural language processing (NLP) to enable subsequent discovery of clinical topics expressed in the corpus.

Methods: Preprocessing involved message capture, database creation and import, extraction of multipurpose Internet mail extensions, decoding of encoded text, de-identification, and cleaning. We also developed a Web-based tool to identify signals for noisy strings and sections, and to verify the effectiveness of customized noise filters. We tailored our cleaning strategies to delete text and images that would impede NLP and in-depth content analyses. Before applying the full set of filters to each message, we determined an effective filter order.

Results: Preprocessing messages improved effectiveness of NLP by 38%. Sources of noise included personal information in the salutation, the farewell, and the signature block; names and places mentioned in the body of the text; threads with quoted text; advertisements; embedded or attached images; spam- and virus-scanning notifications; auto text parts; e-mail addresses; and Web links. We identified 53 patterns of noise and delivered a set of de-identified and cleaned messages to the NLP analyst.

Conclusion: Preprocessing electronic messages can markedly improve subsequent NLP to enable discovery of clinical topics.

Keywords: Electronic mail; data processing; natural language processing; dental informatics
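
The abstract names three steps (extracting MIME parts, decoding encoded text, and applying noise filters in a fixed order) without giving implementation details. The following Python sketch illustrates what such a pipeline could look like using only the standard library. It is not the authors' code: the function names, the four filter patterns, the placeholder tokens, and the filter order are all assumptions chosen for illustration, and the report's 53 noise patterns are not reproduced here.

```python
import re
from email import message_from_string

# Hypothetical noise filters as (name, pattern, replacement) tuples.
# The order is an assumption: quoted threads and the signature block are
# removed first so that later filters operate on less text, and links are
# handled before e-mail addresses because a URL may contain an "@".
NOISE_FILTERS = [
    ("quoted_thread", re.compile(r"^>.*$\n?", re.MULTILINE), ""),
    ("signature_block", re.compile(r"^-- ?$.*", re.MULTILINE | re.DOTALL), ""),
    ("web_link", re.compile(r"https?://\S+"), "[URL]"),
    ("email_address", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
]


def extract_plain_text(raw_message: str) -> str:
    """Return the decoded text/plain parts of a raw e-mail message."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        # Skip multipart containers, HTML parts, images, and other non-text parts.
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def apply_filters(text: str, filters=NOISE_FILTERS) -> str:
    """Apply each noise filter in list order, then collapse whitespace."""
    for _name, pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())


def preprocess(raw_message: str) -> str:
    """Extract, decode, and clean one message."""
    return apply_filters(extract_plain_text(raw_message))
```

Calling `preprocess(raw)` on a raw message string returns a single cleaned line of text. Because the filters are held in an ordered list, trying a different filter order only requires reordering `NOISE_FILTERS`. De-identification of names and places in the message body would need more than regular expressions and is left out of this sketch.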
Details
Item Type: Monograph (Technical Report)
Status: Unpublished
Centers: Other Centers, Institutes, Offices, or Units > Center for Dental Informatics
Monograph Type: Technical Report
Date: 26 September 2011
Date Type: Publication
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Place of Publication: Pittsburgh, PA, USA
Institution: University of Pittsburgh School of Medicine
Department: Department of Biomedical Informatics
Schools and Programs: School of Medicine > Biomedical Informatics
Refereed: No
Uncontrolled Keywords: Electronic mail, social media, de-identification, text processing, text mining, content analysis, natural language processing, dental informatics, biomedical informatics
Funders: Pittsburgh Biomedical Informatics Training Program 5T15LM007059
Date Deposited: 18 Sep 2012 14:24
Last Modified: 31 Jul 2020 19:02
URI: http://d-scholarship.pitt.edu/id/eprint/14247