Preprocessing messages posted by dentists to an Internet mailing list: a report of methods developed for a study of clinical content

Kreinacke, Marcos and Bekhuis, Tanja and Spallek, Heiko and Song, Mei and O'Donnell, Jean A (2011) Preprocessing messages posted by dentists to an Internet mailing list: a report of methods developed for a study of clinical content. Technical Report. UNSPECIFIED, Pittsburgh, PA, USA. (Unpublished)

Files: Microsoft Word (Technical Report, 327kB); Plain Text (licence, 1kB)
Available under License: See the attached license file.

Objectives: Mining social media artifacts requires substantial processing before content analyses. In this report, we describe our procedures for preprocessing 14,576 e-mail messages sent to a mailing list of several hundred dental professionals. Our goal was to transform the messages into a format useful for natural language processing (NLP) to enable subsequent discovery of clinical topics expressed in the corpus.

Methods: Preprocessing involved message capture, database creation and import, extraction of multipurpose Internet mail extensions, decoding of encoded text, de-identification, and cleaning. We also developed a Web-based tool to identify signals for noisy strings and sections, and to verify the effectiveness of customized noise filters. We tailored our cleaning strategies to delete text and images that would impede NLP and in-depth content analyses. Before applying the full set of filters to each message, we determined an effective filter order.

Results: Preprocessing messages improved effectiveness of NLP by 38%. Sources of noise included personal information in the salutation, the farewell, and the signature block; names and places mentioned in the body of the text; threads with quoted text; advertisements; embedded or attached images; spam- and virus-scanning notifications; auto text parts; e-mail addresses; and Web links. We identified 53 patterns of noise and delivered a set of de-identified and cleaned messages to the NLP analyst.

Conclusion: Preprocessing electronic messages can markedly improve subsequent NLP to enable discovery of clinical topics.

Keywords: Electronic mail; data processing; natural language processing; dental informatics
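
The Methods steps of MIME extraction, decoding, and ordered noise filtering can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the four regular expressions, the placeholder tokens, and the filter order are all assumptions made for this example, and the report's 53 noise patterns are not reproduced here.

```python
import re
from email import message_from_string

# Hypothetical noise patterns, for illustration only.
QUOTED_LINE = re.compile(r"^[ \t]*>.*$", re.MULTILINE)
SIGNATURE = re.compile(r"^--[ \t]*$.*", re.MULTILINE | re.DOTALL)
WEB_LINK = re.compile(r"https?://\S+|www\.\S+")
EMAIL_ADDRESS = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Assumed filter order: drop quoted thread text first, then the signature
# block, then mask links and addresses that remain in the body.
FILTERS = [
    (QUOTED_LINE, ""),
    (SIGNATURE, ""),
    (WEB_LINK, "[URL]"),
    (EMAIL_ADDRESS, "[EMAIL]"),
]


def extract_plain_text(raw_message: str) -> str:
    """Return the decoded text/plain parts of a raw e-mail message."""
    message = message_from_string(raw_message)
    chunks = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue  # skip images, HTML parts, and multipart containers
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def clean(text: str) -> str:
    """Apply the noise filters in their fixed order."""
    for pattern, replacement in FILTERS:
        text = pattern.sub(replacement, text)
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()
```

In this sketch the order matters because quoted replies can themselves contain signature delimiters, links, and addresses; removing them first keeps the later filters focused on the sender's own text.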



Item Type: Monograph (Technical Report)
Status: Unpublished
Creators:
Kreinacke, Marcos
Bekhuis, Tanja (Email: tcb24@pitt.edu; Pitt Username: TCB24; ORCID: 0000-0002-8537-9077)
Spallek, Heiko
Song, Mei
O'Donnell, Jean A
Centers: Other Centers, Institutes, Offices, or Units > Center for Dental Informatics
Monograph Type: Technical Report
Date: 26 September 2011
Date Type: Publication
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Place of Publication: Pittsburgh, PA, USA
Institution: University of Pittsburgh School of Medicine
Department: Department of Biomedical Informatics
Schools and Programs: School of Medicine > Biomedical Informatics
Refereed: No
Uncontrolled Keywords: Electronic mail, social media, de-identification, text processing, text mining, content analysis, natural language processing, dental informatics, biomedical informatics
Funders: Pittsburgh Biomedical Informatics Training Program 5T15LM007059
Date Deposited: 18 Sep 2012 14:24
Last Modified: 31 Jul 2020 19:02
