Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger

Liu, K and Chapman, W and Hwa, R and Crowley, RS (2007) Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger. Journal of the American Medical Informatics Association, 14 (5). 641 - 650. ISSN 1067-5027


Part-of-speech (POS) tagging represents an important first step for most medical natural language processing (NLP) systems. The majority of current statistically based POS taggers are trained using a general English corpus. Consequently, these systems perform poorly on medical text. Annotated medical corpora are difficult to develop because of the time and labor required. We investigated a heuristic-based sample selection method to minimize annotated corpus size for retraining a Maximum Entropy (ME) POS tagger. We developed a manually annotated domain-specific corpus (DSC) of surgical pathology reports and a domain-specific lexicon (DL). We sampled the DSC using two heuristics to produce smaller training sets and compared the retrained performance against (1) the original ME modeled tagger trained on general English, (2) the ME tagger retrained on the DL, and (3) the MedPost tagger trained on MEDLINE abstracts. Results showed that the ME tagger retrained with a DSC was superior to the tagger retrained with the DL, and also superior to MedPost. Heuristic methods for sample selection produced performance equivalent to use of the entire training set, but with many fewer sentences. Learning curve analysis showed that sample selection would enable an 84% decrease in the size of the training set without a decrement in performance. We conclude that heuristic sample selection can be used to markedly reduce human annotation requirements for training of medical NLP systems. © 2007 J Am Med Inform Assoc.


Item Type: Article
Status: Published
Creators:
Liu, K
Chapman, W
Hwa, R (Email: reh23@pitt.edu; Pitt Username: REH23)
Crowley, RS
Date: 1 September 2007
Date Type: Publication
Journal or Publication Title: Journal of the American Medical Informatics Association
Volume: 14
Number: 5
Page Range: 641 - 650
DOI or Unique Handle: 10.1197/jamia.m2392
Schools and Programs: School of Medicine > Biomedical Informatics
Refereed: Yes
ISSN: 1067-5027
MeSH Headings: Artificial Intelligence; Humans; Linguistics; Natural Language Processing; Pathology, Surgical; Terminology as Topic
Other ID: NLM PMC1975798
PubMed Central ID: PMC1975798
PubMed ID: 17600099
Date Deposited: 29 Aug 2012 20:58
Last Modified: 03 Feb 2019 00:55

