Liu, K and Chapman, W and Hwa, R and Crowley, RS
(2007)
Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger.
Journal of the American Medical Informatics Association, 14 (5).
641 - 650.
ISSN 1067-5027
Abstract
Part-of-speech tagging represents an important first step for most medical natural language processing (NLP) systems. The majority of current statistically-based POS taggers are trained using a general English corpus. Consequently, these systems perform poorly on medical text. Annotated medical corpora are difficult to develop because of the time and labor required. We investigated a heuristic-based sample selection method to minimize annotated corpus size for retraining a Maximum Entropy (ME) POS tagger. We developed a manually annotated domain specific corpus (DSC) of surgical pathology reports and a domain specific lexicon (DL). We sampled the DSC using two heuristics to produce smaller training sets and compared the retrained performance against (1) the original ME modeled tagger trained on general English, (2) the ME tagger retrained on the DL, and (3) the MedPost tagger trained on MEDLINE abstracts. Results showed that the ME tagger retrained with a DSC was superior to the tagger retrained with the DL, and also superior to MedPost. Heuristic methods for sample selection produced performance equivalent to use of the entire training set, but with many fewer sentences. Learning curve analysis showed that sample selection would enable an 84% decrease in the size of the training set without a decrement in performance. We conclude that heuristic sample selection can be used to markedly reduce human annotation requirements for training of medical NLP systems. © 2007 J Am Med Inform Assoc.
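The abstract does not say which two heuristics were used to sample the DSC. As an illustration only, the sketch below assumes one plausible heuristic: greedily pick the sentences that contribute the most word types not yet covered by a known vocabulary, and stop once a sentence budget is reached or no sentence adds anything new. The function name `select_by_unknown_words`, the toy sentences, and the vocabulary are all hypothetical and are not taken from the paper.

```python
def select_by_unknown_words(pool, known_vocab, budget):
    """Greedy sample selection (illustrative assumption, not the paper's method).

    pool        -- list of tokenized sentences (each a list of strings)
    known_vocab -- words assumed to be covered already (e.g. general English)
    budget      -- maximum number of sentences to send for manual annotation
    """
    seen = {w.lower() for w in known_vocab}
    remaining = list(pool)
    selected = []
    while remaining and len(selected) < budget:
        # Score each candidate by how many distinct unseen word types it adds.
        best = max(remaining, key=lambda s: len({t.lower() for t in s} - seen))
        gain = {t.lower() for t in best} - seen
        if not gain:
            break  # nothing left in the pool adds a new word type
        selected.append(best)
        seen |= gain
        remaining.remove(best)
    return selected


# Toy data, invented for this example.
known = {"the", "is", "a", "of"}
pool = [
    ["the", "specimen", "is", "received"],
    ["sections", "show", "invasive", "ductal", "carcinoma"],
    ["the", "specimen", "is", "a", "biopsy"],
    ["the", "is", "a"],
]

chosen = select_by_unknown_words(pool, known, budget=2)
reduction = 1 - len(chosen) / len(pool)
print(f"annotate {len(chosen)} of {len(pool)} sentences "
      f"({reduction:.0%} smaller training set)")
```

A learning curve of the kind the abstract mentions could then be drawn by retraining the tagger at increasing values of `budget` and plotting tagging accuracy against the number of selected sentences.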
Details
Field | Value
---|---
Item Type | Article
Status | Published
Creators/Authors | Liu, K; Chapman, W; Hwa, R (email: reh23@pitt.edu, Pitt username: REH23); Crowley, RS
Date | 1 September 2007
Date Type | Publication
Journal or Publication Title | Journal of the American Medical Informatics Association
Volume | 14
Number | 5
Page Range | 641 - 650
DOI or Unique Handle | 10.1197/jamia.m2392
Schools and Programs | School of Medicine > Biomedical Informatics
Refereed | Yes
ISSN | 1067-5027
MeSH Headings | Artificial Intelligence; Humans; Linguistics; Natural Language Processing; Pathology, Surgical; Terminology as Topic
Other ID | NLM PMC1975798
PubMed Central ID | PMC1975798
PubMed ID | 17600099
Date Deposited | 29 Aug 2012 20:58
Last Modified | 03 Feb 2019 00:55
URI | http://d-scholarship.pitt.edu/id/eprint/13805