
Searching for Entities: When Retrieval Meets Extraction

Li, Qi (2012) Searching for Entities: When Retrieval Meets Extraction. Doctoral Dissertation, University of Pittsburgh.

    Abstract

    Retrieving entities from within documents, rather than retrieving the documents or web pages themselves, has become an active topic in both commercial search systems and academic information retrieval research. Given an information need expressed as a description together with a target answer entity type, an entity search task returns a ranked list of answer entities drawn from unstructured text, such as news articles or web pages. Although entity retrieval operates in the same environment as document retrieval, it requires finer-grained answers, namely entities, and therefore more syntactic and semantic analysis of germane documents than document retrieval does. This work proposes a two-layer probability model for this task that integrates germane document identification and answer entity extraction. Germane document identification retrieves the highly related documents that contain answer entities, while answer entity extraction finds the answer entities by using syntactic or linguistic information from those documents. Using the probability model, this work demonstrates theoretically how germane document identification and answer entity extraction are integrated for the entity retrieval task. Moreover, this probabilistic approach helps to reduce the overall retrieval complexity while maintaining high accuracy in locating answer entities. A series of studies is conducted in this dissertation on both germane document identification and answer entity extraction. A learning-to-rank method is investigated for germane document identification. The method first trains a model on the training data set using query features, document features, similarity features and rank features. The learned model then estimates the probability that each document in the testing data sets is germane.
    The experiment indicates that the learning-to-rank method performs significantly better than the baseline systems, which treat germane document identification as a conventional document retrieval problem. Answer entity extraction aims to correctly extract the answer entities from the germane documents. Two families of methods are investigated: extraction without context (using named entity recognition tools or a knowledge base) and extraction with context (using tables and lists, or subject-verb-object structures, as context). Individually, however, each of these methods extracts only part of the answer entities. Treating answer entity extraction as a classification problem, with features drawn from the above extraction methods, performs significantly better than any of the individual extraction methods.
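
The abstract does not state the model's equations, so the following is only an illustrative sketch of how a two-layer combination of this kind could look. It assumes (this is not taken from the dissertation) that an entity's score is obtained by summing, over germane documents, the probability that the document is germane times the probability that the entity is an answer given that document. The function name `rank_entities` and the toy probabilities are hypothetical.

```python
from collections import defaultdict


def rank_entities(doc_probs, entity_probs):
    """Rank entities by combining two layers of probabilities.

    doc_probs:    {doc_id: P(d | q)}, output of the document layer (assumed given)
    entity_probs: {doc_id: {entity: P(e | d, q)}}, output of the extraction layer

    Assumed combination (not from the source): score(e) = sum_d P(d | q) * P(e | d, q)
    """
    scores = defaultdict(float)
    for doc_id, p_doc in doc_probs.items():
        # Documents with no extracted entities contribute nothing.
        for entity, p_ent in entity_probs.get(doc_id, {}).items():
            scores[entity] += p_doc * p_ent
    # Highest combined score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up probabilities for two documents and three candidate entities.
doc_probs = {"d1": 0.6, "d2": 0.4}
entity_probs = {
    "d1": {"Alpha Corp": 0.7, "Beta Inc": 0.3},
    "d2": {"Alpha Corp": 0.25, "Gamma Ltd": 0.75},
}
ranking = rank_entities(doc_probs, entity_probs)
```

In this toy run, "Alpha Corp" ranks first because it receives support from both documents, which is the intuition behind restricting extraction to a small set of germane documents and then aggregating evidence across them.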


    Details

    Item Type: University of Pittsburgh ETD
    ETD Committee:
    Committee Chair: He, Daqing (dah44@pitt.edu)
    Committee Member: Spring, Michael
    Committee Member: Munro, Paul
    Committee Member: Oh, Jung Sun (jsoh@sis.pitt.edu)
    Committee Member: Tsui, Fu
    Title: Searching for Entities: When Retrieval Meets Extraction
    Status: Published
    Date: 04 January 2012
    Date Type: Publication
    Defense Date: 07 October 2011
    Approval Date: 04 January 2012
    Submission Date: 08 November 2011
    Release Date: 04 January 2012
    Access Restriction: No restriction; The work is available for access worldwide immediately.
    Patent pending: No
    Number of Pages: 170
    Institution: University of Pittsburgh
    Thesis Type: Doctoral Dissertation
    Refereed: Yes
    Degree: PhD - Doctor of Philosophy
    Uncontrolled Keywords: Entity Retrieval, Information Retrieval, Entity Extraction
    Schools and Programs: School of Information Sciences > Information Science
    Date Deposited: 04 Jan 2012 11:20
    Last Modified: 16 Jul 2014 17:02
