
CiteData: A new multi-faceted dataset for evaluating personalized search performance

Harpale, A and Yang, Y and Gopal, S and He, D and Yue, Z (2010) CiteData: A new multi-faceted dataset for evaluating personalized search performance. In: UNSPECIFIED.



Personalized search systems have evolved to utilize heterogeneous features including document hyperlinks, category labels in various taxonomies and social tags, in addition to the free text of the documents. Consequently, classifiers, PageRank algorithms and Collaborative Filtering methods are often used as intermediate steps in such personalized retrieval systems. Thorough comparative evaluation of such complex systems has been difficult due to the lack of appropriate publicly available datasets that provide such diverse feature sets. To remedy the situation, we have created CiteData, a new dataset for benchmark evaluations of personalized search performance, which will be made publicly accessible. CiteData is a collection of academic articles extracted from the CiteULike and CiteSeer repositories, with rich feature sets such as authors, author affiliations, topic labels, social tags and citation information. We further supplement it with personalized queries and relevance judgments obtained from volunteer users. This paper starts with a discussion of the design criteria and characteristics of the CiteData dataset in comparison with current benchmark datasets, followed by a set of task-oriented empirical evaluations of popular algorithms in statistical classification, collaborative filtering and link analysis as intermediate steps for personalized search. Our results show that personalized approaches perform significantly better than unpersonalized approaches. We also observe that a meta personalized search engine that leverages information from multiple sources of features performs better than algorithms that use only one of the constituent sources of features. © 2010 ACM.




Item Type: Conference or Workshop Item (UNSPECIFIED)
Status: Published
Creators      Email             Pitt Username   ORCID
Harpale, A
Yang, Y
Gopal, S
He, D         dah44@pitt.edu    DAH44           0000-0002-4645-8696
Yue, Z
Date: 1 December 2010
Date Type: Publication
Journal or Publication Title: International Conference on Information and Knowledge Management, Proceedings
Page Range: 549 - 557
Event Type: Conference
DOI or Unique Handle: 10.1145/1871437.1871509
Institution: University of Pittsburgh
Schools and Programs: School of Information Sciences > Library and Information Science
Refereed: Yes
ISBN: 9781450300995
Date Deposited: 28 Jun 2011 18:23
Last Modified: 19 Jun 2019 13:55

