The White Method: Towards Automatic Evaluation Metrics for Adaptive Tutoring Systems

Gonzalez-Brenes, Jose and Huang, Yun (2014) The White Method: Towards Automatic Evaluation Metrics for Adaptive Tutoring Systems. In: Proceedings of NIPS 2014 Workshop on Human Propelled Machine Learning.


Human-propelled machine learning systems are often evaluated with randomized controlled trials. Unfortunately, such trials can be extremely expensive and time-consuming to conduct because they may require institutional review board approval, experimental design by an expert, recruitment (and often payment) of enough participants to achieve statistical power, and data analysis. Automatic evaluation metrics offer a cheaper and faster way to compare alternative systems, and the fields that have agreed on automatic metrics have seen an accelerated pace of technological progress. For example, the widespread adoption of the Bleu metric (Papineni et al., 2001) in the machine translation community has lowered the cost of developing and evaluating translation systems. At the same time, the low cost of the Bleu metric has enabled machine translation competitions that have led to great advances in translation quality. Similarly, the Rouge metric (Lin and Hovy, 2002) has helped the automatic summarization community move from expensive user studies of human judgments, which may take thousands of hours to conduct, to an automatic metric that can be computed very quickly.
We study how to evaluate adaptive intelligent tutoring systems, which are systems that teach and adapt to humans. These systems are complex and are often made up of many components (Almond et al., 2001), such as a student model, a content pool, and a cognitive model. We focus on evaluating tutoring systems that adapt the items students should solve, where items are questions, problems, or tasks that can be graded individually. These adaptive systems optimize the subset of items given to a student according to the student's historical performance (Corbett and Anderson, 1995) or features extracted from the student's activities (González-Brenes et al., 2014).
Adaptive tutoring implies making a trade-off between minimizing the amount of practice a student is assigned and maximizing her learning gains (Cen et al., 2007). Practicing a skill may improve proficiency in that skill, at the cost of a missed opportunity to teach new material. Prior work (Pardos and Yudelson, 2013; Pelánek, 2014; Dhanani et al., 2014) has surveyed different evaluation methods for adaptive systems. A tutoring system is usually evaluated either by applying a classification evaluation metric to its student model or by running a randomized controlled trial. The student model is the component of a tutoring system that forecasts whether a student will answer the next item correctly. Popular evaluation metrics for student models include classification accuracy, the Area Under the Curve (AUC) of the Receiver Operating Characteristic curve and, strangely for classifiers, the Root Mean Square Error. By convention, many authors report the performance of a majority classifier as a baseline, even though this classifier is not a student model that can be translated into a teaching policy.
Lee and Brunskill (2012) propose a promising evaluation metric that calculates the expected number of practice opportunities that students require to master the content of the tutoring system's curriculum. Their method is very successful for its purpose, but it is limited to a particular student model called Knowledge Tracing. Their approach requires a researcher to derive the theoretical expected behavior of the student model being evaluated, which is not possible in general. We propose WHole Intelligent Tutoring system Evaluation (White), a novel automatic method that evaluates the recommendations of an adaptive system. White overcomes the limitations of previous work in tutoring system evaluation by using student data and by allowing arbitrary student models to be assessed.
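The classification metrics named above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the 0.5 decision threshold, and the toy data are assumptions made for illustration. It computes accuracy, AUC, and RMSE for a hypothetical student model and for a majority-classifier baseline that always predicts a correct answer.

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of items whose thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def rmse(y_true, p_pred):
    """Root Mean Square Error between predicted probabilities and 0/1 outcomes."""
    squared = sum((p - y) ** 2 for y, p in zip(y_true, p_pred))
    return math.sqrt(squared / len(y_true))

def auc(y_true, p_pred):
    """AUC as the probability that a correct answer outranks an incorrect one.

    Ties count as half a win (the rank-statistic formulation of ROC AUC).
    """
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = item answered correctly, 0 = incorrectly.
y_true = [1, 1, 0, 1, 0, 1, 1, 1]
# Hypothetical student-model probabilities of a correct answer.
p_model = [0.9, 0.8, 0.4, 0.7, 0.6, 0.65, 0.3, 0.85]
# Majority classifier: "correct" is the majority class, so it always predicts 1.
p_majority = [1.0] * len(y_true)

for name, preds in [("model", p_model), ("majority", p_majority)]:
    print(name, accuracy(y_true, preds), auc(y_true, preds), rmse(y_true, preds))
```

On this toy data both predictors reach the same accuracy, yet the majority classifier's prediction never changes, so it gives a tutoring policy nothing to act on.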
White relies on counterfactual simulations: it reproduces the decisions that the tutoring system would have made given the input data on the test set. White takes two inputs: (i) a policy that describes when a subset of items of the tutoring system should be presented to the student, and (ii) the student model's predictions on the test set. For each student in the test set, White estimates the counterfactual effort, that is, how many items the student would have solved using the tutoring system. White also calculates a counterfactual score (grade) to represent student learning. The student effort and score act as proxies for the design goals of a tutoring system: maximizing learning while minimizing student effort. With some mild assumptions about the tutoring system, White can evaluate a large array of tutoring systems with different student models.
Our experiments on real and synthetic data reveal that student models can score highly on traditional classification metrics yet provide little educational value to the learner. Moreover, when we compare alternative tutoring systems with these classification metrics, we find that the metrics may favor tutoring systems that require higher student effort with no evidence that students are learning more. That is, when comparing two alternative systems, classification metrics may prefer a suboptimal system. Our results add to the growing body of evidence against using classification metrics to evaluate tutoring systems (Beck and Xiong, 2013). White is designed to evaluate tutoring systems on student effort and student learning, and it provides a better alternative for assessing adaptive systems.
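The counterfactual simulation described above can be sketched roughly as follows. The abstract does not specify the algorithm, so everything here is an assumption for illustration: the policy is taken to be a simple mastery threshold (stop assigning items once the predicted probability of a correct answer reaches `threshold`), effort is the number of logged items before that point, and the score is the mean observed correctness on the remaining logged items. The function name and the data are hypothetical.

```python
def counterfactual_effort_and_score(predictions, outcomes, threshold=0.6):
    """Replay one student's logged sequence under an assumed threshold policy.

    predictions[t]: student model's probability of a correct answer on item t.
    outcomes[t]:    logged result for item t (1 = correct, 0 = incorrect).
    Returns (effort, score); score is None if the policy never stops.
    """
    stop = len(predictions)  # default: the policy never declares mastery
    for t, p in enumerate(predictions):
        if p >= threshold:   # assumed policy: stop practice at this point
            stop = t
            break
    effort = stop                  # items the student would still have solved
    remaining = outcomes[stop:]    # logged items after the simulated stop
    score = sum(remaining) / len(remaining) if remaining else None
    return effort, score

# Hypothetical logged data for one student on one skill.
outcomes = [0, 1, 1, 0, 1]
adaptive_model = [0.3, 0.45, 0.62, 0.7, 0.8]  # predictions change with practice
constant_model = [0.75] * 5                   # majority-style, never changes

print(counterfactual_effort_and_score(adaptive_model, outcomes))
print(counterfactual_effort_and_score(constant_model, outcomes))
```

Under these assumptions the constant model stops practice immediately (zero effort), so its counterfactual score reflects only what the student already knew, whereas the adaptive model assigns two items before stopping.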



Item Type: Conference or Workshop Item (Paper)
Status: Published
Creators (Name; Email; Pitt Username; ORCID):
Gonzalez-Brenes, Jose
Huang, Yun; yuh43@pitt.edu; YUH43
Date: 2014
Date Type: Publication
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Journal or Publication Title: Proceedings of NIPS 2014 Workshop on Human Propelled Machine Learning
Event Title: Proceedings of NIPS 2014 Workshop on Human Propelled Machine Learning
Event Type: Conference
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Intelligent Systems
Refereed: Yes
Date Deposited: 27 Aug 2015 18:32
Last Modified: 25 Aug 2017 04:59

