Iterative Discriminant Tensor Factorization for Behavior Comparison in Massive Open Online Courses

The increasing utilization of massive open online courses has significantly expanded global access to formal education. Despite the technology's promising future, student interaction on MOOCs is still a relatively under-explored and poorly understood topic. This work proposes a multi-level pattern discovery through hierarchical discriminative tensor factorization. We formulate the problem as a hierarchical discriminant subspace learning problem, where the goal is to discover the shared and discriminative patterns with a hierarchical structure. The discovered patterns enable a more effective exploration of the contrasting behaviors of two performance groups. We conduct extensive experiments on several real-world MOOC datasets to demonstrate the effectiveness of our proposed approach. Our study advances the current predictive modeling in MOOCs by providing more interpretable behavioral patterns and linking their relationships with the performance outcome.


INTRODUCTION
While massive open online courses (MOOCs) have been attracting an ever-increasing number of students, the low completion rate (between 5%-10% [19]) has been a major obstacle to the transformative potentials of MOOCs. The predictive analysis of student performance thus emerged an an important research topic offering insights to platform developers and instructors in arranging proper learning support and allocating resources to students. To find informative predictors, researchers have focused on extracting features from students' interaction with the MOOC platform, such as watching videos, working on assignments, and viewing or contributing to discussion forums. Applied predictive models range from standard machine learning methods [23,28] to more advanced ones such as deep learning [7]. These prediction models could be useful in predicting learning outcomes but are notably limited in helping understand the underlying learning behavior. There is abundant work that aims to better understand the behavior patterns that relate to the learning outcomes. For example, Coleman et al. [4] correlates each "behavior topic" to the learning outcomes based on topic modeling. However, the research to date fails to consider the multi-dimensional nature of the features and their potential interactions in outcome learning. Meanwhile, the famous Simpson's paradox points out that the direction of an association at the population-level may be reversed within the subgroups comprising that population [20]. To further explain this in the context of the MOOC platform, we use the Edx MOOC dataset [14] and investigate the factors associated with the students' grade on it.
We extract the education, area (the region where the student is from), the number of active days and the final grade of each student in the course "Introduction to Computer Science and Programming" offered by MITx in Spring 2013. The course had more than 44,000 online participants. Figure 1 shows the association between the selected factors and the final grades of the students. By comparing Figure 1(a) and Figure 1(b), we observe that when the factor of area is considered, mixed associations occur. For example, the subgroup of Indian participants exhibited a negative association between education level and grade, which potentially suggests a more conservative understanding of the relationship between educational background and course outcome. Moreover, when we consider the number of active days, as in Figure 1(c), we notice that for the participants from the "other European" area, the positive association has a strong presence-but only with ShortestStay and LongestStay. Thus, in comparison to typical correlation analysis, a prediction model that can take advantage of the multi-way interactions of the features could potentially yield better performance.
A growing body of research seeks to resolve Simpson's paradox through causality inference [34]. However, causality is not the focus of this study as we aim for a data-driven approach to subgroup comparisons and explorations. In many cases, this can be interesting and important even in non-causal settings. A straightforward solution is to perform regression with feature interactions, or use Factorization Machines [37] that allow for the estimation of high-order interaction effects. However, the drawback of these methods is that they offer little understanding of the underlying multi-way learning behavior dynamics and their relationship to the learning outcomes. On the other hand, Factorization models like Matrix Factorization (2-way) and Tensor Factorization (m-way) are able to provide an in-depth understanding of meaningful behavior dynamics [9], but there are a few drawbacks preventing them from being more widely adopted by researchers in the field. First, the associations are isolated, with each of them capturing a certain trend of the behaviors separately (e.g., Figure 2(c)). Second, conventional pattern discovery through factorization models provides little support for contrasting pattern exploration that aims to identify the shared and discriminative behavior characteristics among different groups of users. Being able to do this can tremendously improve knowledge of user behaviors in the context of user group analysis.
In this work, we formulate the problem of understanding learning behavior in MOOCs as (1) the simultaneous factorization of the association between students' multi-aspect features and their performance, and (2) the iterative discovery of interpretable shared and discriminative patterns at multiple levels. The critical challenge is how to utilize the multi-way interaction of the features while providing interpretable patterns to help domain experts understand the learning dynamics. We propose a tensor-based learning methoditerative Discriminative tensor factorization (iDisc)-that discovers the common and discriminative learning patterns at multiple levels, and based on which we project users to a latent space (i.e. embedding for the downstream prediction tasks) to identify the association between the multi-way interaction of the features and the students' performance. To this end, we first represent the behaviors of the students from the opposite performance groups as coupled tensors. Since the coarse-grained joint factorization of these behavior tensors may not be capable of revealing behavior patterns at the subgroup level, iDisc iteratively performs discriminative pattern discovery at multiple levels. To increase the interpretability of the entire pattern space, we also introduce the inference of pattern hierarchy. To make the solution capable of handling unseen students, we project the students' behavior tensors into a latent space, by considering the multi-way interactions at different levels as the loading matrix. The empirical studies with the dataset from different MOOC platforms have shown the promising results on the effectiveness and efficiency of iDisc. Contributions Our contributions can be summarized as follows: • We formulate the problem of identifying the multi-way feature interaction with interpretable pattern discovery for understanding user behavior on the MOOC platforms. • We propose a framework of iterative discriminant factorization for multi-way data. By factorizing the residual tensors at each level, our method enables the discovery of common and discriminative patterns at different granular levels. To ensure the parsimony of the discovered structure, we employ sparse learning to effectively capture enforcing relationships between the top-level and bottom-level patterns. • We perform extensive experimentation of our methodology using several real-world datasets, and show the efficiency and interpretability of our proposed method.

RELATED WORK
Predictive Modeling in MOOCs. There are several types of predictive models in MOOCs that are closely related to this work. One direction is to utilize more complex feature types, including higher-order n-gram representations of learner activity data. For instance, features are constructed using the occurrence of pre-defined sequential activities [44], or from sequential pattern mining [11,17,26,40]. Another line of work proposes to utilize the temporal nature of the activity data for student success prediction. Qiu et al. [36] propose a latent dynamic factor graph (LadFG) to model and predict learning behavior in MOOCs. LadFG captures the dynamic information and homophily correlations between students. It also projects students' learning behavior into a latent continuous space for predicting student performance. Another approach is the latent variable modeling as a way of inferring complex relationships between predictors [13,25,27]. For instance, Halawa et al. [13] explore the use of count-based learning activity features to predict dropout; this approach suggests that both observable learner activity and dropout are driven by latent, unobservable "persistence" factors. Common and Discriminative Subspace Learning. The increasing availability of data from diverse sources has enabled the study on joint analysis of heterogeneous data. Gupta et al. [12] propose a joint Non-negative Matrix Factorization (NMF) approach to separate the common and discriminative subspace. Following the same idea, Kim et al. [22] relax the framework by requiring the shared subspace to be similar rather than strictly identical. Wen et al. [46,47] propose a data-driven method of jointly factorizing the paired tensors with auxiliary tensors that preserve the common and distinctive signals. However, existing works focus primarily on pattern discoveries. This limits their use in downstream tasks (e.g., prediction or classification).
Multi-Level Tensor Factorization. Multi-Level Tensor Factorization addresses the problem of approximating the hierarchical lowrank tensor format. This process allows the representation of the tensors in a nested subspace, in one of Tree-Tucker format [30], tensor train format [29], or tensor networks format [3]. Huang et al. [18] employ a tree-guided learning via tensor decomposition and matrix factorization in the context of experts recommendation in multiple areas simultaneously. However, there is limited research that discovers hierarchical nested subspace in the tensor subspace [31]. Özdemir et al. [31] construct a data-dependent multiscale subspace to better represent the data. To do so, the authors first construct a tree structure by partitioning the tensor into a collection of permuted sub-tensors, and then construct the multi-scale subspace by applying HoSVD to each sub-tensor. Summary. The existing predictive analytics on MOOCs considers various ways of constructing a matrix-based feature space. We argue that tensors could be a more suitable representation for student behavioral modeling due to their flexibility in representing the multi-way interaction of the behavioral data. In this regard, Sahebi et al. [39] have shown success in using a tensor-based approach to model the students' learning process, and predict student performance. However, the multi-way interactions as behavior patterns have not been discussed, and a more interpretable pattern discovery that can support a comprehensive understanding of student behaviors is missing. On the other hand, most hierarchical tensor factorization methods tend to recursively decompose the tensor modes by a pre-specified dimension tree [3,29]. Our work, instead, is closer to the multi-level tensor factorization approach by [32], which recursively factorizes the residual tensors to obtain a multi-level representation of the subspace. However, to the best of our knowledge, there has been no work yet that discovers the common and discriminative patterns at multiple levels, especially in the area of predictive modeling in educational data mining.

PROBLEM FORMULATION
In this section, we start with a brief introduction to tensor notions and operations and then formulate the problem considered in this study. Table 1 summarizes the notations used in this paper.

Tensors.
A tensor is a multidimensional array. Let x denote a scalar, x a vector, X a matrix, and then X is the extension of these concepts to higher dimensions. The order of the tensor is the number of modes (or ways) in the tensor.

Tensor
Operations. The basic tensor operations that we use in this study are: Mode-n unfolding, or matricization, is the process of re-ordering elements of tensor X to form a matrix X (n) , where n is the dimension index upon which the unfolding performs. In the case of the three-way tensor X ∈ R I 1 ×I 2 ×I 3 , its matricization in the first mode can be denoted as X (1) ∈ R I 1 ×(I 2 ×I 3 ) .

Problem Formulation with Tensors
To motivate the problem in the context of a real-world dataset, we discuss the application of NMF, Non-negative Tensor Factorization (NTF), discriminative NTF, and hierarchical NTF with a toy dataset. This dataset was extracted from one of the most popular courses in the XueTangX dataset (full dataset details in Section 5.1). Each student event, or activity, is associated with three attributes: time (d 1 , d 2 ), source (s 1 , s 2 ), and type (t 1 , t 2 , . . . , t 7 ).

NMF.
Let matrix X denote the aggregated activities that users have been recorded engaging in on the XueTangX platform for this course. The nature of matrix X is a two-dimensional array, which restricts its capability of integrating further information [38]. In this way, we can either drop one of the attributes or force the third dimension to be combined with the second dimension. where (type+type) can be considered as a repeated vector to jointly represent the event activity and the day.
With the behavior described by X, the bottom part of Figure 2(ab) show the respective low-rank factor matrices approximated by NMF, the source factors (left), and the type factors (right), since they provide the low-dimensional representations of each source and each activity, respectively. Compared to X ′ , X ′′ has the additional advantage of revealing the low-dimensional representation of each activity on different days.

NTF.
Alternatively, we can use a tensor to represent the same dataset (Figure 2(c)). With the given data of a ternary relation nature [43], we could use a third-order tensor X to denote a source×day×type activity.
NTF techniques can be applied to obtain three low-dimensional representations: source factors, day factors, and type factors, as X ≈ S, D, T . As a result, each pattern comes with a set: a betweenactivity vector t to describe the activity dynamics; a between-source vector s to describe the usage tendency between different sources; and an across-day vector d to describe the temporal dynamics (e.g., p 1 in Figure 2(c)). Compared to factorizing the unfolding matrix X ′′ (Figure 2(b)), NTF introduces the day-specific factors. This significantly increases the presentation capability of the patterns by revealing a more direct across-day (temporal) dynamics. Each pattern now represents the interplay of three factors, describing the tendency from different perspectives. With the rich attributes in the behavioral dataset on MOOCs, we thus use tensors to model the behaviors, with the hope that doing so can provide behavioral patterns with the interpretations from different aspects.

Discriminant NTF.
Standard NTF provides meaningful patterns for simultaneously analyzing the behaviors from multiple aspects. This substantially increases the capability of studying and interpreting the behaviors on MOOC platforms with rich dynamics. However, this still does not sufficiently serve the desire to comprehensively understand and investigate student behaviors on these platforms. For example, one of the most interesting questions is which behavior patterns are shared by completed students and dropout students, and which differentiate the two groups (Figure 2(d)). Through understanding the commonality and differences, researchers can better design course interactions and content to help more students successfully complete the course.
One could easily use NTF to fit the behavior tensors from each group of users, separately. However, this approach does not take advantage of any shared behavior patterns between the two groups. As the behavior moves to high-dimensional tensor space, this could potentially lead to the under-fitting problem. Besides, with patterns generated for each data tensor separately, it needs to perform an additional post-hoc analysis to determine common and discriminative patterns. This is a non-trivial attempt to align the common and discriminative patterns, in the case of each pattern being represented by multiple vectors from different aspects. In this regard, discriminative NTF is set to jointly factorize the tensors constructed from different groups of users with the following objective considering CP decomposition: where Ω(·) is the function to promote the simultaneous discovery of the common and discriminative patterns [22,47].

Hierarchical NTF.
Previous works on NTF or discriminative NTF for unsupervised pattern discovery focus on finding a set of patterns at equal granularity, or in a flat structure. Although they are adequately expressive to reveal the behavior dynamics from different aspects, they can not provide the relations between patterns (such as parent-child and sibling relations). A hierarchical non-negative tensor factorization (HierNTF) is more desirable than a set of "flat" patterns, because one can work with the pattern exploration in a hierarchy. As opposed to going through each pattern individually, this results in more efficient pattern understanding. HierNTF can be analogous to hierarchical topic modeling, such as hierarchical Latent Dirichlet Allocation (hLDA) [10], where patterns at higher levels in the hierarchy present "abstract" behavior topics, and ones at lower levels reveal more "specific" behavior topics.

Problem Statement.
Our problem falls into the combination of the discriminative NTF and HierNTF. We would like to identify common and discriminative patterns nested at multiple levels for a deeper understanding of the relationship between students' multiway behavior dynamics and their course performance. Before we give the formulation of the studied problem, we would like to first clarify some basic concepts used later. Definition 3.1. Individual Behavior Tensor. Let X (u ) ∈ R I 1 ×I 2 ···×I M be an M-way tensor representing an individual user u, with each entry in the X (u ) being an activity from user u that is jointly described by M attributes.
The attributes can be data or platform dependent, such as a timevarying attribute tensor constructed from demographic and behavior attributes associated with users at different time stamps [36]. The individual behavior tensor can be considered as a multi-way representation of each student. With that, for each performance group, we can compute the collective behavior tensor.
Definition 3.2. Collective Behavior Tensor. Let X c ∈ R I 1 ×I 2 ···×I M be an M-way tensor that users M attributes to describe the collective activities from a group of users indexed by c.
Students at each performance group can be jointly represented by a collective behavior tensor that captures the full multi-way feature interactions of their activities. Then, we can combine the tensors for the two opposite performance groups to construct the coupled tensors. The coupled tensors X can be constructed in various ways, depending on the performance metric selected, such as c ∈ {dropouts, completion} or c ∈ {certificates, no-certificates}. Inspired by [35], we construct the coupled tensor as follows: where U c is the subset of users u with u belonging to class c. While each individual behavior tensor captures the full-order feature interactions explained by her activities, the full interplay between the M attributes for each group of students can be contained within the tensor structure corresponding to the group. Definition 3.4. Multi-way Behavior Pattern. A multi-way behavior pattern is a a collection of M vectors (x (1) , x (2) , . . . , x (M ) ), M ≥ 2, where x (m) ∈ R I m is a vector to describe the pattern with the m-th attribute.
Definition 3.5. Pattern Hierarchy. Let P l ∈ R R l × R l −1 denote a pattern hierarchy that specifies the relationships between the behavior patterns in the consecutive levels, i.e., level l and level l − 1, where 1 ≤ l ≤ L. P l can be considered as a projection matrix that maps the pattern from the l-th level to the ones at the (l − 1)-th level.
Definition 3.6. Tensor-based User Embedding. A tensor-based user embedding v ∈ R R is the student's vector representation that preserves each student's behavior tensor with a lower-dimensional feature space R R , given the tensor's rank R.
With the definitions above, we define the iterative common and discriminative pattern analysis as follows: Problem 1. Given a set of individual behavior tensors X (u ) ∈ R I 1 ×I 2 ···×I M , (u = 1, 2, . . . , N ) corresponding to C categories, C = 2, and a set of unseen test data X (t ) ∈ R I 1 ×I 2 ···×I M , (t = 1, 2, . . . ,T ), our goal is to iteratively identify a set of patterns that reveal the  Figure 2: Comparison between NMF, NMF on unfolding Matrix, NTF and discriminant NTF.
common and discriminative behaviors at multiple levels, and then use the learned patterns as bases to infer the user embeddings for prediction of the group membership from the unseen students. Specifically, the task is threefold: (1) collective behavior pattern inference for discovering the common and discriminative patterns; (2) iterative pattern discovery at multiple levels with hierarchy; and (3) embedding projection for test samples based on the iterative patterns for classification. In other words, the first two tasks are aimed at revealing interpretable patterns that could explain the interplays between the behavior attributes, and the last task is to discover the relationship between student performance and the multi-way patterns.

SOLUTIONS
In this section, we introduce an iterative tensor factorization method named iDisc for the coupled tensors, X = {X c }. Figure 3 illustrates the overview of iDisc. There are two-stages: (1) iterative application of discriminative tensor subspace learning; and (2) representation learning for the unseen student based on the multi-level patterns.

Iterative Discriminative Tensor Subspace Learning
This component iteratively applies the following two-step approach: (1) discriminant low-rank tensor approximation, followed by (2) computing and passing the residual tensor into the next level.
Through Eq. 3, we could obtain a set of independent factor matrices (or behavior patterns) for each of the performance groups, respectively. However, this does not consider the commonality and differences among the coupled tensors. In order to take the commonality between the two behavior tensors into consideration and allow discrimination against each other, we introduce two sets of auxiliary tensors S l c and Z l c that capture the shared behavior and the discriminative behaviors among the coupled tensors. Inspired by [47], S l c and Z l c are computed based on the coupled tensors X with the clamping function (Eq. 7 and Eq. 9 in [47]). The rational behind the auxiliary tensors is that; we would like to have discriminative tensors that contain only the unique signals for each class, and common tensors to hold what is shared among the coupled tensors. A collective tensor factorization framework is then leveraged to jointly factorize the coupled tensors and the auxiliary tensors as follows: Loss for Auxiliary Tensor Factorization Loss for Pattern Hierarchy where: • W Z and W S are the core tensors with super diagonal entries; • f · is the function to enforce the similar components to be aligned correspondingly and defined as: • g · is the function that learns the shared mapping P l by the coupled tensors, between the patterns at the consecutive levels. We can consider this operation as performing a matrix decomposition from U l −1 c to U l c and P l as: assuming that we already have the values of U l −1 c for l-th level pattern discovery; • h · is an L1 penalty function that encourages sparsity in P l to promote the exclusive mapping between the factor matrices at the consecutive levels. Considering a more interpretable pattern hierarchy, we use L1-norm, since it can function as a proxy for the L0 norm, to minimize the number of nonzero elements while maintaining the convexity of the cost function when estimating P the others fixed. In this manner, we ensure the higher-level patterns are mapped to exclusive lower-level ones; and • λ 0 , λ 1 , λ 2 , λ 3 are the respective parameters to weigh in each objective.
With Eq. 4, common and discriminative patterns discovery at the l-th level becomes an optimization objective as: where θ l = {U, W Z , W S , P} l c .

Obtain the residual tensors.
Once θ l is determined, the reconstructed tensor can be obtained by:   Figure 3: The overview of iDisc's workflow. and therefore the residual tensor can be computed as: Let X l +1 c denote the tensors for the identification of common and discriminative patterns at the next level l + 1. We first obtain X l +1 c as: where E l c is the residual tensor as aforementioned. With X l +1 c , we can further identify the common and discriminative patterns at level l + 1 with Eq. 4. c } l . We use U c to denote the set of factor matrices for X c that correspond to modes other than m, and c to denote the class that is not c.
Since objective function J is not convex with respect to θ , we aim to find a local minimum for J by iteratively updating each in θ .
1. Update U c , fix others. The optimization of U c is equivalent to the following least squares loss functions [47]: where X c is the mode-m unfolding of tensor X c . Then the gradient update of U c can be computed as: 2. Update W Z c , fix others. Let w Z c denote w for tensor X c . The optimization of w Z c is equivalent to the following problem: where Λ w Zc ∈ R R×R×R is the tensor with w Z c as its super-diagonal entries. The gradient update of w c is: 3. Update W S c , fix others. Let w S c denote w for tensor X c . The optimization of w S c is equivalent to the following problem: The gradient update of w S c can be derived as: 4. Update P, fix others. The optimization of P is equivalent to a co-regularized collective matrix factorization problem with sparsity constraints [6,41]: Since the sparsity is applied to each row of P, each row P (r,:) can be updated based on the following gradient: The details of iDisc are summarized in Alg. 1 1 .

Time Complexity Analysis.
The time complexity is mainly consumed by updating each factor matrix U c in iDisc from computing ▽ U c J . From Eq. 11, we need to compute U c (⊙U c ) T (⊙U c ) and X c (⊙U c ) in the first term. Note U c ∈ R I m ×R and X c ∈ R I m × i m I i and therefore we have U c (⊙U c ) T (⊙U c ) ∈ R I m ×R and X c (⊙U c ) ∈ R I m ×R . The operation of matricized tensor times Khatri-Rao product (X c (⊙U c )) is often considered a bottleneck for CP decomposition due to the expensive computational cost [48]. In practice, the sparsity of the tensor is leveraged for an efficient computation for this operation [2,48]. Particularly, the complexity can be reduced by only considering the computation for nonzero observations in X c . Let x h denote the h-th nonzero observation in X c and its subscripts in X c as (I 1 h , I 2 h , . . . , I M h ). If there are H non-zeros, i.g, H = nnz(X c ), we would just need an H -vector to store the real values of X c . In this case, the element-wise computation for X c (⊙U c ) can be written as: where the computation of Khatri-Rao product can be ignored when x h = 0. Therefore, the time complexity for computing X c (⊙U c ) for each mode per iteration is O(nnz(X c )I m R). The element-wise of U c (⊙U c ) T (⊙U c ) can be efficiently computed as [1,42]:

Embedding Learning for the Unseen Student
This section explains the inference of the student's embedding in the latent space anchored by the factor matrices. The individual behavior tensor X t for unseen student t with an unknown class is constructed based on his or her logs with the system. The mode settings of tensor X t are the same as for the collective behavior tensors (e.g., X c ). With the students' individual tensor and the 1 Code is available at https://github.com/picsolab/iDisc factor matrices, we follow iDisc to first obtain the corresponding auxiliary Z tensors at l-th level, Z l tĉ , ∀ĉ ∈ {c,c}, for student t, by following the clapping function in Eq. 7 in [47]. Then, the equation to compute the embedding becomes: which follows a typical computation of the core tensor, given the data tensor and its factor matrices. It is worth noting that since PARAFAC decomposition does not enforce the orthogonal property in each of the factor matrices, direct computation of Eq. 20 is not feasible. Recent work by [8] proposes an efficient estimation of the core tensor for PARAFAC decomposition. By following the algorithm 1 in [8] to efficiently estimate v l c t and then the embedding of student s at l-th level v l t ∈ R 2R l can be obtained by:

EXPERIMENTS
In this section, we conduct systematic experiments to evaluate the quality of iDisc. In following, we first describe the data used for the experiments. The content of the rest of this section is structured to answer the following questions: • Can iDisc reveal meaningful patterns?
• How does iDisc perform in comparison with state-of-art methods from predictive modeling in learning analytics? • Can iDisc scale for the dataset at a massive scale?

Data
We experiment with nine courses/sessions from three publicllyavailable MOOC platforms: edX, ASSISTments, and XueTangX. The statistics of the datasets are provided in Table 2.
edX. The edX [33] dataset is comprised of de-identified data from "Introduction to Computer Science" (Fall 2012, Spring 2013, and Summer 2013) from MITx (Course A and B in Table 2) and HarvardX (Course C). This dataset does not have detailed event logs. However, the data are aggregated records, where each record represents the summary statistics for one individual's activity in the edX course with her demographic information. We select the three most popular courses from this dataset. For this dataset, we construct a 34x5x5 tensor as Area×Education×Stay. Our goal is to predict whether the course completion certificate is earned by a student at the end of the course. XueTangX. XueTangX [50] is one of the largest MOOC platforms in China. The full dataset includes 79,186 students enrolled in 39 classes. Each enrollment is associated with a log of the student's activities, including watching lecture videos, working on course problems, accessing course modules, and so on. In total, there are 8,157,277 activity logs, and the longest lifetime of enrollment is five weeks. We take the three most popular courses from this dataset. The dataset statistics are shown in Table 2. We use the first two weeks to learn the factor matrices by constructing a 14x7x2 tensor as Day×EventType×EventSource. The goal is to correctly predict course completion.
ASSISTments. ASSISTments [16] is an online tutoring system used by more than 50,000 students around the world [5]. On AS-SITments, students attempt to solve problems and receive feedback on those attempts. To assist the learning process, each problem is also associated with multiple knowledge components. We take the public dataset of the Math course on ASSISTments over three years (2004, 2005, and 2006) and the dataset characteristics are shown in Table 2. For each dataset, we construct a four-way tensor as Problem×KnowledgeComponent×ProblemView ×ActionDuration. Our aim is to classify the students as over-performing students and under-performing students, in terms of error-rate 2 .

Qualitative Examination of the Patterns
In this section, we use "Introduction to Computer Science" on MITx during Spring 2013 to illustrate the outputs of iDisc. We first qualitatively examine the patterns, as well as the pattern hierarchy generated and then explain the relationship between the patterns and the performance outcome. 5.2.1 Common and Discriminative Pattern Discovery. Figure 4 describes the outputs by iDisc, with {R} L l = {4, 2} (rank-4 at the first level and rank-2 at the second level). Particularly, Figure 4(e) shows the project matrix P 1 ∈ R 2×4 that represents the hierarchical relationships between the set of patterns at the first two levels, where darker colors indicate stronger associations. We observe that pattern #1 at level 1(P1@L1) is strongly associated with pattern #2 at level 0(P2@L0). This suggests that P1@L1 could be a child pattern for P2@L0. Similarly, we observe the parent-child relationship Table 3: Regression results for course end performance. Explanatory variables are the students' embeddings that correspond to the patterns presented in Figure 4 (a-d) (e.g., v 0 cer t if ied (2) refers to the values in students' embedding vector v for P2@L0 from the certified group). Note: * * :p<.05; * * * :p<.01. between P4@L0 and P2@L1. Since the projection matrix is shared by the factor matrices from the coupled tensors, it is important to note that each of figures 4(a) through 4(d) refers to both patterns for the certified and non-certified group (i.e. as shown left and right in Figure 4(a)). Due to the space limit, we only discuss the details of two sets of multi-way patterns as word clouds in Figure 4 Figure 4(a) describes a subgroup of students that are from the most populated countries (e.g., the United States, India, and Poland, based on U (ar ea) ) with education background mostly from Secondary to Master's (U (educat ion) ), and most of whom tend to stay on the edX platform (U (st ay ) ) for a relatively long period. While the pattern from the certified group of students shares almost identical distributions in the area and educational background, what makes them slightly different was primarily the time spent on the platform; certified students(U (st ay ) certified ) spend relatively longer than un-certified students (U (st ay ) -certified ). The set of patterns in Figure 4(c) is the child patterns of the Figure 4(a). Compared to their counterpart in level 0, they primarily describe the students from Indian (although the area distribution from the un-certified group of students spans more countries (U (ar ea) -certified )) with more focus on the middle level of education background (e.g., Secondary(U (educat ion) certified )). The difference in their length of stay on the platform was more prominent in this set of patterns, where certified students have the longest stays with the edX platform (U

Simpson's Paradox Revisited.
We performed multivariate logistic regression analysis to identify the patterns that can explain students' variation in obtaining the certificate for each level (M1 for level 0 and M2 for level 1). The dependent variable is whether or not the users are certified at the end of the course. The explanatory variables include the students' embeddings for the aforementioned patterns in Figure 4(a-d) in each level (e.g., v 1 certified (1) refers to the embeddings corresponding to P1@L1 certified ). The embeddings are standardized to facilitate comparison among different variables. Table 3 shows the estimated coefficients for M1 and M2. The only significant variable in M1 is the embeddings v 0 -certified (2) (β = −0.113, p < .05). This suggests that users who have shown more activities in line with pattern P2@L0 -certified appear to have less chance to earn the certificate. M2 shows that both embeddings v 1 certified (2) (β = 2.285, p < .01) and embeddings v 1 -certified (2) (β = 0.733, p < .01) reveal a significant and positive effect towards earning the certificate, with v 1 certified (2) having a much larger effect size. On the other hand, M2 also shows a significant and negative effect of having a larger value in v 1 -certified (1) (β = −0.620, p < .01). We can consider multi-way patterns as principal components in PCA that bridge the original feature interactions in the highdimensional space and the associated loadings. In this case, we would expect students from one class to have higher loading scores associated with patterns that are extracted from the same class.
The regression result in M1 shows that the un-certified group of students does have larger loading scores. However, that is true only in v 0 -certified (2) for P2@L0 . This is not surprising, because in level 0 the two sets of patterns are in fact very similar to each other, as shown in Figure 4. The results in M2 confirm this expectation, with significantly higher loading scores v 1 certified (2) for certified students and significant higher loading scores v 1 -certified (1) for un-certified students. Contrary to our expectation, the certified group of students also have higher loading scores in v 1 -certified (2) (although much lower than v 1 certified (2)). This could be explained by our observation in Figure 1, where P2@L1 captures a cluster of highly-educated students from certain European areas. They have a much higher chance of obtaining the certificate, regardless of having the longest stay or shortest stay with the platform.
Summary. We qualitatively examine the outputs generated by iDisc. The results show interesting properties of the proposed method. Our model reveals common and discriminative patterns at each level with their relationship explained via the projection matrix. Our regression analysis first explains the discriminative capability of the students' embeddings based on this set of patterns. More importantly, the analysis validates that the students' embeddings can be used to measure the relationship between the performance outcome with multi-way patterns from iDisc.

Quantitative Comparison
In this section, we report the results from the quantitative experiments in comparison with existing work commonly used in predictive analytics. Specifically, we conduct a classification task, in which the goal is to predict the students' performance at the end of the course defined in Section 5.1. 5.3.1 Baselines. We include baselines that are commonly seen in the area of predictive analytics in educational data mining as: • Raw. We use the raw activity counts each day as features to train classifiers for prediction. This is the most common approach in predictive modeling for MOOCs. • LDA. Coleman et al. [4] use LDA to capture the temporal element of the behavior data. We first discover the latent behavior patterns from Raw features with a varying number of topics, and use the topic membership of each student for the classification task. • LadFG [36]. As one of the most cited works in MOOC predictive modeling, LadFG is a latent dynamic factor graph model that finds a mapping from students' time-varying attribute tensor to the observed learning outcome. We only evaluate the performance of LadFG in XueTangX dataset because it is the only one of the three that contains the necessary temporal dynamics. • Factorization machines (FM) [37]. Factorization machines have been proposed and successfully applied to recommendation and prediction tasks. As the factorization model projects the input feature space into a latent space, it enables the learning of more complex interactions between features. We first convert each dimension as dummy variables for each student, and then concatenate all dimensions as a wide feature matrix.
It is worth noting the recent use of Deep Neural Nets (DNN) and their variants have shown promising performance compared to conventional machine learning approaches (e.g., [21,45,49]). However, the lack of interpretability of these models prevents their further application in problems driven by both interpretations and performance gains. We also compare iDisc with the existing work on discovering the common and discriminative patterns from multiway data.
• SDCDNTF. SDCDNTF extends [22] and learns the common and discriminative patterns with different ranks. The input to the model consists of the rank and the number of shared patterns, along with the coupled tensors. • PairFac. PairFac [47] learns the common and discriminative patterns with different ranks. Comparing to SDCDNTF, it does not require the input of the split.
We would like to point out that standard tensor factorization could serve as another baseline to compare with, in which students or their class reside as one of the dimensions and the corresponding factor matrix can naturally become features for downstream prediction tasks. However, we did not include it for two reasons: first, because the aforementioned baselines work as inductive models, where unseen students can be predicted based on the learned parameters, while simple tensor factorization serves as a transductive model and only predicts for the students that are available in the factorization; and second, because student populations on MOOC platforms can be of any size, from small to very large, the efficiency of standard tensor factorization with a large dimension size could be a practical problem for its real-world application.

Experiment Settings.
For each dataset, we draw a training set of students from each class with replacement, and then obtain the embeddings of the out-of-bootstrap students. For this set of students, we perform a five-fold cross-validation with a k Nearest Neighbors classifier. We conduct five independent trials of this experiment and report the average classification accuracy. We select accuracy since both the training and testing dataset are constructed in a way that each class has an equal amount of students. For SDCDNTF, we experiment with α, β and γ ∈ {10 −5 , 10 −4 , 10 −3 }. Finally, we set α and β in the same range, and R = 6 for both PairFac and SDCDNTF and derive two versions of SDCDNTF using K ∈ {2, 4}. To make a fair comparison with PairFac and SDCDNTF , we use two-level pattern discovery with rank-4 and rank-2 in each level for iDisc, respectively. For LDA and FM, we experiment with a varying number of topics /factors and report the best performance. For LadFG, we keep the suggested parameters from their paper.  Table 4 shows the classification results. The raw features perform poorly, especially in XueTangX and AS-SISTments dataset, with the score on accuracy in the range of 60-70%, which suggests the difficulty of the prediction task. LDA saw different performance, with noticeable drops in the edX dataset in comparison to raw features. We conjecture that the construction of the raw features results in a high dimensional and sparse feature space, which could potentially cause LDA to suffer from learning merely meaningful latent topics. Factorization machines slightly improve the performance. FMs can be considered as a generalization of tensor factorization with the additional modeling of interactions within each dimension [37]. Although we observe noticeable gains from FMs over Raw features and LDA, FMs do not perform as well as tensor-based methods. We suspect there might be two reasons for this: 1) the current feature space might not be as well tuned for general prediction tasks as it is for the more commonly seen recommendation tasks; 2) compared to tensor-based methods that only consider the interactions between different dimensions, FMs could potentially over-fit the interaction effects between and within dimensions in the training data. We observe that LadFG achieves large gains over the raw features for the XueTangX dataset. This indicates there could exist some hidden patterns that can capture the temporal elements of the behavior data. We also notice that tensor-based models such as PairFacand SDCDNTFperform better than LDA, especially in Course 2 and Course 3. This suggests that by systematically considering the multi-way interactions, the performance could be further improved. Finally, the best performance of iDisc is statistically comparable with the state-of-art LadFG and significantly better than the rest of the baselines. While LadFG is geared towards student performance predictions, iDisc can provide comparable prediction performance as well as meaningful patterns. Summary. iDisc constructs students' embeddings that integrate relations between the multi-way interactions and the performance outcome. The quantitative experiment demonstrates the discriminative capability of iDisc, and iDisc outperforms the baselines in nine datasets from three MOOC platforms. Higher-level embeddings from iDisc have shown stronger discriminative powers over ones from the lower levels, which we will discuss in next section. Since there is no trivial solution in determing the rank of the tensor decomposition, we experimented with different rank settings for iDisc (e.g., {2, 4}, {3, 3}) and this observation still holds.

Scalability
Since many of the education platforms have seen exponential growth in usage, scalable solutions of learning analytic are another critical aspect of adoption. In this section, we test the scalability of iDisc. In this experiment, we choose the ASSISTments dataset for the year of 2006, since it has the largest tensor settings in our experiment. We run iDisc with a varying number of entries in the data tensor, from {10 2 , 10 3 , 10 4 , 10 5 , 10 6 }. Table 5 reports the average running time per epoch, and we observe that the running time scales almost linearly with the exponential increase in observations in the tensor. This result is consistent with the analysis in Section 4.1.4.

CONCLUSION
In this paper, we present a tensor-based learning framework, iDisc, to perform common and discriminative pattern discovery at multiple levels for understanding of high-dimensional student behavior and performance prediction in MOOCs. We first use tensors to represent each user's behavior, and construct coupled tensors to aggregate behavior for users with contrasting performance groups. Then, we iteratively identify the shared and distinct behavioral patterns at various levels, while revealing the hierarchical relationship between them to further increase the interpretability of the output. Finally, we use these patterns as anchors to generate the students' latent representation for down-stream performance prediction. Our qualitative examination of the patterns has shown the multi-level, multi-aspect and hierarchical characteristics of behavior patterns on the edX platform. The quantitative experiments, compared to both traditional predictive methods as well as existing discriminative tensor factorization models, suggest promising results by iDisc in several datasets from different MOOC platforms.
To the best of our knowledge, this is the first attempt to tackle the joint problem of discriminant tensor factorization and hierarchical pattern discovery for understanding such behavior on MOOC platforms. This enables the in-depth comprehension of students' multi-way behavior dynamics, as well as its association with course performance. Nevertheless, one of the limitations is that it merely provides the relationships between the latent multi-way interaction and the performance outcome, with no intention to draw causal reasoning between them. In practice, iDisc can be developed as a plugin for MOOC platforms, where instructors can examine the multi-aspect contrasting behavior and connect the difference to the course outcome. Considering the XueTangX platform, one of the multi-aspect patterns could refer to a set of events at the beginning of the course that trigger from the server. if iDisc reveals its positive association to the success of students' course end performance, this pattern can be used as guidance of promotions for both the instructor and the platform to improve the students' learning outcome. Last, but not least, compared to other tensor factorization methods, iDisc provides a more efficient exploration of the multi-aspect patterns due to its multi-level nature. However, we understand that the interpretation of the multi-aspect pattern itself is not straightforward in general. In our future work, we would like to follow a more human-centric approach and develop a visual analytic system that helps domain experts interpret and understand the multi-aspect patterns.