Predicting students' academic performance using a modified kNN algorithm



INTRODUCTION
Data understanding is an important step for accurate analysis. Data pre-processing is the first step needed to aid algorithms and to improve efficiency before proceeding to the actual analysis. Data variables generally fall into one of four broad categories: nominal scale, ordinal scale, interval scale, and ratio scale [1]. Nominal values have no quantitative meaning; they represent categories or classifications. For example, a gender variable takes the values (male, female), and a marital-status variable takes values such as (married, unmarried, divorced, separated); both examples simply denote categories [1]. Ordinal variables show an order in measurement, for example the low/medium/high values of a size variable: an ordering exists, but the distances between the categories cannot be quantified. Interval scales provide order information and, in addition, possess equal intervals; temperature measured on the Fahrenheit or Celsius scale is an interval data type. A ratio scale possesses the qualities of the nominal, ordinal, and interval scales and also has an absolute zero value, which permits comparisons between the values of different variables.
The k-Nearest Neighbors (kNN) algorithm is a straightforward method that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). It is a nonparametric technique that has been used in statistical estimation and pattern recognition since 1970, and it classifies new cases by a majority vote. Most data mining techniques cannot handle categorical variables unless they are converted to numerical ones. For example, the dataset [2] has a mixed set of attributes, categorical and numerical, so a pre-processing step is needed to transform the categorical attributes into numerical ones. There are many techniques for handling categorical values, such as mapping and label encoding in Pandas (Python). However, assigning numerical values to nominal attributes can mislead machine learning algorithms by introducing differences or an order between values that do not exist in the original attribute; this phenomenon is called subjectivity. For instance, in a gender attribute, male can be encoded as 1 and female as 0, or the opposite; there is no standard way to encode nominal variables.
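As a minimal illustration of the subjectivity problem, the snippet below (a hypothetical toy example, not taken from the paper's dataset) shows how two equally valid label encodings of the same nominal attribute yield different numeric columns, so any distance computed on them changes with the arbitrary encoding choice:

```python
# Hypothetical toy column illustrating the subjective-encoding problem.
gender = ["male", "female", "female", "male"]

# Two equally "valid" label encodings of the same nominal attribute.
enc_a = [{"male": 1, "female": 0}[g] for g in gender]
enc_b = [{"male": 0, "female": 1}[g] for g in gender]

print(enc_a)  # [1, 0, 0, 1]
print(enc_b)  # [0, 1, 1, 0]
# Any numeric difference |a - b| between two samples now depends on the
# arbitrary choice of encoding, which is the subjectivity issue above.
```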
Jawthari et al. [3] studied the effect of subjectivity that arises when numerical values are assigned to non-ordinal categorical attributes. That research shed light on subjectivity using an educational dataset, specifically a student performance prediction dataset. The present research proposes two similarity measures for the kNN algorithm that handle categorical variables without converting them to numerical values; the algorithm therefore avoids the subjective encoding issue.

RELATED WORKS
Educational Data Mining (EDM) is an evolving discipline concerned with developing methods for exploring the specific and increasingly large-scale data that come from educational environments, and with using these methods to better understand students and the environments in which they learn [3, 4]. One concern of EDM is predicting students' performance. The previous work [3] used various Machine Learning (ML) techniques to predict students' performance, and it also examined how the way of encoding nominal variables affects the classification accuracy of those techniques, showing that accuracy was indeed affected by the encoding approach. That study [3] recommended solving the problem with a method that does not need to convert nominal attributes to numeric ones; hence, the aim of this study is to find a solution for that issue.
kNN is one of the most popular classification algorithms due to its simplicity [5]. It stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions): a new sample is classified by a majority vote of its k nearest neighbors, as measured by a distance function, and assigned to the class most common among them. Euclidean distance, Eq. (1), is the usual similarity measure used by kNN, especially for continuous attributes, and the result depends mainly on the value of k. Figure 1 shows how the value of k affects the class assignment: p refers to a new point to be classified as either the dark-square label or the empty-circle label. Here, p belongs to the dark-square class if k = 1; if k = 5, it is classified as the small-circle class by the majority vote rule [6, 7].
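The plain Euclidean-distance, majority-vote rule described above can be sketched as follows; the data points and labels below are invented purely for illustration:

```python
import math
from collections import Counter

def euclidean(x, y):
    # Eq. (1): straight-line distance between two numeric feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, query, k):
    # train: list of (feature_vector, label) pairs. The k nearest
    # neighbours (by Euclidean distance) vote on the query's label.
    neighbours = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((0, 0), "circle"), ((1, 0), "circle"), ((0, 1), "circle"),
         ((5, 5), "square"), ((6, 5), "square")]
print(knn_predict(train, (1, 1), 3))  # circle
```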

Distance functions
To measure the distance between points X and Y in a feature space, various distance functions have been used in the literature, of which the Euclidean distance, Eq. (1), is the most widely used [8]. Other functions, Eqs. (3) and (4), are also used to calculate the distance between continuous variables. For categorical variables, the Hamming distance, Eq. (2), is used. Equation (3) gives the distance between two sets A and B and is employed here as a function to find the distance between categorical variables. Let X and Y be represented by feature vectors X = {x_1, x_2, ..., x_m} and Y = {y_1, y_2, ..., y_m}, where m is the dimensionality of the feature space and d(x, y) denotes the distance between them.
The k-prototypes algorithm combines the k-means and k-modes algorithms to deal with mixed data types [9], which makes it particularly useful in practice because real-world data is mixed. Assume a set of n objects, D = {X_1, X_2, ..., X_n}. Each X_i, called a sample or row, consists of m attributes, X_i = {x_i1, x_i2, ..., x_im}, comprising both numerical and categorical attributes (m_n numerical attributes and m_c categorical attributes). The aim of the algorithm is to partition the n samples into k disjoint clusters C = {C_1, C_2, ..., C_k}, where C_i represents the i-th cluster. K-prototypes calculates the distances over the numerical features and the categorical features separately and merges the results.
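The k-prototypes idea of measuring the numeric and categorical parts separately and merging the results can be sketched as below; the trade-off weight gamma and the example vectors are assumptions for illustration, not values from the paper:

```python
import math

def mixed_distance(x_num, x_cat, y_num, y_cat, gamma=1.0):
    # Numeric part: Euclidean distance, Eq. (1).
    d_num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_num, y_num)))
    # Categorical part: Hamming distance, Eq. (2) -- count of mismatches.
    d_cat = sum(a != b for a, b in zip(x_cat, y_cat))
    # Merge the two parts; gamma weights the categorical contribution
    # (a hypothetical knob, in the spirit of k-prototypes).
    return d_num + gamma * d_cat

# One attribute differs categorically, numeric parts are identical:
print(mixed_distance([1.0, 2.0], ["male", "low"],
                     [1.0, 2.0], ["female", "low"]))  # 1.0
```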
kNN predicts the class label by majority voting among its k nearest neighbors. Let X = {(x_i, y_i)}_{i=1}^{N}, where x_i is an m-dimensional feature vector and y_i is the corresponding label. Given a new point, or query, x_q, its unknown label can be predicted in two steps. The first step uses Eq. (1) to identify the set of its k nearest neighbors, arranged in decreasing order of Euclidean distance. The indicator δ(y = y_i) in Eq. (6) takes the value one if y = y_i and zero otherwise, where y is a class label and y_i is the class label of the i-th nearest neighbor among the k nearest neighbors [10]. Although the majority vote is simple, it has two drawbacks: ties are possible, and all distances are weighted equally. To overcome these issues, a weighted voting method for kNN, the distance-Weighted k-Nearest Neighbor rule (WkNN), was first introduced by Dudani [11]. In WkNN, the closest neighbors are weighted more heavily than the farther ones using a distance-weighted function. In this paper, Eq. (7) is used as the distance-weighted function to obtain the weight w_i for the i-th nearest neighbor of the query, and Eq. (8) is used for voting to predict the label of the new point. The harmonic series, Eq. (9), is also used as a vote rule; it assigns weights by the ranks of the k nearest distances (1, 2, ..., k) instead of the distances themselves, and its accuracy results are compared with those obtained by the weighted vote rule.
The literature is rich in methods for clustering mixed-type datasets, but to the best of the authors' knowledge, no existing method classifies categorical data using the similarity measures proposed in this paper. The proposed idea of this study was inspired by k-prototypes: the Hamming and Jaccard distance functions are employed to obtain the distance between nominal attributes, while the Euclidean distance is used as usual for the numerical attributes. The final distance is obtained by combining the distance over the nominal variables with the distance over the numerical variables.
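The two voting rules can be sketched as follows. Since the paper's exact Eqs. (7)-(9) are not reproduced in this text, the inverse-distance weight 1/d used below is a common choice for the distance-weighted function and should be read as an assumption:

```python
from collections import defaultdict

def weighted_vote(neighbours):
    # neighbours: list of (distance, label), sorted ascending by distance.
    # Distance-weighted rule (WkNN): closer neighbours count more. The
    # 1/d weight is an assumption standing in for the paper's Eq. (7).
    scores = defaultdict(float)
    for dist, label in neighbours:
        scores[label] += 1.0 / (dist + 1e-12)  # guard against zero distance
    return max(scores, key=scores.get)

def harmonic_vote(neighbours):
    # Harmonic-series rule: weight by rank (1, 1/2, 1/3, ...) instead of
    # by the raw distances, as described for Eq. (9).
    scores = defaultdict(float)
    for rank, (_, label) in enumerate(neighbours, start=1):
        scores[label] += 1.0 / rank
    return max(scores, key=scores.get)

nbrs = [(0.5, "pass"), (0.6, "fail"), (0.7, "fail")]
print(weighted_vote(nbrs), harmonic_vote(nbrs))  # fail pass
```

Note that the two rules can disagree, as the example shows: the raw-distance weights favour the two "fail" neighbours, while the steeper rank weights favour the single closest "pass" neighbour.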

PROPOSED KNN ALGORITHM
Motivated by the k-prototypes algorithm mentioned above, the issue of subjective encoding of nominal variables, and the shortcomings of the majority vote, a simple and effective kNN method is designed by proposing two similarity measures. The method does not require encoding nominal variables. In addition, the enhanced method uses the distance-weighted vote rule, which gives greater weight to closer neighbors.
The algorithm is described below.
Pre-processing:
- Assume the dataset is split into test and train sets containing only nominal and continuous attributes.
- Encode nominal attributes using one_hot encoding (optional). This step is used because the SciPy library in Python calculates the Hamming and Jaccard distances faster on encoded vectors.

Algorithm 1 (Hamming distance version); Algorithm 2 (Jaccard distance version)
The other version of the algorithm uses the Jaccard distance, Eq. (3), to calculate the similarity between categorical attributes, and likewise uses the Euclidean distance for numeric attributes. Steps 2, 3, 4, and 5 above are the same for both versions.
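A minimal sketch of the Jaccard-based version might look as follows. The representation of a nominal row as a set of "attribute=value" tokens, the additive combination of the two distances, and the inverse-distance vote weight are all assumptions about details not spelled out in this text:

```python
import math
from collections import defaultdict

def jaccard_distance(a, b):
    # Eq. (3): 1 - |A ∩ B| / |A ∪ B| between two sets. Each categorical
    # row is treated as a set of "attribute=value" tokens (an assumption).
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def predict(train, query, k):
    # train: list of (numeric_vector, category_set, label) triples.
    # query: (numeric_vector, category_set).
    # Distance = Euclidean on the numeric part + Jaccard on the nominal
    # part, followed by a distance-weighted vote over the k nearest rows.
    scored = []
    for num, cat, label in train:
        d_num = math.sqrt(sum((x - y) ** 2 for x, y in zip(num, query[0])))
        scored.append((d_num + jaccard_distance(cat, query[1]), label))
    scored.sort(key=lambda s: s[0])
    votes = defaultdict(float)
    for dist, label in scored[:k]:
        votes[label] += 1.0 / (dist + 1e-12)
    return max(votes, key=votes.get)

# Invented toy rows: (numeric features, nominal tokens, class label).
train = [([10.0], {"gender=male"}, "fail"),
         ([80.0], {"gender=female"}, "pass"),
         ([75.0], {"gender=male"}, "pass")]
query = ([78.0], {"gender=male"})
print(predict(train, query, 2))  # pass
```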

DATA SET
The data set was collected using a learner activity tracker tool called experience API (xAPI). The purpose was to monitor the students' behavior in order to evaluate the features that may impact their performance [2].

Data mining
The dataset includes 480 student records with 16 features, as shown in Table 1.

end for
Step 2: Sort the distances in ascending order.
Step 3: Find the k nearest neighbors of the query.
Step 4: Calculate the weights of the k nearest neighbors using the distance-weighted function, Eq. (7).
Step 5: Assign a class label to the query, using the weights from the previous step, by either:
a) the majority weighted voting rule, Eq. (8), or
b) the harmonic vote rule, Eq. (9).

Figure 2 shows the relationship between the class variable and the numerical features, and it also shows the importance of the behavior features: students who participated more in VisiTedResourses, AnnouncementViews, RaisedHands, and Discussion achieved better results.

RESULT AND ANALYSIS
First, the dataset is randomly split into training and testing sets with a 20% test ratio. The training and test sets were each further split into a dataset of numerical attributes and a dataset of categorical attributes. The enhanced kNN with the proposed similarity measures was then applied. To compare the performance of the proposed method, the categorical variables of the dataset were one_hot encoded and a standard kNN from the Scikit-learn library was used [12].
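The hold-out split described above can be sketched as below, assuming a simple random shuffle; the function and variable names are hypothetical, and the record count of 480 matches the dataset described earlier:

```python
import random

def holdout_split(rows, test_ratio=0.2, seed=None):
    # Shuffle the rows and hold out test_ratio of them for testing,
    # mirroring the 20% random split used in the experiments.
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

train_rows, test_rows = holdout_split(list(range(480)), test_ratio=0.2, seed=42)
print(len(train_rows), len(test_rows))  # 384 96
```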
The best accuracy obtained by the standard kNN was 66.7 with k = 1, as shown in Table 2, column 1. The best result of the Hamming distance with harmonic vote kNN was 72.9 with k = 18-20 (Table 2, column 2). The Hamming distance with distance-weighted vote version of kNN reached its best accuracy of 78.1 with k = 12 and 20. Columns 4 and 5 show the accuracy results of the kNN method using the Jaccard distance for nominal variables. The Jaccard with harmonic vote kNN reached 76.0 with k = 6 and 19, as described in column 4 of Table 2. The best version was the one that used the Jaccard distance with the distance-weighted vote, shown in the last column of Table 2; it achieved 80.2 accuracy with k = 6. When the algorithms were run multiple times, the standard kNN produced different accuracy results; for example, in one run it reached 77.0 accuracy with k = 1. The proposed method, on the other hand, produced almost the same results in each run; therefore, the proposed method is not sensitive to outliers in the data. Consequently, the proposed kNN algorithm outperforms the standard kNN in accuracy. Figures 3 and 4 illustrate these results.

CONCLUSION
This paper introduces two similarity measures that allow kNN to work with mixed-type data, especially nominal attributes. The proposed method also reduces kNN's sensitivity to outliers by considering alternative voting rules. The contribution of this research is the design of a distance function that makes the classification decision without converting nominal variables to numeric ones. To verify the proposed classifier, experiments were conducted on the educational dataset and the results were compared with the kNN algorithm applied after one_hot encoding the nominal attributes. The experiments showed that the proposed method using the Jaccard distance consistently outperformed the standard kNN by 14%. The enhanced kNN algorithm showed good performance in terms of accuracy, but it was slow compared to the scikit-learn kNN. In future work, the algorithm's speed will be improved by incorporating fast comparison techniques, and the algorithm will be applied to datasets from other fields to further demonstrate its performance.

Fig. 2. Correlation between the dependent variable and numerical variables