A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier.
The entries in the confusion matrix have the following meaning in the context of our study:
- a is the number of correct predictions that an instance is negative,
- b is the number of incorrect predictions that an instance is positive,
- c is the number of incorrect of predictions that an instance negative, and
- d is the number of correct predictions that an instance is positive.
| Predicted | |||
| Negative | Positive | ||
| Actual | Negative | a | b |
Several standard terms have been defined for the 2 class matrix:
- The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:
- The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:
- The false positive rate (FP) is the proportion of negatives cases that were incorrectly classified as positive, as calculated using the equation:
- The true negative rate (TN) is defined as the proportion of negatives cases that were classified correctly, as calculated using the equation:
- The false negative rate (FN) is the proportion of positives cases that were incorrectly classified as negative, as calculated using the equation:
- Finally, precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:
The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998). Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), as defined in equations 7 and 8, and F-Measure (Lewis and Gale, 1994), as defined in equation 9. 数据挖掘研究院
[7]
[8] 数据挖掘研究院
[9]
In equation 9, b has a value from 0 to infinity and is used to control the weight assigned to TP and P. Any classifier evaluated using equations 7, 8 or 9 will have a measure value of 0, if all positive cases are classified incorrectly. 数据挖掘实验室
Another way to examine the performance of classifiers is to use a ROC graph, described on the next page.
ROC graphs are another way besides confusion matrices to examine the performance of classifiers (Swets, 1988). A ROC graph is a plot with the false positive rate on the X axis and the true positive rate on the Y axis. The point (0,1) is the perfect classifier: it classifies all positive cases and negative cases correctly. It is (0,1) because the false positive rate is 0 (none), and the true positive rate is 1 (all). The point (0,0) represents a classifier that predicts all cases to be negative, while the point (1,1) corresponds to a classifier that predicts every case to be positive. Point (1,0) is the classifier that is incorrect for all classifications. In many cases, a classifier has a parameter that can be adjusted to increase TP at the cost of an increased FP or decrease FP at the cost of a decrease in TP. Each parameter setting provides a (FP, TP) pair and a series of such pairs can be used to plot an ROC curve. A non-parametric classifier is represented by a single ROC point, corresponding to its (FP,TP) pair.
The above figure shows an example of an ROC graph with two ROC curves labeled C1 and C2, and two ROC points labeled P1 and P2. Non-parametric algorithms produce a single ROC point for a particular data set. 数据挖掘实验室
Features of ROC Graphs
- An ROC curve or point is independent of class distribution or error costs (Provost et al., 1998).
- An ROC graph encapsulates all information contained in the confusion matrix, since FN is the complement of TP and TN is the complement of FP (Swets, 1988).
- ROC curves provide a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified.
Area-based Accuracy Measure 数据挖掘研究院
It has been suggested that the area beneath an ROC curve can be used as a measure of accuracy in many applications (Swets, 1988). Provost and Fawcett (1997) argue that using classification accuracy to compare classifiers is not adequate unless cost and class distributions are completely unknown and a single classifier must be chosen to handle any situation. They propose a method of evaluating classifiers using a ROC graph and imprecise cost and class distribution information. 数据挖掘研究院
Euclidian Distance Comparison
Another way of comparing ROC points is by using an equation that equates accuracy with the Euclidian distance from the perfect classifier, point (0,1) on the graph. We include a weight factor that allows us to define relative misclassification costs, if such information is available. We define ACd as a distance based performance measure for an ROC point and calculate it using the equation:
[22] 数据挖掘研究院
where W is a factor, ranging from 0 to 1, that is used to assign relative importance to false positives and false negatives. ACd ranges from 0 for the perfect classifier to sqrt(2) for a classifier that classifies all cases incorrectly, point (1,0) on the ROC graph. ACd differs from g-mean1, g-mean2 and F-measure in that it is equal to zero only if all cases are classified correctly. In other words, a classifier evaluated using ACd gets some credit for correct classification of negative cases, regardless of its accuracy in correctly identifying positive cases. 数据挖掘研究院
Example 数据挖掘研究院
Consider two algorithms A and B that perform adequately against most data sets. However, assume both A and B misclassify all positive cases in a particular data set and A classifies 10 times the number of infrequent itemsets as potentially frequent compared to B. Algorithm B is the better algorithm in this case because there has been less wasted effort counting infrequent itemsets. 数据挖掘研究院

