RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

数据挖掘分类算法 : Classification-rule learning

来源: 作者:unkonwn 时间:2004-12-11 点击:

Classification-rule learning involves finding rules or decision trees that partition given data into predefined classes. For any realistic problem domain of the classification-rule learning, the set of possible decision trees is too large to be searched exhaustively. In fact, the computational complexity of finding an optimal classification decision tree is NP hard.

Most of the existing induction-based algorithms use Hunt′s method as the basic algorithm.[2] Here is a recursive description of Hunt′s method for constructing a decision tree from a set T of training cases with classes denoted {C1, C2, … ,Ck }. 数据挖掘研究院

Case 1 T contains one or more cases, all belonging to a single class Cj : The decision tree for T is a leaf identifying class Cj .
Case 2 T contains no cases: The decision tree for T is a leaf, but the class to be associated with the leaf must be determined from information other than T.
Case 3 T contains cases that belong to a mixture of classes: A test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2 , .. ,On }. T is partitioned into subsets T1, T2, … ,Tn , where Ti contains all the cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome. The same tree building machinery is applied recursively to each subset of training cases.
 
 

数据挖掘实验室

Outlook
Temp(F)
Humidity(%)
Windy?
Class
sunny
75
70
true
Play
sunny
80
90
true
Don′t Play
sunny
85
85
false
Don′t Play
sunny
72
95
false
Don′t Play
sunny
69
70
false
Play
overcast
72
90
true
Play
overcast
83
78
false
Play
overcast
64
65
true
Play
overcast
81
75
false
Play
rain
71
80
true
Don′t Play
rain
65
70
true
Don′t Play
rain
75
80
false
Play
rain
68
80
false
Play
rain
70
96
false
Play
Table 2.1: A small training data set [2]

Figure 2.1: Demonstration of Hunt′s Method
 
 
Attribute Value
Class
Play Don′t Play 
sunny 2
overcast 4
rain 3
Table 2.2: Class Distribution Information of Attribute Outlook
 
Attribute Value Binary Test 
Class
Play  Don′t Play 
65 1 0
> 8 5
70 3 1
> 6 4
75 4 1
> 5 4
78 5 1
> 4 4
80 7 2
> 2 3
85 7 3
> 2 2
90 8 4
> 1 1
95 8 5
> 1 0
96 9 5
> 0 0
Table 2.3: Class Distribution Information of Attribute Humidity

Table 2.1 shows a training data set with four data attributes and two classes. Figure 2.1 shows how the Hunt′s method works with the training data set. In the case 3 of the Hunt′s method, a test based on a single attribute is chosen for expanding the current node. The choice of an attribute is normally based on the entropy gains of the attributes. The entropy of an attribute is calculated from class distribution information. For a discrete attribute, class distribution information of each value of the attribute is required. Table 2.2 shows the class distribution information of data attribute Outlook. For a continuous attribute, binary test involving all the distinct values of the attribute is considered. Table 2.3 shows the class distribution information of data attribute Humidity. Once the class distribution information of all the attributes are gathered, the entropy is calculated based on either information theory or Gini Index. One attribute with the most entropy gain is selected as a test for the node expansion. 数据挖掘实验室
 

  • ID3 algorithm
The ID3 algorithm (Quinlan86) is a decision tree building algorithm which determines the classification of objects by testing the values of the their properties. It builds the tree in a top down fashion, starting from a set of objects and a specification of properties. At each node of the tree, a property is tested and the results used to partition the object set. This process is recursively done till the set in a given subtree is homogeneous with respect to the classification criteria - in other words it contains objects belonging to the same category. This then becomes a leaf node. At each node, the property to test is chosen based on information theoretic criteria that seek to maximize information gain and minimize entropy. In simpler terms, that property is tested which divides the candidate set in the most homogeneous subsets.
 
  • C4.5 algorithm
This algorithm was proposed by Quinlan (1993). The C4.5 algorithm generates a classification-decision tree for the given data-set by recursive partitioning of data. The decision is grown using Depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects a test that gives the best information gain. For each discrete attribute, one test with outcomes as many as the number of distinct values of the attribute is considered. For each continuous attribute, binary tests involving every distinct values of the attribute are considered. In order to gather the entropy gain of all these binary tests efficiently, the training data set belonging to the node in consideration is sorted for the values of the continuous attribute and the entropy gains of the binary cut based on each distinct values are calculated in one scan of the sorted data. This process is repeated for each continuous attributes.
  • SLIQ algorithm
SLIQ (Supervised Learning In Quest) developed by IBM′s Quest project team, is a decision tree classifier designed to classify large training data [1]. It uses a pre-sorting technique in the tree-growth phase. This helps avoid costly sorting at each node. SLIQ keeps a separate sorted list for each continuous attribute and a separate list called class list. An entry in the class list corresponds to a data item, and has a class label and name of the node it belongs in the decision tree. An entry in the sorted attribute list has an attribute value and the index of data item in the class list. SLIQ grows the decision tree in breadth-first manner. For each attribute, it scans the corresponding sorted list and calculate entropy values of each distinct values of all the nodes in the frontier of the decision tree simultaneously. After the entropy values have been calculated for each attribute, one attribute is chosen for a split for each nodes in the current frontier, and they are expanded to have a new frontier. Then one more scan of the sorted attribute list is performed to update the class list for the new nodes.

While SLIQ handles disk-resident data that are too large to fit in memory, it still requires some information to stay memory-resident which grows in direct proportion to the number of input records, putting a hard-limit on the size of training data. The Quest team has recently designed a new decision-tree-based classification algorithm, called SPRINT (Scalable PaRallelizable INduction of decision Trees) that for the removes all of the memory restrictions.

数据挖掘研究院


 

数据挖掘研究院

  • Other Algorithms
There are many other machine learning algorithms , discussing all of them is outside the scope of this paper. Some of these algorithms are listed below [8].

Nearest-neighbor The classical nearest-neighbor with options for weight setting, normalizations, and editing (Dasarathy 1990, Aha 1992, Wettschereck 1994).

Naive-Bayes A simple induction algorithm that assumes a conditional independence model of attributes given the label (Domingos & Pazzani 1996, Langley, Iba & Thompson 1992, Duda & Hart 1973, Good 1965).

数据挖掘研究院

OODG Oblivous read-Once Decision Graph induction algorithm described in Kohavi (1995c).

数据挖掘研究院

Lazy decision trees An algorithm for building the ``best′′ decision tree for every test instance described in Friedman, Kohavi & Yun (1996). 数据挖掘研究院

Decision Table A simple lookup table. A simple algorithm that is useful with feature subset selection 数据挖掘研究院

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?