RSS
热门关键字:  数据挖掘  数据仓库  商业智能  人工智能  搜索引擎

决策树规则与剪枝(Decision Tree Rules & Pruning)

来源: 作者:unkonwn 时间:2004-11-27 点击:

References:

  • T. Mitchell, 1997.
  • R. Myers, R. Walpole, "Tests of Hypotheses", in R. Myers, R. Walpole, Probability and Statistics for Engineers and Scientists, Second Edition, Macmillan Publishing Co., Inc., New York, NY, 1978, pp. 268 - 273.
  • P. Winston, 1992.

Rule Generation 数据挖掘研究院

Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent set of rules.

Converting a decision tree to rules before pruning has three main advantages: 数据挖掘研究院

  1. Converting to rules allows distinguishing among the different contexts in which a decision node is used.
    • Since each distinct path through the decision tree node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.
    • In contrast, if the tree itself were pruned, the only two choices would be:
      1. Remove the decision node completely, or
      2. Retain it in its original form.
  2. Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.
    • We thus avoid messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
  3. Converting to rules improves readability.
    • Rules are often easier for people to understand.

To generate rules, trace each path in the decision tree, from root node to leaf node, recording the test outcomes as antecedents and the leaf-node classification as the consequent.

数据挖掘研究院

Rule Simplification Overview 数据挖掘实验室

Once a rule set has been devised: 数据挖掘研究院

  1. Eliminate unecessary rule antecedents to simplify the rules.
    • Construct contingency tables for each rule consisting of more than one antecedent.
      • Rules with only one antecedent cannot be further simplified, so we only consider those with two or more.
    • To simplify a rule, eliminate antecedents that have no effect on the conclusion reached by the rule.
    • A conclusion′s independence from an antecendent is verified using a test for independency, which is
      • a chi-square test if the expected cell frequencies are greater than 10.
      • Yates′ Correction for Continuity when the expected frequencies are between 5 and 10.
      • Fisher′s Exact Test for expected frequencies less than 5.
  2. Eliminate unecessary rules to simplify the rule set.
    • Once individual rules have been simplified by eliminating redundant antecedents, simplify the entire set by eliminating unecessary rules.
    • Attempt to replace those rules that share the most common consequent by a default rule that is triggered when no other rule is triggered.
      • In the event of a tie, use some heuristic tie breaker to choose a default rule.

Contingency Tables

The following is a contingency table, a tabular representation of a rule.

数据挖掘实验室

  C1 C2 Marginal Sums
R1 x11 x12 R1T = x11 + x12
R2 x21 x22 R2T = x21 + x22
Marginal Sums CT1 = x11 + x21 CT2 = x12 + x22 T = x11 + x12 + x21 + x22

R1 and R2 represent the Boolean states of an antecedent for the conclusions C1 and C2 数据挖掘实验室
(C2 is the negation of C1).
x11, x12, x21 and x22 represent the frequencies of each antecedent-consequent pair.
R1T, R2T, CT1, CT2 are the marginal sums of the rows and columns, respectively.

数据挖掘研究院

The marginal sums and T, the total frequency of the table, are used to calculate expected cell values in step 3 of the test for independence.

数据挖掘研究院

Test for Independence 数据挖掘研究院

Given a contingency table of dimensions r by c (rows x columns): 数据挖掘实验室

  1. Calculate and fix the sizes of the marginal sums.

  2. Calculate the total frequency, T, using the marginal sums. 数据挖掘实验室

  3. Calculate the expected frequencies for each cell. 数据挖掘研究院

    The general formula for obtaining the expected frequency of any cell xij, 1ir, 1jc in a contingency table is given by:

    数据挖掘研究院

    where RiT and CTj are the row total for ith row and the column total for jth column. 数据挖掘研究院

  4. Select the test to be used to calculate based on the highest expected frequency, m:

    if then use
    m 10 Chi-Square Test
    5 m 10 Yates′ Correction for Continuity
    m 5 Fisher′s Exact Test

     

    数据挖掘研究院

  5. Calculate using the chosen test.

  6. Calculate the degrees of freedom.

    数据挖掘研究院

    df = (r - 1)(c - 1) 数据挖掘研究院

  7. Use a chi-square table with and df to determine if the conclusions are independent from the antecedent at the selected level of significance, .
    • Assume = 0.05 unless otherwise stated. 数据挖掘研究院

    • If
      • Reject the null hypothesis of independence and accept the alternate hypothesis of dependence.
        • We keep the antecedents because the conclusions are dependent upon them.
    • If
      • Accept the null hypothesis of independence.
        • We discard the antecedents because the conclusions are independent from them.

Chi-Square Formulae 数据挖掘研究院

 

数据挖掘研究院

Click here for an exercise in decision tree pruning. 数据挖掘实验室

  数据挖掘实验室

Decision Lists

数据挖掘研究院

A decision list is a set of if-then statements.

数据挖掘研究院

It is searched sequentially for an appropriate if-then statement to be used as a rule. 数据挖掘研究院

  数据挖掘实验室

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?