RSS
热门关键字:  数据挖掘  数据仓库  商业智能  人工智能  搜索引擎
当前位置 :| 首页>人工智能>机器学习>

Boosting算法及其代码实现

来源: 作者:互联网作品 时间:2007-04-04 点击:

一般的机器学习方法可能就是监督学习方法:训练数据由输入和期望的输出组成,然後对非训练数据进行预测输出,也就是找出输入x与输出y之间的函数关系F: y = F(x)。根据输出的精确特性又可以分为分类和回归。 数据挖掘研究院

Boosting 是个非常强大的学习方法, 它也是一个监督的分类学习方法。它组合许多“弱”分类器来产生一个强大的分类器组。 A weak classifier is only required to be better than chance, and thus can be very simple and computationally inexpensive. Many of them smartly combined, however, result in a strong classifier, which often outperforms most 'monolithic' strong classifiers such as SVMs and Neural Networks.

数据挖掘研究院

Decision trees are the most popular weak classifiers used in boosting schemes. Often the simplest decision trees with only a single split node per tree (called stumps) are sufficient.

Learning of boosted model is based on N training examples {(xi,yi)}1N with xi ∈ RK and yi ∈ {−1, +1}. xi is a K-component vector. Each component encodes a feature relevant for the learning task at hand. The desired two-class output is encoded as −1 and +1. 数据挖掘研究院

Different variants of boosting are known such as Discrete Adaboost, Real AdaBoost, LogitBoost, and Gentle AdaBoost [FHT98]. All of them are very similar in their overall structure. Therefore, we will look only at the standard two-class Discrete AdaBoost algorithm as shown in the box below. Each sample is initially assigned the same weight (step 2). Next a weak classifier fm(x) is trained on the weighted training data (step 3a). Its weighted training error and scaling factor cm is computed (step 3b). The weights are increased for training samples, which have been misclassified (step 3c). All weights are then normalized, and the process of finding the next week classifier continues for another M-1 times. The final classifier F(x) is the sign of the weighted sum over the individual weak classifiers (step 4).

  1. Given N examples (xi,yi) with x_i \in R^k, y_i \in {-1, +1}.
  2. Start with weights w_i = 1/N, i = 1,\cdots,N.
  3. Repeat for m = 1,2,…,M:
    1. Fit the classifier f_m(x) \in {-1,1}, using weights wi on the training data.
    2. Compute err_m = E_w [1(y \neq f_m(x))], c_m = \log((1 - err_m)/err_m).
    3. Set w_i \leftarrow w_i \exp[c_m 1(y_i \neq f_m(x_i))], i = 1,2,\cdots,N, and renormalize so that Σiwi = 1.
    4. Output the classifier sign[\Sigma_{m = 1}^M c_m f_m(x)].

Two-class Discrete AdaBoost Algorithm: Training (steps 1 to 3) and Evaluation (step 4)

NOTE. As well as the classical boosting methods, the current implementation supports 2-class classifiers only. For M>2 classes there is AdaBoost.MH algorithm, described in [FHT98], that reduces the problem to the 2-class problem, yet with much larger training set.

In order to reduce computation time for boosted models without substantial loosing of the accuracy, the influence trimming technique may be employed. As the training algorithm proceeds and the number of trees in the ensemble is increased, a larger number of the training samples are classified correctly and with increasing confidence, thereby those samples receive smaller weights on the subsequent iterations. Examples with very low relative weight have small impact on training of the weak classifier. Thus such examples may be excluded during the weak classifier training without having much effect on the induced classifier. This process is controlled via the weight_trim_rate parameter. Only examples with the summary fraction weight_trim_rate of the total weight mass are used in the weak classifier training. Note that the weights for all training examples are recomputed at each training iteration. Examples deleted at a particular iteration may be used again for learning some of the weak classifiers further [FHT98].

数据挖掘研究院

[HTF01] Hastie, T., Tibshirani, R., Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. 2001. 数据挖掘研究院

[FHT98] Friedman, J. H., Hastie, T. and Tibshirani, R. Additive Logistic Regression: a Statistical View of Boosting. Technical Report, Dept. of Statistics, Stanford University, 1998.

CvBoostParams

Boosting training parameters 数据挖掘研究院

struct CvBoostParams : public CvDTreeParams
{
    int boost_type;
    int weak_count;
    int split_criteria;
    double weight_trim_rate;

    CvBoostParams();
    CvBoostParams( int boost_type, int weak_count, double weight_trim_rate,
                   int max_depth, bool use_surrogates, const float* priors );
};
 

数据挖掘研究院

boost_type
Boosting type, one of the following
CvBoost::DISCRETE - Discrete AdaBoost
CvBoost::REAL - Real AdaBoost
CvBoost::LOGIT - LogitBoost
CvBoost::GENTLE - Gentle AdaBoost
Gentle AdaBoost and Real AdaBoost are often the preferrable choices.
weak_count
The number of weak classifiers to build.
split_criteria
Splitting criteria, used to choose optimal splits during a weak tree ;construction:
CvBoost::DEFAULT - Use the default criteria for the particular boosting ;method, see below.
CvBoost::GINI - Use Gini index. This is default option for Real AdaBoost; may be also used for Discrete AdaBoost.
CvBoost::MISCLASS - Use misclassification rate. This is default option for Discrete AdaBoost; may be also used for Real AdaBoost.
CvBoost::SQERR - Use least squares criteria. This is default and the only option for LogitBoost and Gentle AdaBoost.
weight_trim_rate
The weight trimming ratio, within 0..1. See the discussion of it above. If the parameter is ≤0 or >1, the trimming is not used, all the samples are used at each iteration. The default value is 0.95.

The structure is derived from CvDTreeParams, but not all of the decision tree parameters are supported. In particular, cross-validation is not supported. 数据挖掘实验室

CvBoostTree

Weak tree classifier 数据挖掘研究院

class CvBoostTree: public CvDTree
{
public:
    CvBoostTree();
    virtual ~CvBoostTree();

    virtual bool train( CvDTreeTrainData* _train_data,
                        const CvMat* subsample_idx, CvBoost* ensemble );
    virtual void scale( double s );
    virtual void read( CvFileStorage* fs, CvFileNode* node,
                       CvBoost* ensemble, CvDTreeTrainData* _data );
    virtual void clear();

protected:
    ...
    CvBoost* ensemble;
};
  

The weak classifier, a component of boosted tree classifier CvBoost, is a derivative of CvDTree. Normally, there is no need to use the weak classifiers directly, however they can be accessed as elements of sequence CvBoost::weak, retrieved by CvBoost::get_weak_predictors.

Note, that in case of LogitBoost and Gentle AdaBoost each weak predictor is a regression tree, rather than a classification tree. Even in case of Discrete AdaBoost and Real AdaBoost the CvBoostTree::predict return value (CvDTreeNode::value) is not the output class label; a negative value "votes" for class #0, a positive - for class #1. And the votes are weighted. The weight of each individual tree may be increased or decreased using method CvBoostTree::scale. 数据挖掘研究院

CvBoost

Boosted tree classifier

数据挖掘研究院

class CvBoost : public CvStatModel
{
public:
    // Boosting type
    enum { DISCRETE=0, REAL=1, LOGIT=2, GENTLE=3 };

    // Splitting criteria
    enum { DEFAULT=0, GINI=1, MISCLASS=3, SQERR=4 };

    CvBoost();
    virtual ~CvBoost();

    CvBoost( const CvMat* _train_data, int _tflag,
             const CvMat* _responses, const CvMat* _var_idx=0,
             const CvMat* _sample_idx=0, const CvMat* _var_type=0,
             const CvMat* _missing_mask=0,
             CvBoostParams params=CvBoostParams() );

    virtual bool train( const CvMat* _train_data, int _tflag,
             const CvMat* _responses, const CvMat* _var_idx=0,
             const CvMat* _sample_idx=0, const CvMat* _var_type=0,
             const CvMat* _missing_mask=0,
             CvBoostParams params=CvBoostParams(),
             bool update=false );

    virtual float predict( const CvMat* _sample, const CvMat* _missing=0,
                           CvMat* weak_responses=0, CvSlice slice=CV_WHOLE_SEQ,
                           bool raw_mode=false ) const;

    virtual void prune( CvSlice slice );

    virtual void clear();

    virtual void write( CvFileStorage* storage, const char* name );
    virtual void read( CvFileStorage* storage, CvFileNode* node );

    CvSeq* get_weak_predictors();
    const CvBoostParams& get_params() const;
    ...

protected:
    virtual bool set_params( const CvBoostParams& _params );
    virtual void update_weights( CvBoostTree* tree );
    virtual void trim_weights();
    virtual void write_params( CvFileStorage* fs );
    virtual void read_params( CvFileStorage* fs, CvFileNode* node );

    CvDTreeTrainData* data;
    CvBoostParams params;
    CvSeq* weak;
    ...
};
 

数据挖掘研究院

CvBoost::train

Trains boosted tree classifier 数据挖掘实验室

bool CvBoost::train( const CvMat* _train_data, int _tflag,
             const CvMat* _responses, const CvMat* _var_idx=0,
             const CvMat* _sample_idx=0, const CvMat* _var_type=0,
             const CvMat* _missing_mask=0,
             CvBoostParams params=CvBoostParams(),
             bool update=false );
  

The train method follows the common template, the last parameter update specifies whether the classifier needs to be updated (i.e. the new weak tree classifiers added to the existing ensemble), or the classifier needs to be rebuilt from scratch. The responses must be categorical, i.e. boosted trees can not be built for regression, and there should be 2 classes.

CvBoost::predict

Predicts response for the input sample 数据挖掘研究院

float CvBoost::predict( const CvMat* sample, const CvMat* missing=0,
                        CvMat* weak_responses=0, CvSlice slice=CV_WHOLE_SEQ,
                        bool raw_mode=false ) const;
 

数据挖掘研究院

sample
The input sample.
missing
The optional mask of missing measurements. To handle missing measurements, the weak classifiers must include surrogate splits (see CvDTreeParams::use_surrogates).
weak_reponses
The optional output parameter, a floating-point vector, of responses from each individual weak classifier. The number of elements in the vector must be equal to the slice length.
slice
The continuous subset of the sequence of weak classifiers to be used for prediction. By default, all the weak classifiers are used.
raw_mode
It has the same meaning as in CvDTree::predict. Normally, it should be set to false.

The method CvBoost::predict runs the sample through the trees in the ensemble and returns the output class label based on the weighted voting.

CvBoost::prune

Removes specified weak classifiers 数据挖掘研究院

void CvBoost::prune( CvSlice slice );
 

数据挖掘研究院

The method removes the specified weak classifiers from the sequence. Note that this method should not be confused with the prunning of individual decision trees, which is currently not supported.

数据挖掘实验室

CvBoost::get_weak_predictors

Returns the sequence of weak tree classifiers

数据挖掘实验室

CvSeq* CvBoost::get_weak_predictors();
 数据挖掘研究院 

The method returns the sequence of weak classifiers. Each element of the sequence is a pointer to CvBoostTree class (or, probably, to some of its derivatives).

数据挖掘研究院

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?