COMBINE Command Implementing Bagging and ARCing


The COMBINE command allows one to choose from several ways of combining separate CART trees into a single predictive engine. The trees are combined by either averaging their outputs for regression or by using an unweighted plurality voting scheme for classification. The current version of CART offers two combination methods: Bootstrap aggregation and ARCing. Each generates a set of trees by resampling (with replacement) from the original training data.
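The two combination rules described above can be sketched in a few lines. This is only an illustration of the idea, not CART's internal code; the function names `combine_regression` and `combine_classification` are hypothetical, and the tie-breaking rule is an arbitrary choice made for the sketch.

```python
from collections import Counter
from statistics import mean

def combine_regression(predictions):
    """Average the numeric outputs of the individual trees for one case."""
    return mean(predictions)

def combine_classification(votes):
    """Unweighted plurality vote: the class named by the most trees wins.

    Ties go to the class that was seen first, an arbitrary choice
    made for this sketch.
    """
    return Counter(votes).most_common(1)[0][0]

# Four regression trees and six classification trees scoring one case.
regression_output = combine_regression([10.0, 12.0, 11.0, 15.0])
class_output = combine_classification(["A", "B", "A", "C", "A", "B"])
print(regression_output, class_output)
```

Note that class A wins the vote with three of six trees: a plurality is enough, and no class needs an outright majority.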

When training data are resampled with replacement, the effect is to create a slightly "perturbed" version of the original data. Some original training cases are excluded from the new training sample, and other cases are included multiple times. Typically, about 37% of the original cases are not included at all in the resample; the sample is brought up to full size by including other cases more than once. A handful of cases will be replicated 4, 5, 6, or even 7 times, although the most common replication counts are 1 and 2. The effect of this resampling is to randomly alter the weights that cases will have in any analysis, thus slightly shifting the results obtained from tree growing or any other type of statistical analysis.
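The resampling behavior described above is easy to check by simulation. The following is a minimal sketch using only the Python standard library; the helper name `bootstrap_counts`, the sample size, and the seed are assumptions chosen for illustration and have nothing to do with CART's own sampler.

```python
import random
from collections import Counter

def bootstrap_counts(n_cases, rng):
    """Draw n_cases indices with replacement; count how often each is drawn."""
    draws = [rng.randrange(n_cases) for _ in range(n_cases)]
    return Counter(draws)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_cases = 10_000
counts = bootstrap_counts(n_cases, rng)

# Cases that were never drawn are excluded from the resample.
excluded_fraction = (n_cases - len(counts)) / n_cases

# Tabulate replication counts: how many cases appear once, twice, ...
replication_table = Counter(counts.values())

print(f"excluded fraction: {excluded_fraction:.3f}")
for k in sorted(replication_table):
    print(f"drawn {k} time(s): {replication_table[k]} cases")
```

With a sample of this size the excluded fraction should land close to 37%, and the table should show replication counts of 1 and 2 dominating, with only a few cases drawn five or more times.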

Bootstrap resampling was originally developed to help analysts determine how much their results might have changed if another random sample had been used instead, or how different results might be when a model is applied to new data. The theory of the bootstrap was developed by Stanford's Brad Efron in 1979 and has been studied extensively since then. For data mining applications, Leo Breiman applied the bootstrap in a novel way: the bootstrap is used to generate many versions of the data set (replications), a separate analysis is conducted for each replication, and then the results are averaged. If the separate analyses differ considerably from each other (suggesting tree instability), the averaging will stabilize the results and yield much more accurate predictions. If the separate analyses are very similar to each other, the trees exhibit stability and the averaging will neither harm nor improve the predictions. Thus, the more unstable the trees, the greater the benefits of averaging.
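To make the replicate, analyze, and average cycle concrete, here is a small sketch that bags one-split regression trees ("stumps") on synthetic data. It is a toy stand-in for full CART trees; the function names, the synthetic step-function data, and the choice of 25 replications are all assumptions made for illustration.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error.

    Returns (threshold, left_mean, right_mean).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for pos in range(1, len(order)):
        lo, hi = order[pos - 1], order[pos]
        if xs[lo] == xs[hi]:
            continue  # cannot split between identical x values
        left = [ys[i] for i in order[:pos]]
        right = [ys[i] for i in order[pos:]]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, (xs[lo] + xs[hi]) / 2, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def bag_stumps(xs, ys, n_trees, rng):
    """Grow one stump per bootstrap replication of the training data."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Average the predictions of all stumps for one input value."""
    return mean([predict_stump(s, x) for s in stumps])

# Synthetic data: a noisy step from 0 to 1 just above x = 3.0.
rng = random.Random(1)
xs = [i / 10 for i in range(60)]
ys = [(1.0 if x > 3.0 else 0.0) + rng.gauss(0, 0.3) for x in xs]

stumps = bag_stumps(xs, ys, n_trees=25, rng=rng)
single_predictions = [predict_stump(s, 3.0) for s in stumps]
bagged_prediction = predict_bagged(stumps, 3.0)

print("split points:", sorted(round(s[0], 2) for s in stumps))
print("range of single-stump predictions at x=3.0:",
      min(single_predictions), max(single_predictions))
print("bagged prediction at x=3.0:", bagged_prediction)
```

Printing the split points shows how much the individual stumps move from one replication to the next; the bagged prediction always lies within the range spanned by the individual predictions, which is the stabilizing effect the paragraph describes.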

Note that COMBINE generates a "committee of experts" rather than a single optimal tree. Because a single tree is thus not displayed, there is no simple way to explain the underlying rationale driving the COMBINE predictions. In this sense, combined trees are somewhat akin to the black box of a neural net, although the trees are built much faster.

Two different resampling technologies are currently available within CART:
  1. Bootstrap Aggregation (or bagging) in which each new resample is drawn in the identical way, and
  2. ARCing (Adaptive Resampling and Combining), or ADAPTIVE resampling, in which the way a sample is drawn for the next tree depends on the performance of prior trees.
Freund and Schapire (1996) first introduced ARCing (a.k.a. boosting); Breiman (1996) introduced bagging and demonstrated that it performs as well as or better than boosting. In general, we recommend bagging rather than ARCing, as bagging is more robust to errors in the dependent variable and is also much faster. Nevertheless, ARCing is capable of yielding some remarkable reductions in predictive error.

One final caution on combining via bagging or ARCing: the increase in accuracy is often accomplished for the class one has least interest in. For example, in a binary response model in which response is relatively rare, bagging and ARCing may improve the non-response classification accuracy while slightly reducing the response classification accuracy relative to a standard CART tree. You will probably need to adjust priors to induce the most useful improvements.


Using COMBINE


To invoke bagging, simply issue the COMBINE command, as in

COMBINE

To invoke ARCing, enter

COMBINE ARC=YES, POWER=4

POWER sets the weight the resampling puts on selecting cases that have been previously misclassified. The higher the power, the greater the bias against selecting cases that were previously classified correctly. Breiman has found that POWER=4 works quite well; POWER=1 or 2 gives results virtually identical to bagging.
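The effect of POWER can be illustrated with the selection rule from Breiman's published arc-x4 procedure, in which a case misclassified by m earlier trees is drawn with probability proportional to 1 + m^4. Treating CART's POWER option as the exponent in that rule is an assumption made here for illustration; this page does not state the exact formula CART uses. The function name below is hypothetical.

```python
def arc_selection_probabilities(misclassification_counts, power=4):
    """Selection probability for each case under an arc-x4 style rule.

    Case i gets weight 1 + m_i ** power, where m_i is the number of earlier
    trees that misclassified it; weights are then normalized to sum to 1.
    (Assumed correspondence with CART's POWER option, not confirmed.)
    """
    weights = [1 + m ** power for m in misclassification_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Five cases after three trees: cases 0 and 1 were never misclassified,
# case 4 was misclassified by all three trees.
counts = [0, 0, 1, 2, 3]
probs_power4 = arc_selection_probabilities(counts, power=4)
probs_power1 = arc_selection_probabilities(counts, power=1)

for m, p4, p1 in zip(counts, probs_power4, probs_power1):
    print(f"misclassified {m}x: POWER=4 -> {p4:.3f}, POWER=1 -> {p1:.3f}")
```

Under this rule the weights for POWER=4 are 1, 1, 2, 17, and 82, so the hardest case takes about 80% of the selection probability, whereas with POWER=1 the weights are 1, 1, 2, 3, and 4 and the draw stays much closer to uniform.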

Setting POWER to values higher than 4 could make it difficult to locate a sample large enough to fill the training sample if only a small fraction of the data are misclassified. Also, as Dietterich (1998) has reported, if the dependent variable is recorded with error, then ARCing will progressively focus new trees on the bad data and yield poor predictive models.

After selecting bagging or ARCing, the next step is to select the number of trees to be grown. Bagging typically shows good results once 25 trees have been grown, but ARCing may require 100 to 250 trees. The number of trees is set with the CYCLES option, as in

COMBINE ARC=NO, CYCLES=25

Experiment with a modest number of trees to see how the procedure is working and, if it looks promising, launch a CART run with a full complement of 50 or more trees.

When growing a single tree, pruning is not just optional, it is vital to obtaining a reliable tree. By definition, a CART tree is first overgrown (i.e., overfit) and then the overfit portions are pruned away with the help of a test set. When combining trees, Breiman has shown that the trees need not be pruned! Whatever overfitting there might be is averaged away when the combining takes place. You might wish to verify this for yourself.

The options controlling pruning are

[EXPLORE | TEST=SAMPLE | LEARN ]

where EXPLORE means do not prune, TEST=SAMPLE means draw a new sample to use for testing and pruning, and LEARN means use the entire training data sample (i.e., the sample from which the resamples are drawn) for testing.

Note that when a bootstrap or ARCed resample is drawn via sampling with replacement, only about 63% of the data will be selected into the sample and 37% will be excluded. The sampling with replacement will result in some cases being drawn into the sample multiple times; it is this perturbation of the data that allows the combining method to work. The portion of the data not drawn into the sample in a replication is known as the "out-of-bag" data and can legitimately be used for a variety of testing purposes. When the entire set of LEARN data is used to prune a tree, a mixture of "in-bag" and "out-of-bag" data is used, and the mixture is in fact optimal for pruning.
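The 63% and 37% figures follow from a short calculation. For an equal-probability bootstrap draw of n cases from n originals (the bagging case; ARCed resamples use unequal probabilities, so the split there is only approximate), each draw misses a given case with probability 1 - 1/n:

```latex
\[
P(\text{case } i \text{ is never drawn})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\longrightarrow\; e^{-1} \approx 0.368
  \quad (n \to \infty),
\]
\[
P(\text{case } i \text{ is drawn at least once}) \approx 1 - 0.368 = 0.632 .
\]
% The number of times a given case is drawn is Binomial(n, 1/n),
% which is approximately Poisson with mean 1 for large n:
\[
P(\text{case } i \text{ is drawn exactly } k \text{ times})
  \approx \frac{e^{-1}}{k!},
  \qquad k = 0, 1, 2, \dots
\]
```

The Poisson approximation gives roughly 36.8% of cases drawn once, 18.4% drawn twice, 6.1% drawn three times, and 1.5% drawn four times, which is why heavily replicated cases are rare.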

Although you may prune the trees, the recommended settings are

COMBINE EXPLORE

coupled with a judicious setting of the complexity parameter so that trees are not grown too large on huge databases.

Finally, it is highly advisable to specify a SETASIDE data set. This data set is a genuine holdout sample and is used to compare the performance of a standard CART tree, the combined set of trees, and any other classification or regression scheme you care to consider. The SETASIDE data can be selected randomly from the input file, be identified via an indicator variable, or reside in a separate file. The options are
  • SETASIDE=PROP=p; where p is a number between 0 and 1,
  • SETASIDE=SEPVAR=variable; where variable is equal to 1 for setaside status and 0 otherwise, or
  • SETASIDE=FILE=filename; where filename is a separate dataset accessible by CART.
A sample command for bagging might then be:

COMBINE EXPLORE, CYCLES=50, SETASIDE=FILE="HOLDOUT.SYS"

while for ARCing a command might be:

COMBINE EXPLORE, CYCLES=50, ARC=YES, POWER=4, SETASIDE=FILE="HOLDOUT.SYS"

Other options for the COMBINE command can be used to save individual tree files, save individual samples, produce detailed output on every tree grown, and control details of the ARC resampling process (see Appendix below).


ARCing Fine Points

Combining methods require the growth of multiple trees, all grown under slightly different conditions. Differences can occur in the control settings for the tree, such as when priors are systematically varied, or in the data being used for tree training. In bagging, the data are perturbed by resampling with replacement. The resampling induces a different set of weights on the training data: bootstrap resampling always leaves some of the data out, reducing their weight to zero, and incorporates other data multiple times, increasing their weight. Bootstrap resampling can always be conducted without fail. The same cannot be said for an ARC resample, however.

In ARCing, the probability with which a case is selected for the next training set is not constant and is not equal for all cases in the original learn data set; instead, the probability of selection increases with the frequency with which a case has been misclassified in previous trees. Cases that are very difficult to classify receive an increasing probability of selection, while cases that are classified correctly receive declining weights from trial to trial. As the probability of selection becomes more skewed in favor of the difficult-to-classify cases, the probability of selection for the typical case quickly declines toward zero and the process of sample building takes an increasingly long time.

One of the ARCing options is a setting controlling how hard the ARCer should try to build a sample. In many runs the ARC process of resampling will simply bog down and the ARCer will automatically reset the probabilities to their equal starting values and continue generating additional trees.
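One way to picture why sample building can bog down is a rejection sampler with a fixed budget of attempts. The sketch below is purely hypothetical: this page does not describe how CART actually draws ARC samples or how it decides to give up, and the function name, the `max_tries` parameter, and the example weights are all assumptions made for illustration.

```python
import random

def draw_arc_sample(weights, sample_size, max_tries, rng):
    """Build a resample by rejection sampling within a budget of proposals.

    A case is proposed uniformly at random and accepted with probability
    weight / max(weights). If the budget runs out before the sample is
    full, the weights are reset to equal values and the sample is redrawn
    as an ordinary bootstrap (hypothetical fallback for this sketch).
    Returns (sample_indices, was_reset).
    """
    n = len(weights)
    top = max(weights)
    sample, tries = [], 0
    while len(sample) < sample_size and tries < max_tries:
        tries += 1
        i = rng.randrange(n)
        if rng.random() < weights[i] / top:
            sample.append(i)
    was_reset = len(sample) < sample_size
    if was_reset:
        # Equal probabilities again: a plain resample with replacement.
        sample = [rng.randrange(n) for _ in range(sample_size)]
    return sample, was_reset

rng = random.Random(7)
n = 1000

# Early in a run: all cases equally weighted, every proposal is accepted.
equal_sample, equal_reset = draw_arc_sample([1.0] * n, n, 20_000, rng)

# Late in a run: 10 hard cases carry weight 1 + 10**4, the rest weight 1,
# so only about 1% of proposals are accepted and the budget is exhausted.
skewed_weights = [1.0 + 10 ** 4] * 10 + [1.0] * (n - 10)
skewed_sample, skewed_reset = draw_arc_sample(skewed_weights, n, 20_000, rng)

print("equal weights reset:", equal_reset)
print("skewed weights reset:", skewed_reset)
```

In this sketch the budget plays the role of a "how hard to try" setting: a larger budget tolerates more skew before falling back to equal probabilities, at the cost of longer sample-building time.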

For an overview of Bootstrapping and ARCing methodology, also see Committee of Experts.
© Copyright 2003-2004 Salford Systems