Technical Note for Statisticians
Some technical aspects of CART® analyses are of special interest to statisticians; we list the most important ones here.
CART is nonparametric.
As a nonparametric procedure, CART does not require a functional form to be specified.
CART does not require variables to be selected in advance.
CART uses a stepwise method to determine splitting rules. However, unlike parametric stepwise procedures, CART trees can be shown to be statistically sound. Thus, no advance selection of variables is necessary, although certain variables, such as ID numbers and reformulations of the dependent variable, should be excluded from the analysis. In addition, CART's performance can be considerably improved by judicious selection and creation of predictor variables.
Results are invariant with respect to monotone transformations of the independent variables.
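There is no need to experiment with transformations of the independent variables, such as logarithms, square roots or squares. A split on a single variable depends only on the rank order of its values, and a monotone transformation preserves that order. Creating such variables therefore will not affect the trees CART produces unless linear combination splits are used.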
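The invariance can be checked on a toy example. The following is a minimal sketch, not CART's actual implementation: `best_split` is a hypothetical helper that searches single-variable thresholds by least squares, and the data are made up for illustration. The threshold value changes under a log transformation, but the cases sent to each side do not.

```python
import math

def best_split(xs, ys):
    """Return (threshold, left_cases) for the least-squares best single split.

    Hypothetical helper for illustration; thresholds are midpoints
    between adjacent distinct values of the predictor.
    """
    def sse(vals):
        # Sum of squared errors around the mean of the values.
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # no threshold fits between tied values
        left, right = order[:k], order[k:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]

# Made-up data: the response jumps between x = 4 and x = 8.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]

thr_raw, left_raw = best_split(x, y)
thr_log, left_log = best_split([math.log(v) for v in x], y)

# The threshold is expressed on a different scale,
# but the same cases fall on the left side of the split.
print(thr_raw, thr_log, left_raw == left_log)
```

Because only the ordering of `x` enters the search, any strictly increasing transformation would give the same partition in this sketch.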
CART can handle data sets with a complex structure.
Unlike parametric models, which are intended to uncover a single dominant structure in data, CART is designed to work with data that might have multiple structures. In fact, provided there are enough observations, the more complex the data and the more variables available, the better CART tends to perform relative to alternative methods.
CART is extremely robust to the effects of outliers.
Outliers among the independent variables generally do not affect CART because splits usually occur at non-outlier values. Outliers in the dependent variable are often separated into nodes where they no longer affect the rest of the tree. Also, in regression models, least absolute deviations can be used instead of least squares, diminishing the effect of outliers in the dependent variable.
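The effect of the loss function on a node's predicted value can be illustrated with a small sketch. It relies on two standard facts: the least squares prediction for a node is the mean of its responses, and the least absolute deviations prediction is the median. The response values below are made up for illustration.

```python
from statistics import mean, median

# Made-up responses falling into a single terminal node.
node_y = [10.0, 11.0, 9.0, 10.5, 9.5]
# The same node with one extreme response added.
node_y_outlier = node_y + [1000.0]

# Least squares: the node prediction is the mean.
ls_clean, ls_dirty = mean(node_y), mean(node_y_outlier)

# Least absolute deviations: the node prediction is the median.
lad_clean, lad_dirty = median(node_y), median(node_y_outlier)

print("least squares:", ls_clean, "->", ls_dirty)
print("least absolute deviations:", lad_clean, "->", lad_dirty)
```

The single outlier moves the mean far from the bulk of the data, while the median barely shifts.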
CART can use any combination of categorical and continuous variables.
CART does not require any preprocessing of the data. In particular, and in contrast to CHAID, continuous variables do not have to be recoded into discrete versions prior to analysis.
CART can use linear combinations of variables to determine splits.
While the CART default is to split nodes on single variables, it will optionally use linear combinations of non-categorical variables.
CART can adjust for samples stratified on a categorical dependent variable.
If a sample has substantial over-representation of certain classes, CART can adjust for this by automatic reweighting.
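One way to picture such reweighting is to give every case in a class a weight proportional to the class's assumed population share divided by its share of the sample. The sketch below assumes this scheme for illustration and is not necessarily the exact computation CART performs internally; `class_weights`, the labels and the prior values are all hypothetical.

```python
from collections import Counter

def class_weights(labels, priors):
    """Per-class case weights that make weighted class shares match the priors.

    Hypothetical helper: weight = prior share / sample share.
    """
    counts = Counter(labels)
    n = len(labels)
    return {c: priors[c] * n / counts[c] for c in counts}

# Made-up sample stratified 50/50 on the dependent variable.
labels = ["bad"] * 50 + ["good"] * 50
# Assumed population shares (illustrative values only).
priors = {"bad": 0.1, "good": 0.9}

w = class_weights(labels, priors)

# After weighting, the "bad" class carries its assumed population share.
total = sum(w[c] for c in labels)
weighted_share_bad = w["bad"] * labels.count("bad") / total
print(w, weighted_share_bad)
```

In this sketch the over-represented class is weighted down and the under-represented class is weighted up, so that node statistics computed from the weighted cases reflect the assumed population rather than the sampling design.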
CART can discover context dependence and interactions.
CART can use the same variable in different parts of the tree, uncovering the context dependency of the effects of certain variables.
CART can process cases with missing values for predictors.
For each split in the tree, CART develops alternative splits (surrogates), which can be used to classify a case when the primary splitting variable is missing. Thus, CART can be used effectively with data sets that have a large fraction of missing values.
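The idea behind a surrogate can be sketched as follows: among splits on another variable, choose the one whose left/right assignment agrees most often with the primary split, then use it to route cases whose primary variable is missing. This is a simplified illustration under stated assumptions, not CART's actual procedure; it ignores reversed-direction splits and any ranking of several surrogates. The helpers `best_surrogate` and `goes_left` and the data are hypothetical.

```python
def best_surrogate(primary_left, x2):
    """Find the threshold on x2 that best mimics the primary split.

    primary_left: list of bools, True if the primary split sends the case left.
    x2: values of the candidate surrogate variable (none missing in this sketch).
    Returns (threshold, agreement_rate).
    """
    values = sorted(set(x2))
    best_thr, best_agree = None, -1
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        # Count cases that the candidate split sends the same way as the primary.
        agree = sum((v <= thr) == left for v, left in zip(x2, primary_left))
        if agree > best_agree:
            best_thr, best_agree = thr, agree
    return best_thr, best_agree / len(x2)

def goes_left(x1_value, x2_value, primary_thr, surrogate_thr):
    """Route a case, falling back to the surrogate when x1 is missing (None)."""
    if x1_value is not None:
        return x1_value <= primary_thr
    return x2_value <= surrogate_thr

# Made-up data: the primary split is x1 <= 5.0.
x1 = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
x2 = [10.0, 12.0, 11.0, 20.0, 19.0, 9.0]
primary_left = [v <= 5.0 for v in x1]

thr, agreement = best_surrogate(primary_left, x2)
print(thr, agreement)

# A new case with x1 missing is routed by the surrogate split on x2.
print(goes_left(None, 11.0, 5.0, thr), goes_left(None, 20.0, 5.0, thr))
```

Here the surrogate reproduces the primary split for five of the six cases, so a case with a missing `x1` is usually sent to the same side it would have reached had `x1` been observed.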

