Salford Systems logo white space
Navigation
white space
white space
white space
white space
white space
Support > White papers > Do Splitting Rules Really Matter?
Do Splitting Rules Really Matter?


Introduction

Do decision-tree splitting criteria matter? Contrary to popular opinion in data mining circles, our experience indicates that splitting criteria do matter; in fact, that difference between using the right rule and the wrong rule could add up to millions of dollars of lost opportunity.

So, why haven't the differences been noticed? The answer is simple. When data sets are small and highly-accurate trees can be generated easily, the particular splitting rule does not matter. When your golf ball is one inch from the cup, which club or even which end you use is not important, you will be able to sink the ball in one stroke. Unfortunately, previous examinations of splitting rule performance, the ones that found no differences, did not look at data-mining problems with large data sets where obtaining a good answer is genuinely difficult.

When you are trying to detect fraud, identify borrowers who will declare bankruptcy in the next 12 months, target a direct mail campaign, or tackle other real-world business problems that do not admit of 90+ percent accuracy rates (with currently available data), the splitting rule you choose could materially affect the accuracy and value of your decision tree. Further, even when different splitting rules yield similarly accurate classifiers, the differences between them may still matter. With multiple classes, you might care how the errors are distributed across classes. Between two trees with equal overall error rates, you might prefer a tree that performs better on a particular class or classes. If the purpose of a decision tree is to yield insight into a causal process or into the structure of a database, splitting rules of similar accuracy can yield trees that vary greatly in their usefulness for interpreting and understanding the data.

This paper explores the key differences between three important splitting criteria Gini, Twoing and Entropy-for three- and greater-level classification trees and suggests how to choose the right one for a particular problem type. Although we can make recommendations as to which splitting rule is best suited to which type of problem, it is good practice to always use several splitting rules and compare the results. You should experiment with several different splitting rules and should expect different results from each. As you work with different types of data and problems, you will begin to learn which splitting rules typically work best for specific problem types. Nevertheless, you should never rely on a single rule alone; experimentation is always wise.


Gini, Twoing, and Entropy

The best known rules for binary recursive partitioning are Gini, Twoing, and Entropy. Because each rule represents a different philosophy as to the purpose of the decision tree, each grows a different style of tree.


Gini Splitting Rule

Gini looks for the largest class in your database (e.g., class A) and strives to isolate it from all other classes. For example, with four Classes, A, B, C, and D, representing 40, 30, 20, and 10 percent of the data, respectively, the Gini rule would immediately attempt to pull out the Class A records into one node. Of course, such a separation may not be possible using the available data, but, if it is, the Gini opts for that split. The diagram below shows the best possible Gini split for these data.



Once the first split is made, Gini continues attempting to split the data that require further segmentation, i.e., the right child node that contains Class B, C and D. Using the same strategy, Gini attempts to pull out all the Class B records, separating them from the other classes in the node. Gini then tackles the last heterogeneous node, striving to separate Class C from Class D. If the GINI rule is successful, the final tree would contain four "pure" child nodes:



A pure decision tree such as the above is attainable only in very rare circumstances; in most real world applications, database fields that clearly partition class from class are not available. If they were, no one would ever receive an unwelcome direct mail piece and bank losses on bad debts would be zero. Therefore, we cannot expect to grow trees like the one above routinely; however, Gini will try to come as close as possible to this ideal. A hypothetical example of a more realistic decision tree grown by Gini is displayed below. While imperfect, it is still an unusually accurate tree.



So what is Gini trying to do? Gini attempts to separate classes by focusing on one class at a time. It will always favor working on the largest or, if you use costs or weights, the most "important" class in a node. While this approach might seem short sighted, Gini performance is frequently so good that you should always experiment with it to see how well it does. Gini is the default rule in CART precisely because it is so often the best splitting rule.


Twoing and Entropy Splitting Rules

The philosophy of Twoing is far different than that of Gini. Rather than initially pulling out a single class, Twoing first segments the classes into two groups, attempting to find groups that together add up to 50 percent of the data. Twoing then searches for a split to separate the two subgroups. The diagram below shows the best possible split the Twoing rule could find.



Again, this is an ideal split. It is unlikely that any real-world database would allow you to cleanly separate four important classes into two subgroups in this way. However, splits that approach this ideal might be possible, and these are the splits that Twoing seeks to find. The Entropy rule, which is very similar to Twoing in practice, strives for similar splits.


Variations on Twoing

An important variation of the Twoing rule is Power-Modified Twoing, which places an even heavier weight on splitting the data in a node into two equal-sized partitions. When the perfect splits illustrated above are available, Twoing and Power-Modified Twoing will select the same split. When only imperfect partitions are available, Power-Modified Twoing is more likely to generate a near 50-50 partition of the data than is simple Twoing.


General Rules of Thumb

The following rules of thumb are based on our experience in the telecommunications, banking, and market research arenas, and may not apply literally to other subject matters or even other data sets. Nevertheless, they represent such a consistent set of empirical findings that we expect them to continue to hold in other domains and data sets.

For a two-level dependent variable that can be predicted with a relative error of less than 0.50, the Gini splitting rule is typically best. For a two-level dependent variable that can be predicted with a relative error of 0.80 or higher, Power-Modified Twoing tends to perform best. For target variables with four to nine levels, Twoing has a good chance of being the best splitting rule. For higher-level categorical dependent variables with 10 or more levels, Twoing or Power- Modified Twoing is often considerably more accurate than Gini. Take the last rule of thumb, for example. A data-mining project concerning consumer choice from a set of 28 vehicles is a typical example of the substantial differences that can be observed across alternative splitting rules. The list of variables in this exercise was quite limited; in particular, the trees needed to be grown without reference to important drivers of choice such as income or price, so the level of accuracy attainable was expected to be low. In the first run, using CART's default criterion, the Gini rule grew an 89-node tree with a relative error of 0.919. The second run using CART®'s Twoing rule grew a tree with 50 nodes with a relative error of 0.876.

Reviewing the tree sequences below, one can see that the Twoing rule makes a better first split and continues to pull ahead of Gini on progressively larger trees. Although the second tree also has quite a high error rate in percentage of error terms, it is a dramatic improvement over the first. In this example of vehicle choice, the Twoing rule was the winner; in other data sets, the winner might be Power-Modified Twoing, Gini, or yet another rule.


CART Tree Sequence for 28 Level Target Variable Using GINI
# of Terminal
Tree Nodes
Test Set
Relative Cost
Resubstitution
Relative Cost
Complexity
Parameter
100 0.91973 +/- 0.00636 0.80182 0.00076
95 0.92016 +/- 0.00636 0.80591 0.00079
94 0.92054 +/- 0.00636 0.80678 0.00085
89** (optimal) 0.91921 +/- 0.00636 0.81140 0.00090
88 0.92348 +/- 0.00636 0.81238 0.00095
78 0.92498 +/- 0.00630 0.82236 0.00097
70 0.92259 +/- 0.00639 0.83084 0.00103
19 0.94185 +/- 0.00657 0.90447 0.00236
18 0.94393 +/- 0.00649 0.90696 0.00241
17 0.94393 +/- 0.00649 0.90970 0.00265
13 0.94903 +/- 0.00650 0.92238 0.00306
4 0.97052 +/- 0.00640 0.95574 0.00358
3 0.97227 +/- 0.00566 0.96447 0.00843
2 0.97795 +/- 0.00399 0.97884 0.01386
1 1.00000 +/- 0.00004 1.00000 0.02042



CART Tree Sequence for 28 Level Target Variable Using TWOING
# of Terminal
Tree Nodes
Test Set
Relative Cost
Resubstitution
Relative Cost
Complexity
Parameter
55 0.88054 +/- 0.00643 0.77145 0.00112
54 0.88067 +/- 0.00648 0.772064 0.00116
53 0.88067 +/- 0.00648 0.77384 0.00117
51 0.87681 +/- 0.00657 0.77632 0.00121
50** (optimal) 0.87623 +/- 0.00660 0.77764 0.00128
45 0.87697 +/- 0.00653 0.78517 0.00146
43 0.87697 +/- 0.00653 0.78819 0.00147
40 0.88095 +/- 0.00648 0.79281 0.00150
15 0.88822 +/- 0.00703 0.84970 0.00351
14 0.89282 +/- 0.00698 0.85413 0.00428
7 0.90537 +/- 0.00544 0.89082 0.00585
5 0.91961 +/- 0.00513 0.90667 0.00765
4 0.93286 +/- 0.00490 0.92311 0.01586
3 0.94789 +/- 0.00380 0.94347 0.01964
2 0.96803 +/- 0.00217 0.96833 0.02398
1 1.00000 +/- 0.00004 1.00000 0.03055



Conclusion

Which splitting rule you choose when growing a decision tree does matter; sometimes the differences between rules will be modest and at other times profound. We have found numerous examples where judicious choice of a splitting rule reduces the error rate of a classification tree by five to ten percent. But the differences go beyond accuracy. Even when the overall accuracy of two trees grown by different splitting rules is identical, the usefulness of the trees for revealing data structure can be quite different. There will also be times when the value of a decision tree is determined by how rich the best nodes are in certain values of the target variable, and the overall accuracy of the tree is irrelevant. In such circumstances, the differences between Gini and Twoing could easily be the difference between success and failure (e.g., in the case of fraud detection).

Given that there is no one best splitting rule for all problem types or all purposes, it is important that the decision-tree tool you employ offer several proven splitting rules with distinct styles and operating characteristics.


Further Reading:

Breiman, L., J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees, Pacific Grove: Wadsworth, 1984.

Breiman, L. Some Properties of Splitting Criteria, Statistics Department, University of California, Berkeley. 1992.
white space
© Copyright 2003-2004 Salford Systems - Print this page white space