
Machine Learning 1: 81-106, 1986

© 1986 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands

Induction of Decision Trees

J.R. QUINLAN (munnari!nswitgould.oz!quinlan@seismo.css.gov)
Centre for Advanced Computing Sciences, New South Wales Institute of Technology, Sydney 2007, Australia
(Received August 1, 1985)

Key words: classification, induction, decision trees, information theory, knowledge acquisition, expert systems

Abstract. The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

1. Introduction

Since artificial intelligence first achieved recognition as a discipline in the mid 1950's, machine learning has been a central research area. Two reasons can be given for this prominence. The ability to learn is a hallmark of intelligent behavior, so any attempt to understand intelligence as a phenomenon must include an understanding of learning. More concretely, learning provides a potential methodology for building high-performance systems.

Research on learning is made up of diverse subfields. At one extreme there are adaptive systems that monitor their own performance and attempt to improve it by adjusting internal parameters. This approach, characteristic of a large proportion of the early learning work, produced self-improving programs for playing games (Samuel, 1967), balancing poles (Michie, 1982), solving problems (Quinlan, 1969) and many other domains. A quite different approach sees learning as the acquisition of structured knowledge in the form of concepts (Hunt, 1962; Winston, 1975), discrimination nets (Feigenbaum and Simon, 1963), or production rules (Buchanan, 1978).

The practical importance of machine learning of this latter kind has been underlined by the advent of knowledge-based expert systems. As their name suggests, these systems are powered by knowledge that is represented explicitly rather than being implicit in algorithms. The knowledge needed to drive the pioneering expert systems was codified through protracted interaction between a domain specialist and a knowledge engineer. While the typical rate of knowledge elucidation by this method is a few rules per man day, an expert system for a complex task may require hundreds or even thousands of such rules. It is obvious that the interview approach to knowledge acquisition cannot keep pace with the burgeoning demand for expert systems; Feigenbaum (1981) terms this the 'bottleneck' problem. This perception has stimulated the investigation of machine learning methods as a means of explicating knowledge (Michie, 1983).

This paper focusses on one microcosm of machine learning and on a family of learning systems that have been used to build knowledge-based systems of a simple kind. Section 2 outlines the features of this family and introduces its members. All these systems address the same task of inducing decision trees from examples. After a more complete specification of this task, one system (ID3) is described in detail in Section 4. Sections 5 and 6 present extensions to ID3 that enable it to cope with noisy and incomplete information. A review of a central facet of the induction algorithm reveals possible improvements that are set out in Section 7. The paper concludes with two novel initiatives that give some idea of the directions in which the family may grow.

2. The TDIDT family of learning systems

Carbonell, Michalski and Mitchell (1983) identify three principal dimensions along which machine learning systems can be classified:

  • the underlying learning strategies used;
  • the representation of knowledge acquired by the system; and
  • the application domain of the system.

This paper is concerned with a family of learning systems that have strong common bonds in these dimensions.

Taking these features in reverse order, the application domain of these systems is not limited to any particular area of intellectual activity such as Chemistry or Chess; they can be applied to any such area. While they are thus general-purpose systems, the applications that they address all involve classification. The product of learning is a piece of procedural knowledge that can assign a hitherto-unseen object to one of a specified number of disjoint classes. Examples of classification tasks are:


Figure 1. The TDIDT family tree: CLS (1963) leads to ID3 (1979); ID3 to ACLS (1981) and ASSISTANT (1984); ACLS to the commercial systems Expert-Ease (1983), EX-TRAN (1984) and RuleMaster (1984).

uncommon cases that have not been encountered during the period of record-keeping. On the other hand, the objects might be a carefully culled set of tutorial examples prepared by a domain expert, each with some particular relevance to a complete and correct classification rule. The expert might take pains to avoid redundancy and to include examples of rare cases. While the family of systems will deal with collections of either kind in a satisfactory way, it should be mentioned that earlier TDIDT systems were designed with the 'historical record' approach in mind, but all systems described here are now often used with tutorial sets (Michie, 1985).

Figure 1 shows a family tree of the TDIDT systems. The patriarch of this family is Hunt's Concept Learning System framework (Hunt, Marin and Stone, 1966). CLS constructs a decision tree that attempts to minimize the cost of classifying an object. This cost has components of two types: the measurement cost of determining the value of property A exhibited by the object, and the misclassification cost of deciding that the object belongs to class J when its real class is K. CLS uses a lookahead strategy similar to minimax. At each stage, CLS explores the space of possible decision trees to a fixed depth, chooses an action to minimize cost in this limited space, then moves one level down in the tree. Depending on the depth of lookahead chosen, CLS can require a substantial amount of computation, but has been able to unearth subtle patterns in the objects shown to it.

ID3 (Quinlan, 1979, 1983a) is one of a series of programs developed from CLS in response to a challenging induction task posed by Donald Michie, viz. to decide from pattern-based features alone whether a particular chess position in the King-Rook vs King-Knight endgame is lost for the Knight's side in a fixed number of ply. A full description of ID3 appears in Section 4, so it is sufficient to note here that it embeds a tree-building method in an iterative outer shell, and abandons the cost-driven lookahead of CLS with an information-driven evaluation function.

ACLS (Paterson and Niblett, 1983) is a generalization of ID3. CLS and ID3 both require that each property used to describe objects has only values from a specified set. In addition to properties of this type, ACLS permits properties that have


unrestricted integer values. The capacity to deal with attributes of this kind has allowed ACLS to be applied to difficult tasks such as image recognition (Shepherd, 1983).

ASSISTANT (Kononenko, Bratko and Roskar, 1984) also acknowledges ID3 as its direct ancestor. It differs from ID3 in many ways, some of which are discussed in detail in later sections. ASSISTANT further generalizes on the integer-valued attributes of ACLS by permitting attributes with continuous (real) values. Rather than insisting that the classes be disjoint, ASSISTANT allows them to form a hierarchy, so that one class may be a finer division of another. ASSISTANT does not form a decision tree iteratively in the manner of ID3, but does include algorithms for choosing a 'good' training set from the objects available. ASSISTANT has been used in several medical domains with promising results.

The bottom-most three systems in the figure are commercial derivatives of ACLS. While they do not significantly advance the underlying theory, they incorporate many user-friendly innovations and utilities that expedite the task of generating and using decision trees. They all have industrial successes to their credit. Westinghouse Electric's Water Reactor Division, for example, points to a fuel-enrichment application in which the company was able to boost revenue by 'more than ten million dollars per annum' through the use of one of them.¹

3. The induction task

We now give a more precise statement of the induction task. The basis is a universe of objects that are described in terms of a collection of attributes. Each attribute measures some important feature of an object and will be limited here to taking a (usually small) set of discrete, mutually exclusive values. For example, if the objects were Saturday mornings and the classification task involved the weather, attributes might be

outlook, with values {sunny, overcast, rain}
temperature, with values {cool, mild, hot}
humidity, with values {high, normal}
windy, with values {true, false}

Taken together, the attributes provide a zeroth-order language for characterizing objects in the universe. A particular Saturday morning might be described as

outlook: overcast
temperature: cool
humidity: normal
windy: false

¹ Letter cited in the journal Expert Systems (January, 1985), p. 20.


Figure 2. A simple decision tree: the root tests outlook; the sunny branch tests humidity (high: N, normal: P), overcast leads directly to class P, and the rain branch tests windy (true: N, false: P).

the class named by the leaf. Taking the decision tree of Figure 2, this process concludes that the object which appeared as an example at the start of this section, and which is not a member of the training set, should belong to class P. Notice that only a subset of the attributes may be encountered on a particular path from the root of the decision tree to a leaf; in this case, only the outlook attribute is tested before determining the class.

If the attributes are adequate, it is always possible to construct a decision tree that correctly classifies each object in the training set, and usually there are many such correct decision trees. The essence of induction is to move beyond the training set, i.e. to construct a decision tree that correctly classifies not only objects from the training set but other (unseen) objects as well. In order to do this, the decision tree must capture some meaningful relationship between an object's class and its values of the attributes. Given a choice between two decision trees, each of which is correct over the training set, it seems sensible to prefer the simpler one on the grounds that it is more likely to capture structure inherent in the problem. The simpler tree would therefore be expected to classify correctly more objects outside the training set. The decision tree of Figure 3, for instance, is also correct for the training set of Table 1, but its greater complexity makes it suspect as an 'explanation' of the training set.²
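To make the classification procedure just described concrete, here is a minimal sketch (mine, not from the paper) that encodes the decision tree of Figure 2 as nested Python dictionaries and walks an object from the root to a leaf, one attribute test per level:

```python
# The tree of Figure 2 as nested dicts: interior nodes name the attribute
# under test, leaves are class labels ('P' or 'N').
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"attribute": "windy",
                 "branches": {"true": "N", "false": "P"}},
    },
}

def classify(obj, node):
    """Walk from the root to a leaf, testing one attribute per level."""
    while isinstance(node, dict):               # interior node: test an attribute
        node = node["branches"][obj[node["attribute"]]]
    return node                                 # leaf: the class label

# The Saturday morning described in Section 3.
example = {"outlook": "overcast", "temperature": "cool",
           "humidity": "normal", "windy": "false"}
print(classify(example, tree))                  # -> 'P', after testing only outlook
```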

4. ID3

One approach to the induction task above would be to generate all possible decision trees that correctly classify the training set and to select the simplest of them.

² The preference for simpler trees, presented here as a commonsense application of Occam's Razor, is also supported by analysis. Pearl (1978b) and Quinlan (1983a) have derived upper bounds on the expected error using different formalisms for generalizing from a set of known cases. For a training set of predetermined size, these bounds increase with the complexity of the induced generalization.


Figure 3. A complex decision tree.

The number of such trees is finite but very large, so this approach would only be feasible for small induction tasks. ID3 was designed for the other end of the spectrum, where there are many attributes and the training set contains many objects, but where a reasonably good decision tree is required without much computation. It has generally been found to construct simple decision trees, but the approach it uses cannot guarantee that better trees have not been overlooked.

The basic structure of ID3 is iterative. A subset of the training set called the window is chosen at random and a decision tree formed from it; this tree correctly classifies all objects in the window. All other objects in the training set are then classified using the tree. If the tree gives the correct answer for all these objects then it is correct for the entire training set and the process terminates. If not, a selection of the incorrectly classified objects is added to the window and the process continues. In this way, correct decision trees have been found after only a few iterations for training sets of up to thirty thousand objects described in terms of up to 50 attributes. Empirical evidence suggests that a correct decision tree is usually found more quickly by this iterative method than by forming a tree directly from the entire training set. However, O'Keefe (1983) has noted that the iterative framework cannot be guaranteed to converge on a final tree unless the window can grow to include the entire training set. This potential limitation has not yet arisen in practice.

The crux of the problem is how to form a decision tree for an arbitrary collection C of objects. If C is empty or contains only objects of one class, the simplest decision tree is just a leaf labelled with the class. Otherwise, let T be any test on an object with possible outcomes O1, O2, ..., Ow. Each object in C will give one of these outcomes for T, so T produces a partition {C1, C2, ..., Cw} of C with Ci containing those objects having outcome Oi of T.


information required for the subtree for Ci is I(pi, ni). The expected information required for the tree with A as root is then obtained as the weighted average

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)

where the weight for the ith branch is the proportion of the objects in C that belong to Ci. The information gained by branching on A is therefore

gain(A) = I(p, n) - E(A)

A good rule of thumb would seem to be to choose that attribute to branch on which gains the most information.³ ID3 examines all candidate attributes and chooses A to maximize gain(A), forms the tree as above, and then uses the same process recursively to form decision trees for the residual subsets C1, C2, ..., Cv.

To illustrate the idea, let C be the set of objects in Table 1. Of the 14 objects, 9 are of class P and 5 are of class N, so the information required for classification is

I(p, n) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940 \text{ bits}

Now consider the outlook attribute with values {sunny, overcast, rain}. Five of the 14 objects in C have the first value (sunny), two of them from class P and three from class N, so

p1 = 2    n1 = 3    I(p1, n1) = 0.971

and similarly

p2 = 4    n2 = 0    I(p2, n2) = 0
p3 = 3    n3 = 2    I(p3, n3) = 0.971

The expected information requirement after testing this attribute is therefore

E(\text{outlook}) = \frac{5}{14} I(p_1, n_1) + \frac{4}{14} I(p_2, n_2) + \frac{5}{14} I(p_3, n_3) = 0.694 \text{ bits}

³ Since I(p, n) is constant for all attributes, maximizing the gain — which is the mutual information of the attribute A and the class — is equivalent to minimizing E(A). Pearl (1978a) contains an excellent account of the rationale of information-based heuristics.


The gain of this attribute is then

gain(outlook) = 0.940 - E(outlook) = 0.246 bits

Similar analysis gives

gain(temperature) = 0.029 bits
gain(humidity) = 0.151 bits
gain(windy) = 0.048 bits
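These figures can be checked mechanically. The sketch below (not from the paper) assumes that I(p, n) is the usual entropy of the class proportions — its formal definition falls on a page not included in this excerpt — and recomputes the outlook gain from the counts given above:

```python
from math import log2

def info(p, n):
    """I(p, n): bits needed to classify a set of p class-P and n class-N objects
    (assumed here to be the standard entropy of the class proportions)."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def gain(p, n, subsets):
    """gain(A) = I(p, n) - E(A); subsets lists (pi, ni) for each value of A."""
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in subsets)
    return info(p, n) - e

# 14 training objects: 9 of class P, 5 of class N; outlook splits them
# into sunny (2, 3), overcast (4, 0) and rain (3, 2).
print(f"{info(9, 5):.3f}")                              # 0.940 bits
print(f"{gain(9, 5, [(2, 3), (4, 0), (3, 2)]):.3f}")    # 0.247 bits; the text's
                                                        # 0.246 rounds E to 0.694 first
```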

So the tree-forming method used in ID3 would choose outlook as the attribute for the root of the decision tree. The objects would then be divided into subsets according to their values of the outlook attribute and a decision tree for each subset would be induced in a similar fashion. In fact, Figure 2 shows the actual decision tree generated by ID3 from this training set.

A special case arises if C contains no objects with some particular value Aj of A, giving an empty Cj. ID3 labels such a leaf as 'null' so that it fails to classify any object arriving at that leaf. A better solution would generalize from the set C from which Cj came, and assign this leaf the more frequent class in C.

The worth of ID3's attribute-selecting heuristic can be assessed by the simplicity of the resulting decision trees, or, more to the point, by how well those trees express real relationships between class and attributes as demonstrated by the accuracy with

which they classify objects other than those in the training set (their predictive accuracy). A straightforward method of assessing this predictive accuracy is to use only part of the given set of objects as a training set, and to check the resulting decision tree on the remainder.

Several experiments of this kind have been carried out. In one domain, 1.4 million Chess positions described in terms of 49 binary-valued attributes gave rise to 715 distinct objects divided 65%:35% between the classes. This domain is relatively complex since a correct decision tree for all 715 objects contains about 150 nodes. When training sets containing 20% of these 715 objects were chosen at random, they produced decision trees that correctly classified over 84% of the unseen objects. In another version of the same domain, 39 attributes gave 551 distinct objects with a correct decision tree of similar size; training sets of 20% of these 551 objects gave decision trees of almost identical accuracy. In a simpler domain (1,987 objects with a correct decision tree of 48 nodes), randomly-selected training sets containing 20% of the objects gave decision trees that correctly classified 98% of the unseen objects. In all three cases, it is clear that the decision trees reflect useful (as opposed to random) relationships present in the data.

This discussion of ID3 is rounded off by looking at the computational requirements of the procedure. At each non-leaf node of the decision tree, the gain of each untested attribute A must be determined. This gain in turn depends on the values pi


(1) The algorithm must be able to work with inadequate attributes, because noise can cause even the most comprehensive set of attributes to appear inadequate.

(2) The algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree. In the last example above, it should refrain from increasing the complexity of the decision tree to accommodate a single noise-generated special case.

We start with the second requirement of deciding when an attribute is really relevant to classification. Let C be a collection of objects containing representatives of both classes, and let A be an attribute with random values that produces subsets {C1, C2, ..., Cv}. Unless the proportion of class P objects in each of the Ci is exactly the same as the proportion of class P objects in C itself, branching on attribute A will give an apparent information gain. It will therefore appear that testing attribute A is a sensible step, even though the values of A are random and so cannot help to classify the objects in C.

One solution to this dilemma might be to require that the information gain of any tested attribute exceeds some absolute or percentage threshold. Experiments with this approach suggest that a threshold large enough to screen out irrelevant attributes also excludes attributes that are relevant, and the performance of the tree-building procedure is degraded in the noise-free case.

An alternative method based on the chi-square test for stochastic independence has been found to be more useful. In the previous notation, suppose attribute A produces subsets {C1, C2, ..., Cv} of C, where Ci contains pi and ni objects of class P and N, respectively. If the value of A is irrelevant to the class of an object in C, the expected value p'i of pi should be

p'_i = p \cdot \frac{p_i + n_i}{p + n}

If n'i is the corresponding expected value of ni, the statistic

\sum_{i=1}^{v} \left( \frac{(p_i - p'_i)^2}{p'_i} + \frac{(n_i - n'_i)^2}{n'_i} \right)

is approximately chi-square with v−1 degrees of freedom. Provided that none of the values p'i or n'i are very small, this statistic can be used to determine the confidence with which one can reject the hypothesis that A is independent of the class of objects in C (Hogg and Craig, 1970). The tree-building procedure can then be modified to prevent testing any attribute whose irrelevance cannot be rejected with a very high (e.g. 99%) confidence level. This has been found effective in preventing over-complex trees that attempt to 'fit the noise' without affecting performance of the procedure in the noise-free case.⁴
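A minimal sketch of this screening test (my code, not Quinlan's; the critical value comes from scipy.stats, and the example counts are invented for illustration):

```python
from scipy.stats import chi2   # chi-square critical values

def relevant(subsets, confidence=0.99):
    """subsets: list of (pi, ni) counts produced by branching on attribute A.
    Returns True if independence of A and the class can be rejected at the
    given confidence, i.e. A may be tested; False means A is screened out."""
    p = sum(pi for pi, _ in subsets)
    n = sum(ni for _, ni in subsets)
    stat = 0.0
    for pi, ni in subsets:
        # expected counts under the hypothesis that A is irrelevant
        ep = p * (pi + ni) / (p + n)
        en = n * (pi + ni) / (p + n)
        stat += (pi - ep) ** 2 / ep + (ni - en) ** 2 / en
    dof = len(subsets) - 1                    # v - 1 degrees of freedom
    return stat > chi2.ppf(confidence, dof)

# an informative split vs. a near-random one (illustrative counts)
print(relevant([(40, 10), (10, 40)]))   # True: strongly class-correlated
print(relevant([(26, 24), (24, 26)]))   # False: looks like noise
```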


Turning now to the first requirement, we see that the following situation can arise: a collection C of objects may contain representatives of both classes, yet further testing of C may be ruled out, either because the attributes are inadequate and unable to distinguish among the objects in C, or because each attribute has been judged to be irrelevant to the class of objects in C. In this situation it is necessary to produce a leaf labelled with class information, but the objects in C are not all of the same class.

Two possibilities suggest themselves. The notion of class could be generalized to allow the value p/(p + n) in the interval (0,1), a class of 0.8 (say) being interpreted as 'belonging to class P with probability 0.8'. An alternative approach would be to opt for the more numerous class, i.e. to assign the leaf to class P if p > n, to class N if p < n, and to either if p = n. The first approach minimizes the sum of the squares of the error over objects in C, while the second minimizes the sum of the absolute errors over objects in C. If the aim is to minimize expected error, the second approach might be anticipated to be superior, and indeed this has been found to be the case.

Several studies have been carried out to see how this modified procedure holds up under varying levels of noise (Quinlan 1983b, 1985a). One such study is outlined here based on the earlier-mentioned task with 551 objects and 39 binary-valued attributes. In each experiment, the whole set of objects was artificially corrupted as described below and used as a training set to produce a decision tree. Each object was then corrupted anew, classified by this tree and the error rate determined. This process was repeated twenty times to give more reliable averages.

In this study, values were corrupted as follows. A noise level of n percent applied to a value meant that, with probability n percent, the true value was replaced by a value chosen at random from among the values that could have appeared.⁵

Table 2 shows the results when noise levels varying from 5% to 100% were applied to the values of the most noise-sensitive attribute, to the values of all attributes simultaneously, and to the class information. This table demonstrates the quite different forms of degradation observed. Destroying class information produces a linear increase in error so that, when all class information is noise, the resulting decision tree classifies objects entirely randomly. Noise in a single attribute does not have a dramatic effect. Noise in all attributes together, however, leads to a relatively rapid increase in error which reaches a peak and declines. The peak is somewhat inter-

⁴ ASSISTANT uses an information-based measure to perform much the same function, but no comparative results are available to date.

⁵ It might seem that the value should be replaced by an incorrect value. Consider, however, the case of a two-valued attribute corrupted with 100% noise. If the value of each object were replaced by the (only) incorrect value, the initial attribute will have been merely inverted with no loss of information.
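The corruption procedure of footnote 5 is easy to state in code. A sketch (mine, not the experimental code; note that the replacement is drawn from all values that could have appeared, so a 'corruption' can silently restore the true value):

```python
import random

def corrupt(value, domain, noise_level):
    """With probability noise_level, replace value by a uniform draw from domain.
    Per footnote 5, the draw may coincide with the original value."""
    if random.random() < noise_level:
        return random.choice(domain)
    return value

random.seed(1)  # for a reproducible illustration
corrupted = [corrupt("sunny", ["sunny", "overcast", "rain"], 0.20)
             for _ in range(10)]
print(corrupted.count("sunny"))  # most values survive a 20% noise level
```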


noise level in class information results in a 3% degradation. Comparable figures have been obtained for other induction tasks.

One interesting point emerged from other experiments in which a correct decision tree formed from an uncorrupted training set was used to classify objects whose descriptions were corrupted. This scenario corresponds to forming a classification rule under controlled and sanitized laboratory conditions, then using it to classify objects in the field. For higher noise levels, the performance of the correct decision tree on corrupted data was found to be inferior to that of an imperfect decision tree formed from data corrupted to a similar level! (This phenomenon has an explanation similar to that given above for the peak in Table 2.) The moral seems to be that it is counter-productive to eliminate noise from the attribute information in the training set if these same attributes will be subject to high noise levels when the induced decision tree is put to use.

6. Unknown attribute values

The previous section examined modifications to the tree-building process that enabled it to deal with noisy or corrupted values. This section is concerned with an allied problem that also arises in practice: unknown attribute values. To continue the previous medical diagnosis example, what should be done when the patient case histories that are to form the training set are incomplete?

One way around the problem attempts to fill in an unknown value by utilizing information provided by context. Using the previous notation, let us suppose that a collection C of objects contains one whose value of attribute A is unknown. ASSISTANT (Kononenko et al, 1984) uses a Bayesian formalism to determine the probability that the object has value Ai of A by examining the distribution of values of A in C as a function of their class. Suppose that the object in question belongs to class P. The probability that the object has value Ai for attribute A can be expressed as

\text{prob}(A = A_i \mid \text{class} = P) = \frac{\text{prob}(A = A_i \;\&\; \text{class} = P)}{\text{prob}(\text{class} = P)} = \frac{p_i}{p}

where the calculation of pi and p is restricted to those members of C whose value of A is known. Having determined the probability distribution of the unknown value over the possible values of A, this method could either choose the most likely value or divide the object into fractional objects, each with one possible value of A, weighted according to the probabilities above.

Alen Shapiro (private communication) has suggested using a decision-tree approach to determine the unknown values of an attribute. Let C' be the subset of C consisting of those objects whose value of attribute A is defined. In C', the original class (P or N) is regarded as another attribute while the value of attribute A becomes the 'class' to be determined. That is, C' is used to construct a decision tree for determining the value of attribute A from the other attributes and the class. When constructed, this decision tree can be used to 'classify' each object in C − C' and the result assigned as the unknown value of A.
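A minimal sketch of the Bayesian fill-in described above (my rendering; None marks an unknown value, and the object layout is invented for illustration):

```python
from collections import Counter

def bayesian_fill(objects, attr, target_class):
    """Estimate prob(A = Ai | class) = pi / p from the members of C whose
    value of attr is known, and return the most probable value."""
    known = [o[attr] for o in objects
             if o["class"] == target_class and o[attr] is not None]
    counts = Counter(known)                       # pi for each value Ai
    p = len(known)                                # p, restricted to known values
    value, pi = counts.most_common(1)[0]
    return value, pi / p                          # argmax and its probability

objects = [
    {"outlook": "sunny", "class": "P"},
    {"outlook": "overcast", "class": "P"},
    {"outlook": "overcast", "class": "P"},
    {"outlook": None, "class": "P"},              # the value to fill in
]
print(bayesian_fill(objects, "outlook", "P"))     # ('overcast', 2/3)
```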


Table 3. Proportion of times that an unknown attribute value is replaced by an incorrect value

Replacement method      Attribute 1    Attribute 2    Attribute 3
Bayesian                28%            27%            38%
Decision tree           19%            22%            19%
Most common value       28%            27%            40%

Although these methods for determining unknown attribute values look good on paper, they give unconvincing results even when only a single value of one attribute is unknown; as might be expected, their performance is much worse when several values of several attributes are unknown. Consider again the 551-object 39-attribute task. We may ask how well the methods perform when asked to fill in a single unknown attribute value. Table 3 shows, for each of the three most important attributes, the proportion of times each method fails to replace an unknown value by its correct value. For comparison, the table also shows the same figure for the simple strategy: always replace an unknown value of an attribute with its most common value. The Bayesian method gives results that are scarcely better than those given by the simple strategy and, while the decision-tree method uses more context and is thereby more accurate, it still gives disappointing results.

Rather than trying to guess unknown attribute values, we could treat 'unknown' as a new possible value for each attribute and deal with it in the same way as other values. This can lead to an anomalous situation, as shown by the following example. Suppose A is an attribute with values {A1, A2} and let C be a collection of objects such that

p1 = 2    p2 = 2
n1 = 2    n2 = 2

giving a value of 1 bit for E(A). Now let A' be an identical attribute except that one of the objects with value A1 of A has an unknown value of A'. A' has the values {A'1, A'2, A'3 = unknown}, so the corresponding values might be

p'1 = 1    p'2 = 2    p'3 = 1
n'1 = 2    n'2 = 2    n'3 = 0
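The continuation of this example lies outside the excerpt, but the arithmetic can be checked directly: treating 'unknown' as a third value drops the expected information E from 1 bit to about 0.84 bits, so A' would look preferable to A even though it conveys, if anything, less. A sketch reusing the entropy helper from Section 4:

```python
from math import log2

def info(p, n):
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def expected_info(subsets):
    """E(A): the weighted average of I(pi, ni) over the subsets A induces."""
    total = sum(p + n for p, n in subsets)
    return sum((p + n) / total * info(p, n) for p, n in subsets)

print(f"{expected_info([(2, 2), (2, 2)]):.3f}")          # A : 1.000 bit
print(f"{expected_info([(1, 2), (2, 2), (1, 0)]):.3f}")  # A': 0.844 bits, spuriously lower
```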


Figure 5. Error produced by unknown attribute values (error rate in percent against ignorance level from 10% to 60%).

that class with the higher value. The distribution of values over the possible classes might also be used to compute a confidence level for the classification.

Straightforward though it may be, this procedure has been found to give a very graceful degradation as the incidence of unknown values increases. Figure 5 summarizes the results of an experiment on the now-familiar task with 551 objects and 39 attributes. Various 'ignorance levels' analogous to the earlier noise levels were explored, with twenty repetitions at each level. For each run at an ignorance level of m percent, a copy of the 551 objects was made, replacing each value of every attribute by 'unknown' with m percent probability. A decision tree for these (incomplete) objects was formed as above, and then used to classify a new copy of each object corrupted in the same way. The figure shows that the degradation of performance with ignorance level is gradual. In practice, of course, an ignorance level even as high as 10% is unlikely - this would correspond to an average of one value in every ten of the object's description being unknown. Even so, the decision tree produced from such a patchy training set correctly classifies nearly ninety percent of objects that also have unknown values. A much lower level of degradation is observed when an object with unknown values is classified using a correct decision tree.

This treatment has assumed that no information whatsoever is available regarding an unknown attribute. Catlett (1985) has taken this approach a stage further by allowing partial knowledge of an attribute value to be stated in Shafer notation (Garvey, Lowrance and Fischler, 1981). This notation permits probabilistic assertions to be made about any subset or subsets of the possible values of an attribute that an object might have.
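The mechanism the excerpt picks up mid-sentence appears to be: when a test attribute's value is unknown, explore every branch, combine the class distributions found at the leaves, and choose the class with the higher value. A speculative sketch under that reading (all names and the example tree are mine; a refinement would weight each branch by how common its value is):

```python
def class_distribution(obj, node):
    """Return {class: weight}; unknown values (None) explore every branch."""
    if "attribute" not in node:              # leaf: a class-count distribution
        return dict(node["counts"])
    value = obj.get(node["attribute"])
    branches = node["branches"]
    targets = branches.values() if value is None else [branches[value]]
    combined = {}
    for child in targets:
        for cls, w in class_distribution(obj, child).items():
            combined[cls] = combined.get(cls, 0) + w
    return combined

tree = {"attribute": "humidity",
        "branches": {"high": {"counts": {"N": 3, "P": 1}},
                     "normal": {"counts": {"P": 6, "N": 0}}}}
dist = class_distribution({"humidity": None}, tree)   # humidity is unknown
print(max(dist, key=dist.get))                        # 'P': the class with the higher value
```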


7. The selection criterion

Attention has recently been refocussed on the evaluation function for selecting the best attribute-based test to form the root of a decision tree. Recall that the criterion described earlier chooses the attribute that gains most information. In the course of their experiments, Bratko's group encountered a medical induction problem in which the attribute selected by the gain criterion ('age of patient', with nine value ranges) was judged by specialists to be less relevant than other attributes. This situation was also noted on other tasks, prompting Kononenko et al (1984) to suggest that the gain criterion tends to favor attributes with many values.

Analysis supports this finding. Let A be an attribute with values A1, A2, ..., Av and let A' be an attribute formed from A by splitting one of the values into two. If the values of A were sufficiently fine for the induction task at hand, we would not expect this refinement to increase the usefulness of A. Rather, it might be anticipated that excessive fineness would tend to obscure structure in the training set so that A' was in fact less useful than A. However, it can be proved that gain(A') is greater than or equal to gain(A), being equal to it only when the proportions of the classes are the same for both subdivisions of the original value. In general, then, gain(A') will exceed gain(A) with the result that the evaluation function of Section 4 will prefer A' to A. By analogy, attributes with more values will tend to be preferred to attributes with fewer.

As another way of looking at the problem, let A be an attribute with random values and suppose that the set of possible values of A is large enough to make it unlikely that two objects in the training set have the same value for A. Such an attribute would have maximum information gain, so the gain criterion would select it as the root of the decision tree. This would be a singularly poor choice since the value of A, being random, contains no information pertinent to the class of objects in the training set.

ASSISTANT (Kononenko et al, 1984) solves this problem by requiring that all tests have only two outcomes. If we have an attribute A as before with v values A1, A2, ..., Av, the decision tree no longer has a branch for each possible value. Instead, a subset S of the values is chosen and the tree has two branches, one for all values in the set and one for the remainder. The information gain is then computed as if all values in S were amalgamated into one single attribute value and all remaining values into another. Using this selection criterion (the subset criterion), the test chosen for the root of the decision tree uses the attribute and subset of its values that maximizes the information gain. Kononenko et al report that this modification led to smaller decision trees with an improved classification performance. However, the trees were judged to be less intelligible to human beings, in agreement with a similar finding of Shepherd (1983).
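A minimal sketch of the subset criterion (my code, not ASSISTANT's): every two-way partition of an attribute's values is scored, and the split with the highest gain is kept. On the outlook counts from Section 4 it isolates overcast from the other two values:

```python
from itertools import combinations
from math import log2

def info(p, n):
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def best_binary_split(value_counts):
    """value_counts: {value: (pi, ni)}. Try every subset S of values and
    return the two-way split {S, rest} with the highest information gain."""
    values = list(value_counts)
    p = sum(pi for pi, _ in value_counts.values())
    n = sum(ni for _, ni in value_counts.values())
    best = None
    for r in range(1, len(values) // 2 + 1):
        for subset in combinations(values, r):
            inside = [value_counts[v] for v in subset]
            outside = [value_counts[v] for v in values if v not in subset]
            e = 0.0
            for group in (inside, outside):
                gp, gn = sum(x for x, _ in group), sum(x for _, x in group)
                e += (gp + gn) / (p + n) * info(gp, gn)
            g = info(p, n) - e
            if best is None or g > best[0]:
                best = (g, set(subset))
    return best

# outlook counts from the worked example of Section 4
print(best_binary_split({"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}))
# -> gain of about 0.226 bits for {overcast} vs. the rest
```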

Limiting decision trees to a binary format harks back to CLS, in which each test was of the form 'attribute A has value Ai', with two branches corresponding to true and false. This is clearly a special case of the test implemented in ASSISTANT, which