0 0(u)= 0 otherwise ' The VC-dimension of the set of linear discriminant functions with n inputs is equal to n + 1 (Vapnik, 1982). Attempts to obtain the exact value of the VC-dimension for other classes of decision rules encountered substantial di culties. Most of the estimated values appear to exceed the real one by a large amount. Such poor estimates result in rather crudely overestimated evaluations of the confidence interval D. This article considers the possibility of empirically estimating the VC-dimension of a learning machine. The idea. that this might be possible was inspired by following observations. > } = 0 In other words the one-sided uniform convergence (within a given set of function) of frequencies to proba- bilities is a necessary and sufficient condition for the consistency of the learning process. In the 30s Kolmogorov and Smirnov found the law of distribution of the maximal deviation between distribution function and an empirical distribution function for any random variable. This result can be formulated as follows. For the set of functions the equality P{sup(p,(a) - v,(a)) > e}: exp{-2ea/} - 2 (-1) ' exp{-2eana/} 1 1 holds for sufficiently large l, where p,(a) = Ef,(a) and ,(a) = ? Ei:i f,(xi, a). This equality is inde- pendent of the probability measure on x. Note that the second term is very small compared to the first one. In (Vapnik and Chervonenkis, 1971) it was shown that for any set of indicator functions f(x, E A with finite VC-dimension h, the rate of uniform convergence is bounded as follows ( { l/ 'n2l/h+l P{sup(p(cQ- )) > } < min 1,exp Cl l a l where cl and ca are universal constants such that cl < 1 and ca > 1/4. As above, this inequality is independent of the probability measure. A direct consequence is that for any given 5 there exists an l0 = lo(h, 5, ) such that for any l > 10, the inequality P{sup(p(a) - v(a)) > } < exp{-(ca - holds. In (Vapnik and Chervonenkis, 1971) it was shown that ca is at least 1/4. The result was later improved to ca : 2 by (Devroye, 1982). Interestingly, for ca : 2, the above inequality is close to the Kolmogorov-Smirnov equality obtained for a simple set of functions. This means that, although the above is an upper bound, it is asymptotically close to the exact value. Tightening the upper bound on the value of Cl is theoretically very difficult, because the term it multiplies is not the main term in the exponential for large values of l/h. Now, suppose that there exists a value for the constant Cl, 0 < Cl _< l, for which the above upper bound (which is independent of the probability measure) is tight. Furthermore, suppose that the bound is tight not only for large numbers of observations, but also for smaller numbers (statisticians commonly use the asymptotic Kolmogorov-Smirnov test starting with l = 30 and the learning processes usually involves several hundreds of observations). In this case, one can expect the function 4p*(l/h) defined by qb*( ): E(s PA(p(c Q -- )): P{sup(p(cQ - )) > to be independent of the probability measure, even for rather small values of l/h. Now, assuming we know a functional form for *, then we can experimentally estimate the expected maximal deviation between the empirical risk and expected risk for various values of l, and measure h by fitting * to the measurements. The remainder of the paper is concerned with finding an appropriate functional form for *, and applying it to measuring the VC-dimension of various learning machines. A learning algoritlnn is said to be consistent if the probabihty of error on the training set converges to the true probabihty of error when the size of the training set goes to infinity. > 0.5) and when the probability P{Ba} is close to one. Such a situation seems to arise when the ratio l/h is not too large and when l is large. In this case, when P(Ba) is close to one, l/h is not large and l is large, the bound E{sup(u (Z 2t, c ) - ,,-2, C1 ln(2l/h) + 1 ))} < eA -- l/h is valid, where U1 is a constant. In the following, we shall make use of this bound for small l/h when constructing an empirical estimate of the expectation. Case 3 (l/h is large). In appendix 1 we show that for the set of indicator functions with VC-dimension h the following bound holds: _ c /lnOi/h) + (7) where G'2 is a constant. This bound is true for all l/h. We shall use it only for large l/h, where the bound of case 2 is not valid. Frolu the conducted experiluents described in the section 6 we find that g 8. > h for which the inequalit is f lfilled, the followin9 bounds E{sup( ', a) _ 'tZ 1 4/ ( 'n(2l/ )+l 1 - 'V(B) ) E{sup(u (Z 2', t )- u}(Z 2', t ))} < V l +1 3 + (10) x/lh*(ln(2l/h*) + 1) are valid. Remark. If 0 the inequalities are true for all l > h. Note that in this case the left hand side of inequalities (9) and (10) do not depend on X*, but the right hand side does. Therefore the tightest inequality will be achieved for the X* with the smallest h*. Definition. The effective VC-dimension of the set if:r, C A, (for the given measure P) is the minimal VC-dimension of this set of functions defined on all subsets X* C X whose measure is almost one (/t(X*) > 1 - /*, where /* > 0 is a small value). As in the previous section, from (9) we obtain that for the effective VC-dimension h* the following inequality is true E{sup( ta)_ ttZ9 ta,, I 4P(B)(ln(2l/h*)+l 1- ln (B))(1 P(B)). (11) l/h* For the case when P(B,) is close [o one, the qua.n[i[y l/h* is small, and l is large we obtain p(d(Z *) + ))} < /h* where C is a constant. Since the bounds (9) and (10) have the same form as the bounds (5) and (7), to simplify our notation we use h in the next sections to denote the effective VC-dimension. 4 The Law of the largest deviations. In section 2 we gave several bounds, for different cases, on the expectation of the largest deviation over the set of decision rules. In this section we first give a single functional form that reflects the properties of the bounds. Then we conjecture that the same functional form approximates the expectation of the largest deviation. The bounds given in the previous section are summarized as follows 1 if l/h_<0.5 qa)+l E{sup(u t, t ) t 21 C1 if l/h is small (12) - , _< A C /ln(21/h)-I-1 .2 V if l/h is large. > h: E{sup ) - ))} ). 04) The constants can be found empirically by fitting to experimental data using a machine whose VC- dimension is known. If we assume that the constants obtained are universal, we can then use the function obtained this way to measure the capacity of other learning machines. 5 Measuring the Effective VC-Dimension of a Classifier If the function (r) approximates the expectation with sufficient accuracy, then the effective VC-dimension of the set of indicator functions realizable by a given classifier can be estimated on the basis of the following experiment. First, we generate a random independent set of size 2l, (15) using a generator of random vectors P(x) and a (possibly non-deterministic) generator of labels P(c01x ). Using this sample, we measure the quantity : , - ,.)). by approximating the expectation by an average over N independently generated sets of size 2l. i=1 (16) h* = arg n (li) - q)(li Ih)) i=1 The accuracy of the obtained VC-dimension estimate, and in fact, the validity of the approach presented in this paper, depend crucially on how well the function (l/h) describes the expectation (14). Repeating the above procedure for various values of l, we obtain the set of estimates (/1),..., (lk). The effective VC-dimension of the set of functions f(x, C A, can then be approximated by finding the integer parameter h* which provides the best fit between (l/h) and the (/i)'s: > _ j As described previously, the classifier was trained with gradient descent, but a very small "weight decay" term was added (the reason for the weight decay will be explained below). The parameter fi determines the amount of smoothing: for very large fi, the components of the vector y are virtually independent, and the VC-dimension of the learning machine is close to n. As the value of fi decreases, strong correlations between input variables start to appear. This causes the distribution of preprocessed vectors to have a very small variance in some directions. Intuitively, an appropriate weight decay term will make these dimensions effectively "invisible" to the linear classifier, thereby reducing the measured effective VC-dimension. The measured VC-dimension decreases down to 1 for fi = 0. When fi takes non-zero values, there is no theoretically known evaluation of the VC-dimension, and the success of the method can be measured only by how well the experimental data is described by the functional q Figures 5(a-c) show the average maximal deviation as a function of l, for different smoothing parameters fi, and demonstrates how well the function q --z), where h* is the estimate of the effective VC-dimension, approximates the experimental data. Figure 6 shows the estimated value of the VC-dimension as a function of the smoothing parameter fl. 7 Conclusion We have shown that there exist three different regimes for the behavior of the maximum possible deviation between the frequencies of error on two different sets of examples. For very small values of l/h (< 1/2), the maximal deviation is 1, for small values of l/h (from 1/2 up to about 8), it behaves like log(2l/h)/(l/h), for large values it behaves like x/log(2l/h)/(l/h). We have introduced the concept of effective VC-dimension, which takes into account some weak properties of probability distribution over the input space. We prove that the effective VC-dimension can be used in place of the VC-dimension in all the formulae obtained previously. This provides us with tighter bounds on the maximal deviation. Based on the functional forms of the bounds for the three regimes, we propose a single formula. that contains two free parameters. We have shown how the value of these parameters can be evaluated experi- mentally. We show how the formula. can be used to estimate the effective VC-dimension. We illustrate the method by applying it to various learning machines based on linear classifiers. These experiments show good agreement between the model and the experimental data in all cases. Interestingly, it appears that, at least within the set of linear classifiers, the values of the parameters are universal. This universality was confirmed by recent results obtained since the first version of this paper with linear classifiers trained on real-life tasks (image classification). Excellent fit with the same values of the constants were obtained even when the classifiers were trained by minimizing the empirical risk subject to various types of constraints. This included simple constraints, such as a limit on the norm of the weight vector (equivalent to weight decay), and more complex ones which improve the invariance of the classifier to distortions of the input image by limiting the norm of some Lie derivatives (Cortes, 1993; Guyon et al., 1992). The extension of the present work to multilayer networks faces the following difficulties. First, the theory is derived for methods that minimize the empirical risk. However, existing learning algorithms for multilayer nets cannot be viewed as minimizing the empirical risk over entire set of functions implementable by the network. Because the energy surface has multiple local minima, it is likely that, once the initial parameter value is picked, the search will be confined to a subset of all possible functions realizable by the network. The capacity of this subset can be much less than the capacity of the whole set (which may explain why neural nets with large numbers of parameters work better than theoretically expected). Second, because > = : . *)-uZ and the event 5 defined as in equation (4). The following lemma is true 1,..., ; p{-W} The proof of this lemma will follow the same scheme as the proof of Theorem A2 in (Vapnik, 1982) pp. 170-172. Write the probability in the evident form ) = (19) Z lyl [ 21 0{(pl(Z 2', *) - ,,, , _ (Pl( Z2', ,*) -- = ,-*)) _ 5}dp(Z 0 - whereZ(2/) is the space of samples Z u P(Z21' *) : t lk ' ] + 2k ' and f(x, *) is the function which attains the maximum deviation of frequencies in two halPsamples. Denote (i = 1, 2,..., (2/) ) the set of all possible permutations of the elements of the sample (18). Denote also ,, .*) = d( ,, .*) _ 4( ', .*) = ( ' , .*) + ,, .*)). It is obvious that the equality fz x( ) = O{x(TiZ 2', o *) - e}O{ o. ,) ( o{(x( , -*) - 4 i=1 is true. Note that for any fixed Z u and the following inequality is true: (20) a{x( ', .*) _ }a{ . ,, .*) - 5} _< > , (22) -k (, +1,1 , -1)) (23) We now estimate the value P. The derivation of the estimate of P repeats the derivation of similar estimates in (Va.pnik, 1982) pp. 173-176. __ 4(1+1) m (m+1)(21--m+1) (k* -- 9 - q- 1) 2 if m is a,n even number lnP/2 < 4(1+1) m--1 l)(k* -- m--1 +1)( (k* ) if m is a,n add number (24) where k* is the lea.st integer satisfying the condition k* m l 2l 2 (25) From (23) a,nd (24) we obtain lnP/2 < - (k* - m/2). And from this inequa,lity a,nd (25) we obta,in lnr/2 < -5 d. Substitute now the estimate of the va.lue P into right-ha.nd side of the inequa.lity (20). We obta.in (26) > For this estimation divide the interval area into two parts: (0,u), where we use the trivial estimate of conditional probability ,) = , and the area (u., ), were we use the estimate (26). We obtain IB, ) < de + (T) P(B,) exp{-T}& = u + (T) ISP(B,) exp{-7} The minimum of the right-hand side of the inequality is attained when u = 4 h(ln(2l/h) + 1) - In P(B,) In such a way we obtain the estimate 4h(ln(2l/h) + 1) - In P(B,) + 1 ) < The theorem is proved. Now we want to obtain the estimate of the expectation of the random value (4) for an arbitrary l/h. For this we use the inequality (Vapnik, 1982) p.172 r{sup Ip (rt, Z 2') - ' 2, e} 3( ) a exp{-e2/} (EA As in the case above we get E{sup M( ') - exp{_e2/}& < , j j< de+3( 2le a 2le a 1 exp{-u/} ,, + a(T ) exp{-e,,t}de: ** + a(T ) For /ln(2l/h) + 1 ":V t7r we obtain the estimate /ln(W x/lh(ln(2l/h) + 1) /ln(2l/h) 1 < c + >
Back to Index Published in Neural Computation (6)5, 1994, pp 851-876
The Vapnik-Chervonenkis dimension is a mathematical quantity that measures the "capacity" of a learning machine.
Digital Library Home