Robust separation of finite sets via quadratics
Abstract
Given a pair of finite disjoint sets $A$ and $B$ in $\mathbb{R}^n$, a fundamental problem with many important applications is to efficiently determine a non-trivial, yet 'simple', function $f(x)$ belonging to a prespecified family $F$ which separates these sets when they are separable, or 'nearly' separates them when they are not. The most common class of functions $F$ applied to data is the class of linear functions, because linear programming is often a convenient and efficient tool to employ, both in determining separability and in generating a suitable separator. When the sets are not linearly separable, it is possible that they are separable over a wider class $F$ of functions, e.g., quadratics. Even when the sets are linearly separable, another function may 'better' separate in the sense of more accurately predicting the status of points outside of $A \cup B$. We define a 'robust' separator $f$ as one for which the minimum Euclidean distance between $A \cup B$ and the set $S = \{x \in \mathbb{R}^n : f(x) = 0\}$ is maximal. In this paper we focus on robust quadratic separators and develop an algorithm, using sequential linear programming, to produce one when one exists. Numerical results are presented.

Scope and purpose

A fundamental problem with many important applications is to efficiently determine a non-trivial, yet 'simple', function $f(x)$ which separates a pair of sets $A$ and $B$ in the sense that $f$ is positive over $A$ and negative over $B$. The function is then used to associate points outside of the sets with either $A$ or $B$. As an example, if $A$ consists of the results of tissue samples of cancerous patients, and $B$ consists of the results of tissue samples from non-cancerous patients, a new sample $c$ will be associated with either $A$ or $B$ according to the sign of the value $f(c)$. Most of the literature to date has focused on linear functions $f$, as they are relatively easy to compute. In this paper we explore the use of quadratic functions. The advantage of using such functions is twofold: they can often separate when linear functions cannot, and they can separate more accurately than linear functions. We first define the notion of a 'robust' separating function, one which is as immune as possible (given the data) to small perturbations of the data. We then suggest an algorithm to (approximately) compute a robust quadratic separator, and show that it can be computed via a sequence of linear programs. The algorithm is tested both on randomly generated problems and on the publicly available 'Wisconsin Breast Cancer Database'. Its accuracy on this database is somewhat higher than that obtained by using robust linear separators.

Keywords: Pattern classification; Separation; Non-convex optimization

1 Introduction, review and preview

1.1 Introduction

The terms 'discriminant analysis', 'pattern classification', and 'separation of sets' are different expressions describing the common and practically important problem of partitioning $\mathbb{R}^n$ into $t$ subsets, based on a given set of data points $D$ composed of $t$ types of data. The applications include medical diagnosis, scoring of credit card applications, biological specimen classification, digit and handprinted character recognition, and identification of tax returns for audit. There are two quite different approaches to this problem, 'parametric' and 'non-parametric', where the former applies to methods wherein one assumes that the points of $D$ are outcomes of random experiments whose underlying random variables are of a certain type, with parameters to be estimated from $D$.
We are interested in non-parametric approaches, wherein no such statistical assumptions are made. In this paper we limit our attention to two-set separation ($t = 2$) for several reasons, among which are:

• the multi-set problem can be solved by solving a sequence of two-set separation problems (e.g., see [1]),
• two (or more) disjoint sets can be completely separated by solving a sequence of separating problems (e.g., see [2]), and
• the two-set separation problem is of interest in its own right.

To illustrate the latter point, we briefly describe an application of the two-set linear separation problem to the detection of breast cancer (see [3,4] for a complete description of the model). We discuss the accuracy of our model on this data set in Section 5.2. Here the database $D$ consists of 683 points (535 in an early study), each a vector in Euclidean nine-space $\mathbb{R}^9$. Each vector corresponds to a tissue sample, a fine needle aspirate (FNA) of human breast tissue taken from a specific patient. Each of the nine components of the vector corresponds to a specific attribute of the sample.¹ The numbers in each component are integers ranging from 1 to 10, and represent a pathologist's judgement as to the degree to which the given sample displays this attribute. A sample point is classified into set $A$ (or set $B$) depending on the presence (or absence) of cancer in that sample. In the aforementioned database, 239 of the 683 samples were cancerous.

¹ The nine attributes are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitosis.

When a new patient's sample, say $P$, is obtained, we want to associate it with one of the two sets $A$ or $B$. In the event that these sets are separable by a hyperplane $H(w, \gamma) = \{x : w \cdot x - \gamma = 0\}$, we would then associate $P$ with $A$ if $P \in S_A = \{x : x \cdot w \geqslant \gamma\}$, and with the set $B$ otherwise. In the event that $P \in S_A$, the patient would either undergo a biopsy or be re-examined to confirm the diagnosis.

When $A$ and $B$ are not linearly separable, a possible approach is to construct a piecewise linear separator by enclosing the ambiguous points in a band (a pair of parallel hyperplanes) according to some rule, eliminating the correctly classified points, and then trying to linearly separate the remaining ambiguous points. Several applications of this procedure might be needed to completely separate $A$ and $B$; see, e.g., Mangasarian [2]. A simple (but potentially computationally expensive) scheme is to associate each point with the type of its closest Euclidean neighbor. In this way, $\mathbb{R}^n$ is partitioned into as many 'cells' as there are points in $D$; this is the so-called Voronoi partition (see, e.g., Preparata [5]), and we set $S_A$ ($S_B$) equal to the union of all cells of points in $D$ which are closest to $A$ ($B$). Ties are not addressed here; it is assumed that they can be broken according to some unambiguous rule based on the specific application.

With some abuse of notation, we will use the same symbols to denote the matrices whose rows are the points of $A$ and $B$. Let $|A| = q$ and $|B| = r$, so that $A$ is $q$ by $n$ and $B$ is $r$ by $n$.

Much of the work on the two-set separation problem involves finding linear functions which separate. This is so because most of the models studied involve the solution of linear programs (one such formulation is sketched below). When $A$ and $B$ are 'strongly' linearly separable there is an infinite number of linear separators.
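The LP behind such a separability check is not written out at this point in the paper; the following is a minimal sketch of a standard formulation (the linear special case, $M = 0$, of Mangasarian's program (1) recalled in Section 1.2). The function name, the tolerance, and the use of scipy's `linprog` are our own choices, not the authors'.

```python
import numpy as np
from scipy.optimize import linprog

def strong_linear_separability(A, B, tol=1e-8):
    """LP check: maximize s - t subject to
       w.a - gamma >= s for all rows a of A,
       w.b - gamma <= t for all rows b of B,
       -1 <= w_i <= 1 (to normalize the cone of separators).
    A is q x n and B is r x n; the sets are strongly linearly
    separable exactly when the optimal gap s - t is positive."""
    q, n = A.shape
    r, _ = B.shape
    # Variable vector: [w (n), gamma, s, t]; linprog minimizes t - s.
    c = np.zeros(n + 3)
    c[n + 1], c[n + 2] = -1.0, 1.0
    # s - (w.a - gamma) <= 0  ->  row [-a, 1, 1, 0]
    G_A = np.hstack([-A, np.ones((q, 1)), np.ones((q, 1)), np.zeros((q, 1))])
    # (w.b - gamma) - t <= 0  ->  row [b, -1, 0, -1]
    G_B = np.hstack([B, -np.ones((r, 1)), np.zeros((r, 1)), -np.ones((r, 1))])
    G = np.vstack([G_A, G_B])
    bounds = [(-1, 1)] * n + [(None, None)] * 3
    res = linprog(c, A_ub=G, b_ub=np.zeros(q + r), bounds=bounds)
    w, gamma, s, t = res.x[:n], res.x[n], res.x[n + 1], res.x[n + 2]
    gamma_mid = gamma + 0.5 * (s + t)   # threshold midway between the sets
    return s - t > tol, w, gamma_mid
```

Strong separability holds exactly when the returned gap is positive; the box bounds on $w$ merely pick a bounded representative from the cone of separators.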
In this strongly separable case, we are interested in the unique 'robust' linear separator $H(w, \gamma) = \{x : w \cdot x - \gamma = 0\}$ which has the property that the (Euclidean) distance between it and the data points $P = A \cup B$ is maximal. Thus $H(w, \gamma)$ will still separate $A$ and $B$ under (at least) small perturbations of the points. This has, at least for the aforementioned breast cancer study, produced more accurate separators at the (modest) expense of having to solve more LPs. Another feature of the robust linear separator is that if the data points are subject to translations or rotations (which preserve distance), the separator of the transformed points is simply the translation or rotation of the original separator. In particular, if the labels of the two sets $A$ and $B$ are switched, the separating set $S = \{x : w \cdot x - \gamma = 0\}$ remains the same.

In this paper, we study the use of quadratic separating functions and, in particular, robust quadratic separating functions. There are (at least) two reasons for this: improved separation properties, and computational feasibility. Thus we are interested in efficiently determining a non-trivial quadratic function $q(x) = \tfrac12 x^T M x + w \cdot x - \gamma$ such that $q(a) \geqslant 0$ for all $a \in A$ and $q(b) \leqslant 0$ for all $b \in B$, when one exists. Generally, if one does exist, there will be an infinite number of such separating functions. We will then determine a robust separating function, i.e., a function which will still separate the sets if the data points of $A$ and $B$ undergo small perturbations. Clearly such functions cannot produce worse separators than the class of linear functions, and they usually produce better ones. In fact, quadratic separators may well separate $A$ and $B$ when linear functions do not; this was the case for the breast cancer data. Fortunately, it also turns out that the amount of additional computation required is still modest.

Quadratic functions often separate two sets $A$ and $B$ which are not linearly separable. For example, the sets $A = \{(1, 0), (-1, 0), (0, 1), (0, -1)\}$ and $B = \{(0, 0)\}$ in $\mathbb{R}^2$ cannot be separated by a linear function, but are quadratically separable (e.g., by $q(x) = x_1^2 + x_2^2 - 0.25$). Even sets which can be separated by linear functions can be 'more robustly' separated by quadratics. For example, in $\mathbb{R}^2$ let $A$ be the set $\{x \in \mathbb{R}^2 : x_1^2 + x_2^2 = 4,\ x_2 \geqslant \tfrac12\}$ and let $B = \{(0, 0)\}$. The line $x_2 = \tfrac14$ linearly and robustly separates $A$ and $B$, but the quadratic $x_1^2 + x_2^2 = 1$ separates more robustly (see Fig. 1).

In terms of computability, there are two issues. Once a quadratic separating function $q$ is determined, the classification of a (new) data point $P$ is nearly as trivial as the computation of $w \cdot P$. The main issue, however, is the computation of $q(x)$, i.e., the computation of $M$, $w$ and $\gamma$. It is true that more computational machinery is required (as compared to the computation of linear separators), but we will show that it is not unreasonable. This is less of an issue than one might expect: in practice one need only compute one separator based on a given data set to use on subsequent data, and so it is not absolutely essential that the computation of $q$ be 'ultra-efficient'. The algorithm which we will present was programmed in MATLAB on a Pentium Pro 400 MHz personal computer and produced robust quadratic separators in a matter of minutes for the breast cancer data.

We will assume throughout that $A$ and $B$ are quadratically separable; this can be determined by the solution of a simple LP, sketched below. As mentioned, this was the case for the breast cancer data.
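A minimal sketch of that LP check, under the same conventions as the linear sketch above: each point is lifted to its quadratic monomials, and strong quadratic separability holds exactly when the maximal gap $s - t$ is positive. This is the program recalled as (1) in Section 1.2; the helper names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def quad_features(x):
    """Monomials of (1/2) x^T M x + w.x for symmetric M:
    (1/2) x_i^2 on the diagonal, x_i x_j (i < j) off it, then x itself."""
    n = len(x)
    feats = [0.5 * x[i] * x[j] if i == j else x[i] * x[j]
             for i in range(n) for j in range(i, n)]
    return np.array(feats + list(x))

def strong_quadratic_separability(A, B, tol=1e-8):
    """Program (1): maximize s - t subject to
       (1/2) a^T M a + a.w >= s on A, (1/2) b^T M b + b.w <= t on B,
       -1 <= M_ij, w_i <= 1.  Strong separability iff the optimum > 0."""
    PA = np.array([quad_features(a) for a in A])
    PB = np.array([quad_features(b) for b in B])
    d = PA.shape[1]
    c = np.zeros(d + 2); c[d], c[d + 1] = -1.0, 1.0        # minimize t - s
    G = np.vstack([
        np.hstack([-PA, np.ones((len(PA), 1)), np.zeros((len(PA), 1))]),
        np.hstack([PB, np.zeros((len(PB), 1)), -np.ones((len(PB), 1))]),
    ])
    res = linprog(c, A_ub=G, b_ub=np.zeros(len(PA) + len(PB)),
                  bounds=[(-1, 1)] * d + [(None, None)] * 2)
    return -res.fun > tol, res.x

# The sets A = {(1,0),(-1,0),(0,1),(0,-1)}, B = {(0,0)} from the text are
# not linearly separable, but the check reports a strict quadratic gap.
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
B = np.array([[0, 0]], dtype=float)
print(strong_quadratic_separability(A, B)[0])   # True
```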
If this assumption fails to hold, one can possibly resort to extensions of the methods described in Falk and Lopez-Cardona [6] or Mangasarian [2], but we have not attempted to do so.

1.2 Review

Most of the existing models which have been proposed to identify separating functions focus on identifying linear separators. Some even focus on identifying 'robust' linear separators, although they often use a different norm. We will not survey these (linear) models in this paper, but most are referenced in Falk and Lopez-Cardona [6] or Stam [7]. The one model that does produce a robust linear separator was introduced by Cavalier, Ignizio and Soyster [8], who suggested a 'heuristic' procedure for finding one. Falk and Lopez-Cardona [6]:

• showed that the Cavalier et al. 'heuristic' procedure is guaranteed to converge,
• developed a more efficient algorithm for solving the model, and
• developed a branch and bound procedure to 'separate' $A$ and $B$ as nearly as possible when $A$ and $B$ are not separable.

For quadratic separators there is relatively little literature, and none which addresses the robust quadratic separation problem. In 1965, Rosen [9] proposed an iterative procedure for finding an ellipsoid $E(M, u) = \{v : (v - u)^T M (v - u) = 1\}$ of minimum volume surrounding a specified set of points $a \in A$. In the same year, Mangasarian [2] showed that two sets $A$ and $B$ are quadratically separable if and only if the optimal value of the following programming problem

$$\begin{aligned} \max_{s,t,M,w}\;& s - t \\ \text{subject to}\;& \tfrac12 a^T M a + a^T w \geqslant s \quad \text{for all } a \in A, \\ & \tfrac12 b^T M b + b^T w \leqslant t \quad \text{for all } b \in B, \\ & -1 \leqslant w_i, M_{ij} \leqslant 1 \end{aligned} \tag{1}$$

is positive, i.e., $s^* - t^* > 0$. As a consequence, a necessary and sufficient condition for inseparability or weak separability of the sets $A$ and $B$ is that the value of the above problem is zero. In the strongly separable case, the solution of this problem is not unique, and hence it is unlikely to produce a robust separator.

In 1994, Duarte Silva and Stam [10] introduced modified formulations of the 'Minimize Sum of Deviations' model [11] and a hybrid model [12,13] for quadratic separation. These models employ 'deviational variables' and minimize the sums of these variables. Such models can be considered to produce 'robust separators' with respect to an $L_1$ norm. Duarte Silva and Stam give compelling evidence, through their experiments, that these 'second order extensions of the well known methods' can outperform traditional methods when the data departs significantly from the multivariate Gaussian family of distributions. Finally, Banks and Abad [14] proposed to use linear programming heuristics applied to a quadratic transformation of the observation data. In this method, they used quadratic transformations of the data to be consistent with a 'Quadratic Discriminant Function', a parametric method proposed by Smith [15].

1.3 Preview

We now preview the contents of this paper. In the next section, we define weak and strong separation, and motivate and define the basic optimization model to be solved. In the same section we mention some of the previous models and work which has been done. Section 3 focuses on robust linear and quadratic separators, with mention of the former to emphasize the additional computation required to find a robust quadratic separator. In Section 4, we present an algorithm to find a robust quadratic separator, give an example to illustrate the method, discuss convergence of the method, and address some computational issues.
Section 5 contains computational results, both on randomly generated problems and on the Wisconsin Breast Cancer Database. Finally, an appendix is included which focuses on the problem of finding the distance between a point and the zero set of a given quadratic function.

With some abuse of notation, we will use the term 'linear function' for functions of the form $f(x) = w \cdot x - \gamma$ instead of the more correct term 'linear affine function'.

2 General separability and robustness

Definition 1. Following Marlow [16], we say that the sets are linearly separable if there is some linear function $w \cdot x - \gamma$ such that $w \cdot a - \gamma \geqslant 0$ for all $a \in A$ and $w \cdot b - \gamma \leqslant 0$ for all $b \in B$, where $\|w\| \neq 0$. If the sets are linearly separable and there is a linear function that satisfies these inequalities strictly, we say that the sets are strongly linearly separable; otherwise we say that the sets are weakly linearly separable.²

² This convention is not universally accepted, especially among papers dealing with pattern recognition (e.g., Mangasarian [2]), where the concepts of 'strong separability' and 'separability' are considered interchangeable and weak separability is not addressed. The reason for this is probably that in most applications the sets are either strongly separable or not separable at all.

If no linear function exists that satisfies the above inequalities, the sets are simply not separable. If $A$ and $B$ are strongly separable, there will be an infinite number of linear functions which strongly separate. This may or may not be true in the weakly separable case.

We can extend the above definition to allow for arbitrary functions $f$ in some class of functions $F$ (e.g., quadratics).

Definition 2. We say that the sets $A$ and $B$ are separable by $f$ if $f(a) \geqslant 0$ for all $a \in A$ and $f(b) \leqslant 0$ for all $b \in B$, where $f(x)$ is not identically 0. If the sets are separable by $f$ and the inequalities are all strict, we say that $A$ and $B$ are strongly separable by $f$. If $A$ and $B$ are separable and there is no non-trivial function $f \in F$ which strongly separates them, we say that the sets are weakly separable. If there is no function $f \in F$ which separates, we say that the sets are not separable over $F$.

Note that for a given family $F$, a given pair of sets $A$ and $B$ satisfies exactly one of the three relations: they are either strongly separable, weakly separable, or not separable. If the family $F$ is closed under addition and scalar multiplication (as are the families of linear functions and of quadratic functions), then when $A$ and $B$ are strongly separable there is an infinite number of functions which separate them. Henceforth we will assume that $F$ is closed under addition and scalar multiplication. The following result is easy to prove:

Theorem 1. The set of all separators of $A$ and $B$ forms a cone with the origin removed. Likewise, the set of all strong separators of $A$ and $B$ forms a cone with the origin removed.

When $A$ and $B$ are strongly separable, we seek a separating function $f \in F$ which 'robustly' separates, in the sense that the minimum Euclidean distance between the separating set $S = \{x \in \mathbb{R}^n : f(x) = 0\}$ and $A \cup B$ is maximal over all separating functions in $F$. Note the dependence of $S$ on $f$.

Definition 3. Let $\bar{F}$ denote the set of functions $\{f \in F : f \text{ separates } A \text{ and } B\}$.

Then we can define a robust separator:

Definition 4. A function $f \in F$ robustly separates $A$ and $B$ if it is a solution of the optimization problem

$$\max_{f \in \bar{F}} \; \min_{c \in A \cup B} \; \min_{x \in S} \|x - c\|. \tag{2}$$
Here the inside term $\min_{x \in S} \|x - c\|$ measures the smallest distance between a point $c \in A \cup B$ and the separating set $S$ (which depends on $f$), so that $\min_{c \in A \cup B} \min_{x \in S} \|x - c\|$ measures the smallest distance between $S$ and the set $A \cup B$. A robust separator is one which maximizes this minimal distance. Fig. 2 illustrates these remarks.

In general, there can be more than one robust separator of $A$ and $B$. For example, if $F$ is the set of quadratic functions, let $A = \{(1, 0)\}$ and $B = \{(-1, 0)\}$; there are a number of robust quadratic separators, as illustrated in Fig. 3. In the case that $F$ is the set of linear functions, it can be shown that the robust linear separator is unique. When $F$ is the set of quadratic functions, we know of no condition on the sets $A$ and $B$ which would guarantee uniqueness. One might think that uniqueness would be guaranteed if the number of points $q + r$ in $A$ and $B$ is sufficiently large, but the simple example with $A = \{(1, 0), (2, 0), (3, 0), \ldots\}$ and $B = \{(-1, 0), (-2, 0), (-3, 0), \ldots\}$ has an infinite number of robust quadratic separating functions. One interesting theoretical problem, which we do not address, is to distinguish a 'best' robust quadratic separator when there is an infinite number of such functions; this situation did not arise in any of the data which we used to test our algorithm. However, one can show that the set of all robust separators in $F$ is a cone with the origin removed:

Theorem 2. If $f$ is a robust separator, then so is $\tau f$ for any $\tau > 0$. If $f_1$ and $f_2$ are robust separators, then so is the function $f_3 = f_1 + f_2$.

Proof. The proof that $\tau f$ robustly separates is trivial. Now assume that $f_1$ and $f_2$ are robust separators. This means that the distance between the set $A \cup B$ and each of the sets $S_i = \{x \in \mathbb{R}^n : f_i(x) = 0\}$ is maximal over all separating functions. Let this distance be denoted by $\bar\tau$ (it must be the same for both sets, else the $f_i$ yielding the smaller distance is not robust). We need to show that $f_3(x) = f_1(x) + f_2(x)$ is a robust separator, i.e., that the distance between the set $A \cup B$ and the set $S_3 = \{x : f_3(x) = 0\}$ equals $\bar\tau$. Suppose this is not true, i.e., suppose the distance $\tau'$ between $S_3$ and $A \cup B$ is smaller than $\bar\tau$ (it cannot be larger, as that would violate the assumption that $f_1$ and $f_2$ are robust separators). Let $p$ be a point in $S_3$ nearest to $A \cup B$. We may suppose $p$ is closest to a point in $A$, i.e., $\|p - a\| = \tau'$ for some $a \in A$. Then $p \in \{x : f_i(x) > 0\}$ for both $i = 1$ and $i = 2$, because the distance $\tau'$ between $p$ and $A \cup B$ is smaller than the distance $\bar\tau$ between $A \cup B$ and the closest points in $S_i$, and the functions $f_1$ and $f_2$ are continuous (see Fig. 4). At the same time, we know that $f_1(p) + f_2(p) = 0$. We cannot have both $f_1(p) = 0$ and $f_2(p) = 0$, because $\tau' < \bar\tau$ and $\bar\tau$ is the distance between the sets $S_i$ and $A \cup B$. Therefore the $f_i(p)$ must be of different signs. This is a contradiction. □

There is an alternate definition of robustness which is useful in that it allows us to build an algorithm to find such a separator. For any finite set $C = \{C_1, \ldots, C_s\}$ and any scalar $\tau > 0$, let $C_\tau = \{c : \|c - C_k\| \leqslant \tau \text{ for some } C_k \in C\}$.

Definition 5. A function $f \in F$ robustly separates $A$ and $B$ if it weakly separates the sets $A_{\bar\tau}$ and $B_{\bar\tau}$, where $\bar\tau$ is the largest value of $\tau$ such that the sets $A_\tau$ and $B_\tau$ are separable.

Theorem 3. Definitions 4 and 5 are equivalent.

Proof. Suppose $f \in \bar{F}$ robustly separates $A$ and $B$ according to the first definition.
Then its minimum distance to $A \cup B$ is some amount $\bar\tau$. Create hyperspheres of radius $\bar\tau$ about all points of $A$ ($B$) and form their union $A_{\bar\tau}$ ($B_{\bar\tau}$). Note that $f$ weakly separates $A_{\bar\tau}$ and $B_{\bar\tau}$, as it strongly separates the interiors of these sets and yet there are points on their boundaries for which $f(x) = 0$. The sets cannot be further expanded (that would contradict the robustness of $f$), so $f$ must also robustly separate according to the second definition. On the other hand, if $f$ robustly separates according to the second definition, the distance of $S$ to $A$ and $B$ must be at a maximum (else the sets $A$ and $B$ could be further expanded). □

The equivalence of these definitions is important in this work, as problem (2) appears difficult to solve when $\bar{F} = \bar{Q}$ (the class of quadratic separating functions) due to the complexity of the objective function $\min_{c \in A \cup B} \min_{x \in S} \|x - c\|$. We will address this point in Section 3. The last theorem also allows us (in Section 3) to point out that problem (2) defining a robust quadratic separator actually has a solution (a fact that does not appear obvious).

3 Robust linear and quadratic separators

In this section we focus our attention on linear and quadratic separators. To avoid unnecessary diversions, we will assume that $A$ and $B$ can be strongly separated over the appropriate class. This is easily determined by solving the linear program (1). The case where $A$ and $B$ are not linearly separable is addressed in Falk and Lopez-Cardona [6]. We do not address the case where $A$ and $B$ are not quadratically separable.

Let $L$ denote the set of all non-trivial linear separators of $A$ and $B$, and $Q$ the set of all non-trivial quadratic separators of $A$ and $B$. Then the general problem which we seek to solve is

$$\max_{f \in \bar{F}} \; \min_{c \in A \cup B} \; \min_{x \in S} \|x - c\|, \tag{3}$$

where $F$ is either $L$ or $Q$ and $S = \{x : f(x) = 0\}$.

When $F = L$, the inner problem

$$\min_{x \in S} \|x - c\| \tag{4}$$

is easy to solve, as $S$ is a hyperplane of the form $S = \{x : w \cdot x - \gamma = 0\}$ and the distance between a point $c$ and $S$ has the closed form $|w \cdot c - \gamma| / \|w\|$. This leads to an optimization problem of the form

$$\max_{\|w\| = 1} \Big[ \min_{a \in A} w \cdot a - \max_{b \in B} w \cdot b \Big].$$

In the case that $A$ and $B$ are strongly separable, the equality constraint $\|w\| = 1$ can be replaced by the inequality constraint $\|w\| \leqslant 1$, so that this is a convex program. An efficient algorithm for its solution is found in Falk and Lopez-Cardona [6]; a direct, if less specialized, formulation is sketched below.
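As a minimal sketch (not the algorithm of [6]), the convex program can be handed to a general nonlinear solver. Here `scipy.optimize.minimize` maximizes a margin $\delta$ subject to $w \cdot a - \gamma \geqslant \delta$, $\gamma - w \cdot b \geqslant \delta$ and $\|w\| \leqslant 1$, an equivalent restatement of the max-min problem above; the variable layout and function name are ours.

```python
import numpy as np
from scipy.optimize import minimize

def robust_linear_separator(A, B):
    """Maximize the margin delta subject to
       w.a - gamma >= delta (a in A), gamma - w.b >= delta (b in B),
       ||w|| <= 1.  Variable vector z = [w (n), gamma, delta]."""
    n = A.shape[1]
    cons = [{'type': 'ineq', 'fun': lambda z, a=a: z[:n] @ a - z[n] - z[n + 1]}
            for a in A]
    cons += [{'type': 'ineq', 'fun': lambda z, b=b: z[n] - z[:n] @ b - z[n + 1]}
             for b in B]
    cons += [{'type': 'ineq', 'fun': lambda z: 1.0 - z[:n] @ z[:n]}]
    res = minimize(lambda z: -z[-1], np.zeros(n + 2),
                   constraints=cons, method='SLSQP')
    w, gamma, delta = res.x[:n], res.x[n], res.x[n + 1]
    return w, gamma, delta
```

Since $\|w\| = 1$ at an optimum of this program, the returned $\delta$ is exactly the minimum Euclidean distance between the hyperplane and $A \cup B$.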
When $F = Q$, the situation is quite different. In this case

$$S = \{x : \tfrac12 x^T M x + w \cdot x - \gamma = 0\},$$

where $M$ is a given $n \times n$ matrix, $w \in \mathbb{R}^n$ and $\gamma \in \mathbb{R}^1$ (not all zero), and problem (4) becomes much more complicated; in general, it is a non-convex problem. We have derived a general method to find the global solution of this problem which involves solving its dual (the details are in the Appendix). Briefly, one must

• diagonalize $M$,
• solve a one-dimensional maximization problem (the dual problem), and
• recover the solution of (4).

Note that the functions $f$ in (3) have the form $f(x) = \tfrac12 x^T M x + w \cdot x - \gamma$, so that $f$ is determined by $M$, $w$ and $\gamma$. Since we can assume that $M$ is symmetric, there is a total of $t = n(n+1)/2 + n + 1$ variables to be determined. Clearly, the set of all triplets $(M, w, \gamma)$ for which $f$ separates $A$ and $B$ forms a cone in $\mathbb{R}^t$, so we can limit the size of the components of the triplet. This can be done by putting bounds on the components of $M$ and $w$. Let

$$\mathcal{W} = \{(M, w) : -1 \leqslant M_{ij} \leqslant 1 \text{ for all } i, j, \text{ and } -1 \leqslant w_i \leqslant 1 \text{ for all } i\},$$

and recall the definition of the sets $A_\tau$ and $B_\tau$ introduced earlier. When $\tau = 0$, the problem

$$\begin{aligned} \max_{W,s,t}\;& s - t \\ \text{subject to}\;& W \in \mathcal{W}, \\ & \tfrac12 a^T M a + a^T w \geqslant s \quad \text{for all } a \in A_\tau, \\ & \tfrac12 b^T M b + b^T w \leqslant t \quad \text{for all } b \in B_\tau \end{aligned} \tag{5}$$

has a solution $(W^0, s^0, t^0)$ with $s^0 - t^0 > 0$, because we assumed that $A$ and $B$ are strongly separable. Because the set of all strong separators forms a cone, we can assume that some element of $W^0$ is $\pm 1$, i.e., $W^0$ lies on the boundary of the compact set $\mathcal{W}$.

Note that for $\tau > 0$, problem (5) is a linear program, albeit with an infinite number of constraints. The point $(W, s, t) = 0$ is always feasible, and for $\tau$ sufficiently large the optimal value of (5) is zero. As $\tau$ increases from 0, let $(W^\tau, s^\tau, t^\tau)$ denote a solution of (5). As before, let

$$\bar\tau = \sup\{\tau : s^\tau - t^\tau > 0, \text{ where } s^\tau - t^\tau \text{ is the optimal value of (5)}\}. \tag{6}$$

We can assume that some element of $W^\tau$ is $\pm 1$, i.e., $W^\tau$ lies on the boundary of the compact set $\mathcal{W}$. Pick any increasing sequence $\{\tau_i\}_{i=1}^\infty$ in $[0, \bar\tau]$ which converges to $\bar\tau$. The corresponding sequence $\{W^{\tau_i}\}_{i=1}^\infty$ lies on the boundary of the compact set $\mathcal{W}$, and so has a point of accumulation $\bar{W}$ on that boundary. This point is then a solution satisfying the second definition of a robust quadratic separator in Section 2. We have thus proven the result:

Theorem 4. If $A$ and $B$ are strongly separable over $Q$, then there is a robust quadratic separator of $A$ and $B$.
4 Finding an approximate robust separator of A and B

In this section we present an algorithm which will produce an approximate robust quadratic separator of $A$ and $B$ when they are strongly separable over $Q$. Because the subproblems are actually linear programs with an infinite number of constraints, we do not solve them exactly, and so can guarantee only 'approximate' solutions. A small illustrative example is given. We then discuss convergence of the algorithm and some computational considerations.

4.1 The algorithm

We now present an iterative algorithm that (approximately) solves problem (6). In what follows, eps is a preset tolerance which measures how close an 'incumbent' quadratic separator comes to being a 'robust' quadratic separator. The symbol $\bar{f}_k(x)$ denotes an 'incumbent' quadratic function which strongly, but perhaps not robustly, separates $A$ and $B$ at step $k$, and the symbol $\tau_k$ is the distance between the incumbent and $A \cup B$.

Step 0: With $A_0 = A$ and $B_0 = B$, solve LP$_0$ defined by (5) to determine if the sets are strongly separable via a quadratic. Let $s_0 - t_0$ denote the optimal value of LP$_0$. If $s_0 - t_0 = 0$, stop: the sets are not strongly separable. Otherwise, the solution of LP$_0$ consists of a matrix $M_0$, a vector $w_0$ ($M_0$ and $w_0$ cannot both be zero) and a pair of numbers $s_0$ and $t_0$ with $s_0 - t_0 > 0$. Set $\gamma_0 = (s_0 + t_0)/2$. If $s_0 - t_0 \leqslant$ eps, stop: the quadratic function $f_0(x) = \tfrac12 x^T M_0 x + x^T w_0 - \gamma_0$ strongly separates $A$ and $B$ and is accepted as the robust separator of $A$ and $B$. If $s_0 - t_0 >$ eps, set $\bar{f}_1(x) = f_0(x)$. Compute the minimum distance $d_0$ between the set $S_0 = \{x : f_0(x) = 0\}$ and the $q + r$ points in $C = A \cup B$ by the algorithm described in the appendix. Define $\tau_1 = d_0$,

$$A_1 = A \cup \{x_a : x_a \in S_0,\ \|x_a - a\| = d_0,\ a \in A\} \quad \text{and} \quad B_1 = B \cup \{x_b : x_b \in S_0,\ \|x_b - b\| = d_0,\ b \in B\}.$$

Thus, if one must continue, the output of Step 0 is $A_1, B_1, \bar{f}_1$, and $\tau_1$.

At the beginning of step $k = 1, 2, \ldots$ there is an incumbent quadratic separator $\bar{f}_k(x)$ which strongly separates $A$ and $B$, and expanded sets $A_k$ and $B_k$. The incumbent is at least a distance $\tau_k$ from $A$ and $B$.

Step $k$: Solve LP$_k$ defined by (5) with the finite sets $A_k$ and $B_k$ replacing the infinite sets $A_{\tau_k}$ and $B_{\tau_k}$. Let $s_k, t_k, M_k$, and $w_k$ solve LP$_k$. Set $\gamma_k = (s_k + t_k)/2$ and $f_k(x) = \tfrac12 x^T M_k x + x^T w_k - \gamma_k$. Compute the minimum distance $d_k$ between the set $S_k = \{x : f_k(x) = 0\}$ and the $q + r$ points in $C = A \cup B$ (by the algorithm described in the appendix), and record those points $x_a$ and/or $x_b$ where the minimum distance $d_k$ is achieved. There are two possibilities:

1. $d_k < \tau_k$ (the new separator $f_k$ is not as robust as the incumbent $\bar{f}_k$), or
2. $d_k \geqslant \tau_k$ (the new separator $f_k$ is at least as robust as the incumbent $\bar{f}_k$).

Suppose Case 1 occurs. If $s_k - t_k <$ eps, stop and accept the incumbent $\bar{f}_k$ as the robust separator. Otherwise, add new points of the form $a + (\tau_k / d_k)(x_a - a)$ and/or $b + (\tau_k / d_k)(x_b - b)$ to the sets $A_k$ and $B_k$ to form the new sets $A_{k+1}$ and $B_{k+1}$. Retain $\bar{f}_k$ as the incumbent (set $\bar{f}_{k+1} = \bar{f}_k$), retain $\tau_k$ as the best robustness measure (set $\tau_{k+1} = \tau_k$), replace $k$ by $k + 1$ and continue.

Suppose Case 2 occurs. If $s_k - t_k <$ eps, stop and accept the new separator $f_k$ as the robust separator. Otherwise, update the sets $A_k$ and $B_k$ to $A_{k+1}$ and $B_{k+1}$ by adding the new points $x_a$ and/or $x_b$, set $\bar{f}_{k+1} = f_k$ as the new incumbent, update $\tau_{k+1} = d_k$, replace $k$ by $k + 1$ and continue.

4.1.1 Discussion of the algorithm

We are actually using a version of the algorithm described in Blankenship and Falk [17] to solve problem (5) for each fixed value of $\tau$. This method selects an ever-expanding, yet finite, subset of the infinite number of constraints. As each ordinary linear program in the sequence is solved, its solution is checked as a possible optimal solution of the original problem. Those constraints of the latter problem that are violated by the incumbent solution are added to the expanding constraint set to define the next problem in the sequence.

When Case 1 occurs, we continue to solve problem (5) by adding more constraints (points on the boundary of the hyperspheres of radius $\tau_k$ centered at points in $A$ and $B$). Since more constraints are being added, it must follow that $s_k - t_k \geqslant s_{k+1} - t_{k+1}$, and it is quite possible that the convergence criterion is satisfied with the current incumbent being declared robust. When Case 2 occurs, a new incumbent is identified. However, because the old constraints are being replaced by new ones in this case, it could happen that $s_k - t_k < s_{k+1} - t_{k+1}$. A compressed sketch of the overall iteration appears below.
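The following condenses Steps 0 and $k$ into a single loop. It is a sketch rather than the authors' MATLAB implementation: the finite-set LP is built as in the separability sketches of Section 1, and the closest-point step uses a local equality-constrained solve in place of the appendix's global dual method, so it may return local minima. All helper names are ours.

```python
import numpy as np
from scipy.optimize import linprog, minimize

def solve_lp5(Apts, Bpts, n):
    """LP (5) over the current finite point sets: maximize s - t with
    (1/2) a^T M a + a.w >= s on Apts, <= t on Bpts, entries of (M, w)
    confined to [-1, 1].  Returns M, w, s, t."""
    def feats(x):
        f = [0.5 * x[i] * x[j] if i == j else x[i] * x[j]
             for i in range(n) for j in range(i, n)]
        return np.array(f + list(x))
    PA = np.array([feats(a) for a in Apts])
    PB = np.array([feats(b) for b in Bpts])
    d = PA.shape[1]
    c = np.zeros(d + 2); c[d], c[d + 1] = -1.0, 1.0        # minimize t - s
    G = np.vstack([
        np.hstack([-PA, np.ones((len(PA), 1)), np.zeros((len(PA), 1))]),
        np.hstack([PB, np.zeros((len(PB), 1)), -np.ones((len(PB), 1))])])
    res = linprog(c, A_ub=G, b_ub=np.zeros(len(G)),
                  bounds=[(-1, 1)] * d + [(None, None)] * 2)
    theta, s, t = res.x[:d], res.x[d], res.x[d + 1]
    M = np.zeros((n, n)); k = 0
    for i in range(n):
        for j in range(i, n):
            M[i, j] = M[j, i] = theta[k]; k += 1
    return M, theta[d - n:], s, t

def closest_point(M, w, gamma, c):
    """min ||x - c||^2 s.t. q(x) = 0, solved locally by SLSQP;
    a stand-in for the appendix's global dual method."""
    con = {'type': 'eq', 'fun': lambda x: 0.5 * x @ M @ x + w @ x - gamma}
    r = minimize(lambda x: (x - c) @ (x - c), c + 1e-3,
                 constraints=[con], method='SLSQP')
    return r.x, float(np.linalg.norm(r.x - c))

def robust_quadratic_separator(A, B, eps=1e-3, max_iter=50):
    """One-loop rendering of Section 4.1: solve LP (5) on the expanded
    point sets, measure the separator's distance to A u B, then either
    cut (Case 1) or accept a new incumbent (Case 2)."""
    n = A.shape[1]
    Ak, Bk = list(A), list(B)
    incumbent, tau = None, 0.0
    for _ in range(max_iter):
        M, w, s, t = solve_lp5(Ak, Bk, n)
        if s - t < eps:
            return incumbent            # accepted (None: never separable)
        gamma = 0.5 * (s + t)
        hits = [(c, *closest_point(M, w, gamma, c), lab)
                for lab, pts in (('A', A), ('B', B)) for c in pts]
        c, x, dk, lab = min(hits, key=lambda h: h[2])
        if dk >= tau:                   # Case 2: new incumbent
            incumbent, tau = (M, w, gamma), dk
            (Ak if lab == 'A' else Bk).append(x)
        else:                           # Case 1: cut at radius tau
            (Ak if lab == 'A' else Bk).append(c + (tau / dk) * (x - c))
    return incumbent
```

In Case 1 the appended point lies at distance $\tau_k$ from its data point along the segment to $x_a$ (or $x_b$), matching the update rule in Step $k$.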
4.2 Illustrative example

We will illustrate our approach on the following simple example:

$$A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 4 & 3 \\ 3 & 3 \\ 5 & 1 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 1 \\ 4 & 0 \\ 2 & 2 \\ 4 & 2 \end{pmatrix}.$$

These sets are quadratically, but not linearly, separable. Our algorithm used 10 iterations to find a robust quadratic separator with eps = 0.001. We illustrate the iterations graphically; the small intervals in the figures have a given point of $A$ or $B$ at one endpoint and a generated point at the other.

At iteration 0 (see Fig. 5), we found the separating quadratic to be

$$M_0 = \begin{pmatrix} 0.4271 & -0.3750 \\ -0.3750 & 1.0000 \end{pmatrix}, \quad w_0 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \quad \gamma_0 = -3.3802.$$

We then found the minimal distance $d_0 = 0.12396$, attained at the point $(1, 2)$ of set $A$. Since $d_0 >$ eps, we add an appropriate point to set $A$.

At iteration 1 (see Fig. 5), we found the separating quadratic to be

$$M_1 = \begin{pmatrix} 0.8640 & -0.6800 \\ -0.6800 & 1.6800 \end{pmatrix}, \quad w_1 = \begin{pmatrix} -2.6800 \\ 1.8640 \end{pmatrix}, \quad \gamma_1 = -4.0242.$$

We then found the minimal distance $d_1 = 0.14234$, attained at the point $(2, 2)$ of set $B$. Since $d_1 > \tau_1$, we add an appropriate point to set $B$ and set $\tau_2 = d_1$.

At iteration 2 (see Fig. 6), we found the separating quadratic to be

$$M_2 = \begin{pmatrix} 0.2075 & -0.0565 \\ -0.0565 & 1.1761 \end{pmatrix}, \quad w_2 = \begin{pmatrix} -1.000 \\ 0.1196 \end{pmatrix}, \quad \gamma_2 = 0.3909.$$

We then found that there are two points, $(5, 1)$ from set $A$ and $(4, 0)$ from set $B$, for which the distance is less than $\tau_2$; therefore we need to add appropriate points to both sets $A$ and $B$. We leave $\tau_3 = \tau_2$. The final quadratic is (Figs. 7–9)

$$M = \begin{pmatrix} 0.1827 & -0.0508 \\ -0.0508 & 0.1393 \end{pmatrix}, \quad w = \begin{pmatrix} -0.8682 \\ 0.2260 \end{pmatrix}, \quad \gamma = 0.1695.$$

4.3 Convergence

We can comment on convergence for the 'idealized' algorithm wherein each infinitely constrained problem $P_\tau$ is solved exactly. Given $\tau_k > 0$, problem $P_{\tau_k}$ is solved exactly. One output is a strong separator $f_{\tau_k}$ which is a distance $\tau_{k+1} > \tau_k$ from the sets $A$ and $B$. We can state:

Theorem 5. Successive distances $\tau_k$ strictly increase.

Similarly, we obtain:

Theorem 6. The optimal values $s_k - t_k$ of problem $P_{\tau_k}$ strictly decrease.

This follows because $A_k \subseteq A_{k+1}$ and $B_k \subseteq B_{k+1}$, so that the feasible region of problem $P_{\tau_k}$ is strictly larger than the feasible region of problem $P_{\tau_{k+1}}$.

The theoretical possibility exists, however, that the increasing sequence of distances $\{\tau_k\}$ will converge to some $\tau' < \bar\tau$. In such a case, a strong separator of $A_{\tau'}$ and $B_{\tau'}$ would be a distance $\tau'' > \tau'$ from $A$ and $B$, and the sequence would continue. One could guarantee convergence of the method, and forestall any possibility of stalling, by imposing a kind of binary search procedure on an interval $[0, \tau^+]$, where $\tau^+ > \bar\tau$ is so large that (5) has only the trivial solution. We did not do this, as we did not observe any type of 'stalling' behavior on any of the problems which we solved.

4.4 Some computational considerations

The method amounts to solving a sequence of ordinary linear programs. We used the solution of the $k$th problem to start the solution of the $(k+1)$st problem; this proved quite useful in our MATLAB implementation of the algorithm. In setting up the $k$th problem, one must find the closest point in $A_k \cup B_k$, which requires the eigenvalue decomposition of the current (trial) matrix $M_k$. While this operation is potentially expensive, the size of $M_k$ is only $n$ ($n = 9$ in the breast cancer study), and it was not a factor. Note also that the same diagonalization is used for all points in $A_k \cup B_k$, and so it needs to be done only once per update.

5 Computational results

The algorithm described in Section 4 was implemented using MATLAB 5.1, the mathematical modelling package from The MathWorks, Inc., running on a Pentium Pro 400 MHz personal computer with 512 Mbytes of memory and a 16 GB hard disk drive. Two types of computational results are presented: one based on randomly generated problems, and a second based on an established database related to cancer detection. We used the Simplex Method as well as the matrix diagonalization procedure implemented in MATLAB.

5.1 Results for randomly generated data sets

To test the efficiency of the method, we generated a random quadratic $f(x) = \tfrac12 x^T M x + x^T w - \gamma$ over $\mathbb{R}^n$, and a random $m$ by $n$ matrix $C$ (the integers $m$ and $n$ are user inputs). All random numbers were realized by a random number generator from a uniform distribution over $[0, 1]$. We defined the two matrices $A$ ($q \times n$) and $B$ ($r \times n$), where the rows $A_i$ of $A$ are those rows of $C$ that satisfy $\tfrac12 A_i^T M A_i + A_i^T w - \gamma > 0$ and the rows of $B$ are the remaining rows of $C$ (a sketch of this generation scheme appears below).
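A minimal sketch of this generation scheme, under the stated uniform-[0, 1] convention; the symmetrization of $M$ and the helper name are our own reading, since the paper does not spell out how the random matrix is made symmetric.

```python
import numpy as np

def random_instance(m, n, rng=np.random.default_rng(0)):
    """Draw a random quadratic and a random m x n matrix C with entries
    uniform on [0, 1]; split the rows of C by the sign of
    (1/2) c^T M c + c.w - gamma, regenerating C if either side is empty."""
    M = rng.uniform(0, 1, (n, n)); M = 0.5 * (M + M.T)   # symmetrized
    w, gamma = rng.uniform(0, 1, n), rng.uniform(0, 1)
    while True:
        C = rng.uniform(0, 1, (m, n))
        vals = 0.5 * np.einsum('ij,jk,ik->i', C, M, C) + C @ w - gamma
        A, B = C[vals > 0], C[vals <= 0]
        if len(A) and len(B):
            return A, B
```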
In the event that either $A$ or $B$ had no rows, we disregarded the data and generated a new $C$.

To evaluate the performance of the algorithm, we generated 50 problems for each of 32 $(m, n)$ pairs. Each problem had $n$ variables and the same total number $m = q + r$ of points. We increased $n$ in increments of 3 from 3 to 12, and $m$ in increments of 25 from 25 to 100 points and in increments of 50 from 100 to 300 points. A tolerance of 0.01 was used for the stopping criterion. The results are tabulated in Table 1. The top entry in each cell is the average number of iterations the algorithm took to generate the final solution, and the bottom entry is the average number of new points added. We note the following from Table 1:

• For a fixed dimension $n$, problems with a smaller number of points were harder to solve. Indeed, the cases marked '∗' in Table 1 were not solved to tolerance, as the number of iterations became excessive. This is probably because fewer than $n(n+1)/2 + n + 1$ points (the number of 'degrees of freedom' in the triplet $(M, w, \gamma)$) were determining the separating quadratic, resulting in a higher number of additional points to be added to the original problem. Also, the average number of iterations seems to be relatively 'insensitive' to $m$, probably because roughly the same number of points in $A \cup B$ is determining the quadratic.

• For a fixed number of points $m$, problems with higher dimension were harder to solve, as expected.

5.2 The Wisconsin Breast Cancer Database

The previous section addressed random problems of small and moderate size in order to get some idea of the amount of extra work (in terms of the number of LPs) required to solve the model. In this section, we focus on a 'real world' set of data in order to get some idea of the accuracy of the model. We chose to use the Wisconsin Breast Cancer Database, which was available through the internet at http://www.ics.edu/AI/ML/MLDBRepository.html. This database was described in the Introduction.

The data in the Wisconsin Breast Cancer Database was added sequentially, and it is convenient to refer to the group number corresponding to when it was added (see Table 2). We note from Table 2 that the whole set consisting of all the groups one through eight is quadratically separable. From previous studies (see Mangasarian [2] and Falk and Lopez-Cardona [6]) we know that even the first group was not linearly separable. We selected the training groups as if they were becoming sequentially available, and tested them against the rest of the set. The results are tabulated in Table 3.

From Table 3 we see that it was harder to estimate the separating quadratic when there were fewer points in the training group, and therefore the accuracy of the model is not as high as when there are more points. Notice that the accuracy generally increased as the number of points used for training increased (as expected), but sometimes decreased! This is a reflection of the fact that different training sets may produce different separators. It also reflects the fact that 'accuracy' (as measured by the number of mis-classified points divided by the total number of points in the test set) is not directly related to the objective function of the model describing robust separators (maximize a minimum distance). The fact that robust quadratic separators are highly accurate for this data argues for their potential use in classification problems.
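The accuracy figure used in Table 3 (one minus the fraction of mis-classified test points) corresponds to a check of the following form; the function name and the $\pm 1$ label encoding are our own, not the paper's.

```python
import numpy as np

def classification_accuracy(M, w, gamma, X_test, y_test):
    """Score a fitted separator on held-out points: a point x is assigned
    to A when q(x) = (1/2) x^T M x + w.x - gamma >= 0, else to B.
    y_test holds +1 for A and -1 for B; returns the fraction correct."""
    q = 0.5 * np.einsum('ij,jk,ik->i', X_test, M, X_test) + X_test @ w - gamma
    pred = np.where(q >= 0, 1, -1)
    return float(np.mean(pred == y_test))
```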
Appendix A. Finding the closest point on a quadratic

We wish to determine the distance between a point $p$ and a set of the form

$$S = \{x : \tfrac12 x^T M x + x^T w - \gamma = 0\}, \tag{A.1}$$

where $M$ is an $n$ by $n$ symmetric matrix and $w$ is an $n$-vector. We will assume that this set is non-empty. The appropriate optimization problem is

$$\min_{x \in S} \; z = \|x - p\|^2. \tag{A.2}$$

This problem clearly has a solution via Weierstrass's theorem. Furthermore, by diagonalizing $M$ if necessary and changing variables, we can assume that we are dealing with a separable quadratic function $q(x) = \tfrac12 x^T M x + x^T w - \gamma$, where $M = \mathrm{diag}(M_1, \ldots, M_n)$.

A.1 The Lagrangian and the dual

One way of approaching this problem is to define the dual function and maximize it over its domain (see, e.g., Lasdon [18]). The Lagrangian of (A.2) is

$$L(x, \mu) = \|x - p\|^2 - \mu\big(\tfrac12 x^T M x + x^T w - \gamma\big),$$

which can be written as

$$L(x, \mu) = \sum_{j=1}^n \big(1 - \tfrac{\mu}{2} M_j\big) x_j^2 - \sum_{j=1}^n (2p_j + \mu w_j) x_j + \mu\gamma + \|p\|^2. \tag{A.3}$$

The dual function is $d(\mu) = \min_{x \in \mathbb{R}^n} L(x, \mu)$. It is well known that $d$ is a concave function over convex subsets of its domain, which is $D = \{\mu : L(x, \mu) \text{ attains its minimum in } x \text{ over } \mathbb{R}^n\}$. We can now explicitly write down this set. It is convenient to divide the index set $J = \{1, \ldots, n\}$ into two sets

$$J_0 = \{j \in J : M_j p_j + w_j = 0\}, \qquad J_1 = J \setminus J_0,$$

for reasons which will become apparent below. In most situations $J_0 = \emptyset$, i.e., $J_1 = J$. We term variables with indices in $J_0$ 'variables of a special type'.

Theorem 7. The domain of the dual problem is

$$D = \Big\{\mu : \max_{M_j < 0} \{2/M_j\} < \mu < \min_{M_j > 0} \{2/M_j\}\Big\},$$

unless (a) $\{k : 2/M_k = \max_{M_j < 0} \{2/M_j\}\} \subset J_0$, in which case

$$D = \Big\{\mu : \max_{M_j < 0} \{2/M_j\} \leqslant \mu < \min_{M_j > 0} \{2/M_j\}\Big\},$$

or unless (b) $\{k : 2/M_k = \min_{M_j > 0} \{2/M_j\}\} \subset J_0$, in which case

$$D = \Big\{\mu : \max_{M_j < 0} \{2/M_j\} < \mu \leqslant \min_{M_j > 0} \{2/M_j\}\Big\},$$

or unless (c) both cases (a) and (b) occur, in which case

$$D = \Big\{\mu : \max_{M_j < 0} \{2/M_j\} \leqslant \mu \leqslant \min_{M_j > 0} \{2/M_j\}\Big\}.$$

Proof. For any variable $x_j$, the appropriate part of the Lagrangian is $(1 - \tfrac{\mu}{2} M_j) x_j^2 - (2p_j + \mu w_j) x_j$. For a specific value of $\mu$, if the coefficients $(1 - \tfrac{\mu}{2} M_j)$ are positive for all $j \in J$, then $\mu \in D$. If some coefficient is negative, then $\mu \notin D$. If, however, some coefficient $(1 - \tfrac{\mu}{2} M_j)$ is zero, then $L(x, \mu)$ is minimized only if the coefficient $(2p_j + \mu w_j)$ of $x_j$ is also zero, in which case any value of $x_j$ minimizes for that value of $\mu$. In this special case $(1 - \tfrac{\mu}{2} M_j) = 0$, which implies that neither $M_j = 0$ nor $\mu = 0$; we can thus write $\mu = 2/M_j$. Since also $2p_j + \mu w_j = 0$, eliminating $\mu$ we find that $M_j p_j + w_j = 0$, i.e., $j \in J_0$. Thus the appropriate part of the Lagrangian can be rewritten as

$$\tfrac12 (2 - \mu M_j) x_j^2 - (2p_j - \mu M_j p_j) x_j = (2 - \mu M_j)\big(\tfrac12 x_j^2 - p_j x_j\big).$$

Note that the first factor is 0 for the $\mu$ under consideration. The second factor, however, is always minimized at $x_j = p_j$ for feasible $\mu$'s other than $2/M_j$. Thus, for such other $\mu$'s, the product is minimized whenever $(2 - \mu M_j) > 0$. For negative $M_j$'s (case (a)) this is true when $\mu > 2/M_j$, and for positive $M_j$'s (case (b)) this is true when $\mu < 2/M_j$.

If there are several $j$'s for which a coefficient $(1 - \tfrac{\mu}{2} M_j)$ is zero, they must all be of the special type where $M_j p_j + w_j = 0$: if even one has $2p_j + \mu w_j \neq 0$, the variable $x_j$ can be increased or decreased without bound, driving $L(x, \mu)$ to $-\infty$, and such a $\mu \notin D$.

Summarizing, $\mu \in D$ if $\mu < 2/M_j$ for all $j$ such that $M_j > 0$ and $\mu > 2/M_j$ for all $j$ such that $M_j < 0$. If $\mu = \max_{M_j < 0} \{2/M_j\}$ and all $j$'s for which this holds are of the special type, then $\mu \in D$.
If $\mu = \min_{M_j > 0} \{2/M_j\}$ and all $j$'s for which this holds are of the special type, then $\mu \in D$. □

We now characterize the concave function $d$ by finding its analytic expression. Looking at the Lagrangian (A.3), we see that for any $\mu \in D$, the point $x(\mu)$ which minimizes the Lagrangian is

$$x_j(\mu) = \frac{2p_j + \mu w_j}{2 - \mu M_j}, \tag{A.4}$$

assuming the denominator $(2 - \mu M_j)$ does not vanish. In $D$, the only place where this might happen is at an endpoint of one of the intervals in the last three cases of the above theorem; in such a case, $x_j(\mu)$ can be chosen to have any value, since it does not affect the value of the Lagrangian. Note, however, that for variables of the special type evaluated at points interior to $D$ we have

$$x_j(\mu) = \frac{2p_j + \mu w_j}{2 - \mu M_j} = \frac{2p_j - \mu M_j p_j}{2 - \mu M_j} = p_j.$$

In any case, we can now express the dual function as

$$d(\mu) = \sum_{j=1}^n \big(1 - \tfrac{\mu}{2} M_j\big) \left(\frac{2p_j + \mu w_j}{2 - \mu M_j}\right)^2 - \sum_{j=1}^n (2p_j + \mu w_j) \frac{2p_j + \mu w_j}{2 - \mu M_j} + \mu\gamma + \|p\|^2 = -\frac12 \sum_{j=1}^n \frac{(2p_j + \mu w_j)^2}{2 - \mu M_j} + \mu\gamma + \|p\|^2,$$

with the understanding that the term $(2p_j + \mu w_j)^2 / (2 - \mu M_j)$ reduces to $p_j^2 (2 - \mu M_j)$, a linear function of $\mu$, whenever $j \in J_0$. Note that, in any case, $d(\mu)$ is differentiable over its domain. Let $\mu^*$ denote its maximizing point. Any point in $x(\mu^*)$ for which $q(x) = 0$ will solve the primal problem ([18], page 427).

A.2 Small illustrative examples

Example 1. Let

$$M = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \quad w = (-1, 6), \quad p = \begin{pmatrix} 1 \\ -3 \end{pmatrix}, \quad \gamma = 1,$$

so that both $M_1 p_1 + w_1 = (1)(1) - 1 = 0$ and $M_2 p_2 + w_2 = (2)(-3) + 6 = 0$. From Theorem 7 we see that $D = \{\mu \leqslant 1\}$. The Lagrangian is

$$L(x, \mu) = (x_1 - 1)^2 + (x_2 + 3)^2 - \mu(0.5 x_1^2 + x_2^2 - x_1 + 6x_2 - 1) = (2 - \mu)\big[\tfrac12 x_1^2 - x_1\big] + (1 - \mu)\big[x_2^2 + 6x_2\big] + 10 + \mu.$$

For $\mu < 1$ the minimizing point is $x(\mu) = (1, -3)$, but at $\mu = 1$ any value of $x_2$ minimizes. Over $(-\infty, 1]$ the value of the dual function $d(\mu) = L(x(\mu), \mu)$ is $d(\mu) = 10.5\mu$, which is clearly maximized at $\mu = 1$, with optimal value 10.5. Since $x_1 = 1$, the specific value of $x_2$ must be determined so that the quadratic equation $0.5 x_1^2 + x_2^2 - x_1 + 6x_2 - 1 = 0$ is satisfied. Thus $x_2$ satisfies $x_2^2 + 6x_2 - 1.5 = 0$, i.e., either $x_2 = -6.2404$ or $x_2 = 0.24037$. Both of these values are optimal: there are two optimal solutions of this problem, as illustrated in Fig. 10.

Example 2. Let

$$M = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix}, \quad w = \begin{pmatrix} 1 \\ -2 \\ 3 \end{pmatrix}, \quad \gamma = -1, \quad p = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$

Note that the special condition $M_j p_j + w_j = 0$ occurs at component 3. We obtain

$$x(\mu) = \left(\frac{2 + \mu}{2 - 2\mu},\; 2 - \mu,\; 3\right),$$

defined over the interval $-2 \leqslant \mu < 1$. The dual function (plotted in Fig. 11) is

$$d(\mu) = -\frac12 \left(\frac{(2 + \mu)^2}{2 - 2\mu} + \frac{(4 - 2\mu)^2}{2} + \frac{(6 + 3\mu)^2}{2 + \mu}\right) - \mu + 14.$$

The maximum is attained at about the point $\mu = -1$. A more accurate approximation (obtained via Maple) is $\mu = -0.92773$, and the minimum distance turns out to be 1.1755.

A.3 Recovering the primal solution

Recovering the solution $x^*$ of the original problem is easy if

• $x_j$ is not a variable of the special type, or if
• $x_j$ is a variable of the special type but the optimal $\mu$ value is interior to $D$.

In the former case one uses Eq. (A.4), and in the latter case one sets $x_j = p_j$, as shown above. In Example 2 above, we use both to obtain

$$x(\mu) = \left(\frac{2 - 0.92773}{2 - 2(-0.92773)},\; 2 + 0.92773,\; 3\right) = (0.27812,\ 2.9277,\ 3).$$

If, however, $x_j$ is a variable of the special type and the optimal $\mu$ value is an endpoint of $D$, any value of $x_j$ will minimize the Lagrangian. We then need to satisfy the quadratic equation, and this determines the optimal value(s) of $x_j$. This was illustrated in Example 1, where $x_2$ was found to have two values satisfying the quadratic.
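As a numerical sanity check of A.2, the following short script (ours; scipy stands in for Maple) maximizes $d(\mu)$ of Example 2 over its domain and recovers $\mu^* \approx -0.9277$ and a minimum distance $\approx 1.1755$, in agreement with the values reported above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Example 2 data: M = diag(2, 0, -1), w = (1, -2, 3), gamma = -1, p = (1, 2, 3).
M = np.array([2.0, 0.0, -1.0])
w = np.array([1.0, -2.0, 3.0])
gamma, p = -1.0, np.array([1.0, 2.0, 3.0])

def d(mu):
    """Dual function d(mu) = -1/2 sum (2p_j + mu w_j)^2 / (2 - mu M_j)
       + mu*gamma + ||p||^2, valid on the domain D = [-2, 1)."""
    return (-0.5 * np.sum((2 * p + mu * w) ** 2 / (2 - mu * M))
            + mu * gamma + p @ p)

# Maximize d over (the interior of) D = [-2, 1).
res = minimize_scalar(lambda mu: -d(mu), bounds=(-2 + 1e-9, 1 - 1e-9),
                      method='bounded')
mu_star = res.x
x_star = (2 * p + mu_star * w) / (2 - mu_star * M)   # Eq. (A.4)
print(mu_star)                       # about -0.92773
print(np.linalg.norm(x_star - p))    # about 1.1755, the minimum distance
```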
References

[1] Ullman JR. Pattern recognition techniques. Crane, Russak & Company, 1973.
[2] Mangasarian OL. Linear and nonlinear separation of patterns by linear programming. Operations Research 1965;13:444–52.
[3] Mangasarian OL, Setiono R, Wolberg WH. Pattern recognition via linear programming: theory and application to medical diagnosis. In: Coleman TF, Li Y, editors. Large scale numerical optimization. Philadelphia: SIAM, 1990. p. 22–30.
[4] Mangasarian OL, Wolberg WH. Cancer diagnosis via linear programming. SIAM News 1990;23:5.
[5] Preparata FP. Computational geometry, an introduction. Paris: Springer-Verlag, 1990.
[6] Falk JE, Lopez-Cardona E. The surgical separation of sets. Journal of Global Optimization 1997:1–31.
[7] Stam A. Nontraditional approaches to statistical classification: some perspectives on Lp-norm methods. Annals of Operations Research 1997;74:1–36.
[8] Cavalier TM, Ignizio JP, Soyster AL. Discriminant analysis via mathematical programming: certain problems and their cases. Computers and Operations Research 1989;16(4):353–62.
[9] Rosen JB. Pattern separation by convex programming. Journal of Mathematical Analysis and Applications 1965;10:123–34.
[10] Duarte Silva AP, Stam A. Second order mathematical programming formulations for discriminant analysis. European Journal of Operational Research 1994;72:4–22.
[11] Freed N, Glover F. Simple but powerful goal programming formulations for the discriminant problem. European Journal of Operational Research 1981;7:44–60.
[12] Freed N, Glover F. Evaluating alternative linear programming models to solve the two-group discriminant problem. Decision Sciences 1986;17:151–62.
[13] Glover F, Keene S, Duea B. A new class of models for the discriminant problem. Decision Sciences 1988;19:269–80.
[14] Banks WJ, Abad PL. On the performance of linear programming heuristics applied on a quadratic transformation in the classification problem. European Journal of Operational Research 1994;72:23–8.
[15] Smith C. Some examples of discrimination. Annals of Eugenics 1947;13:272–82.
[16] Marlow WH. Mathematics for operations research. New York: Wiley, 1978.
[17] Blankenship JW, Falk JE. Infinitely constrained optimization problems. Journal of Optimization Theory and Applications 1976;19(2):261–81.
[18] Lasdon LS. Optimization theory for large systems. New York: Macmillan, 1970.

Karlov is currently a financial analyst with PricewaterhouseCoopers in New York, where he has been since January of 1998. Previously he was employed at Ernst and Young in Washington, DC. He received his doctorate from The George Washington University. His BS and MS degrees were granted by Moscow State University.

Falk is a member of the Engineering Management and Systems Engineering Department at The George Washington University and has also been serving on the faculty of the American University of Armenia during the summers. He has also been associated with the State University of Kiev (Ukraine), University of Nitra (Slovakia), Norwegian Institute of Technology (Trondheim, Norway), and Bogazici University (Istanbul, Turkey). His primary interests lie in modeling and non-convex optimization.
Year: 2001
DOI: 10.1016/S0305-0548(99)00134-3
Venue: Computers and Operations Research, Volume 28, Issue 11
DocType: Journal
Keywords: finite set, robust separation
Authors: James E. Falk, Vladimir E. Karlov