Informatica 38 (2014) 125-133

Optimizing the Classification Cost using SVMs with a Double Hinge Loss

Amirou Ahmed†, Ould-Abdeslam Djaffar‡ and Zidelmal Zahia†
†Mouloud Mammeri University, Tizi-Ouzou, Algeria
E-mail: a-amirou@mail.ummto.dz, djaffar.ould-abdeslam@uha.fr

Aidene Mohamed† and Merckle Jean‡
‡MIPS Laboratory, Haute Alsace University, France
E-mail: m-aidene@mail.ummto.dz

Keywords: support vector machines, double hinge loss, rejection, classification cost

Received: March 6, 2014

The objective of this study is to minimize the classification cost using a Support Vector Machine (SVM) classifier with a double hinge loss. Such binary classifiers have the option to reject observations when the cost of rejection is lower than that of misclassification. To train this classifier, the standard SVM optimization problem is modified by minimizing a double hinge loss function, considered as a surrogate convex loss function. The impact of such a classifier is illustrated on several discussed results obtained with artificial data and medical data.

Povzetek (Slovenian abstract): The optimization of the classification cost with the SVM machine learning method is presented.

1 Introduction

Support Vector Machines (SVMs) are becoming one of the most popular pattern recognition schemes due to their remarkable generalization performance, motivated by the application of the Structural Risk Minimization principle [1, 2]. Because of their good performance in terms of accuracy and generalization, SVMs are frequently used in very complex two-class classification problems. Despite the good generalization performance of support vector classifiers, misclassifications cannot be completely eliminated and can thus produce severe penalties. The expected cost of a prediction is therefore a very relevant quantity in many sensitive applications, such as medical diagnosis or industrial applications.

To improve the reliability of classifiers, new machine learning algorithms have been introduced, such as conformal prediction, which determines levels of confidence [3]. Hence, classifications with less confidence than a given threshold may be rejected. This also motivates the introduction of a reject option in classifiers, by allowing a third decision, reject, when the conditional probability that an example belongs to each class is close to 1/2. Rejecting ambiguous examples has been investigated since the publications of [4, 5] on the error-reject tradeoff. A notable attempt to integrate a reject rule in SVMs has been presented in [6]: the authors developed an SVM whose reject region is determined during the training phase, derived a novel formulation of the SVM training problem and developed a specific algorithm to solve it. Some works have proposed rejection techniques using two thresholds on the output of the SVM classifier, producing a reject region delimited by two parallel hyperplanes in the feature space [7, 8]. Other works used mixtures of classifiers [9]; this approach is computationally very expensive.

Recently, some remarkable works have proposed SVM classifiers with a reject option using a double hinge loss function [10, 11, 12, 13]. The formulation in [10, 11, 12] is restricted to symmetric losses. In [13], the authors proposed a cost-sensitive reject rule for SVMs using an asymmetric double hinge loss. This formulation is based on the probabilistic interpretation of SVMs published in [14, 15], which provides an accurate estimation of posterior probabilities.
It also generalizes those suggested in [11, 12] to arbitrary asymmetric misclassification and rejection costs. In all these classifiers, the reject region is defined during the training phase.

In this paper, we develop the training criterion for a generalized SVM with a double hinge loss and then compare the performance of symmetric and asymmetric classification. The optimal classification cost and the error-reject tradeoff are highlighted through several illustrative tests. Note that the minimal classification cost must correspond to a good error-reject tradeoff: it is desirable that most of the rejected patterns would have been erroneously classified by the ideal Bayes classifier.

The remainder of this paper is structured as follows. After the problem setting in Section 2, Section 3 recalls the Bayes rule with rejection. In Section 4, the SVM classifier with rejection is developed using the generalized double hinge loss function; the training criterion is then detailed. In Section 5, the implementation is tested empirically and results comparing the considered classifiers are shown. Finally, Section 6 briefly concludes the paper.

2 Problem setting

Let us consider a binary classification problem in which each example belongs to one of two categories. A discriminant f : X → R classifies an observation x ∈ X into one of two classes, labeled +1 or -1. Viewing f(x) as a proxy for the conditional probability P(x) = P(Y = 1|X = x), one is less confident for small values of |f(x)|, corresponding to P(x) around 1/2. The strategy used in this work is to report sgn(f(x_i)) = +1 or -1 if |f(x_i)| exceeds a threshold δ, and no decision otherwise. In binary problems, the two types of errors are:

- FP: False Positive, where an example labeled -1 is categorized in the positive class, incurring a loss C_n;
- FN: False Negative, where an example labeled +1 is categorized in the negative class, incurring a loss C_p.

We also assume that the reject decision incurs a loss, R_n and R_p for rejected examples labeled -1 and +1, respectively. This formulation corresponds to [13]. For symmetric classification [10, 11, 12], we have C_p = C_n = 1 and R_p = R_n = r with 0 < r < 1/2. The expected losses pertaining to each possible decision d are displayed in Figure 1, assuming that all costs are strictly positive. The lower risk is:

R(d) = min{ C_p P(x), C_n (1 - P(x)), R_p P(x) + R_n (1 - P(x)) }   (1)

Figure 1: Expected losses against the posterior probability P(x) (the abscissa runs from 0 to 1 and shows the thresholds P_- and P_+).

3 Bayes rule with rejection

From Figure 1, we deduce that the Bayes classifier d*, defined as the minimizer of the risk R(d), can be expressed simply using two thresholds:

P_+ = (C_n - R_n) / (C_n - R_n + R_p)   (2)

P_- = R_n / (C_p - R_p + R_n)   (3)

corresponding to the symmetric thresholds P_- = r and P_+ = 1 - r in [10, 11, 12]. According to (1), one can see in Figure 1 that rejecting a pattern is a viable option if and only if the point G is located above the segment AB, in other terms if and only if R_p/C_p + R_n/C_n < 1, corresponding to 0 < r < 1/2 in [10, 11, 12].

As the Bayes decision rule is defined by conditional probabilities, many classifiers first estimate the conditional probability P(Y = 1|X = x) and then plug this estimate into Eq. (4) to build the decision rule:

f*(x) = +1 if P(Y = 1|X = x) > P_+,
        -1 if P(Y = 1|X = x) < P_-,   (4)
        0 (reject) otherwise,

where P(x) denotes P(Y = 1|X = x) and f*(x) corresponds to the decision d*, the minimizer of the risk (1).
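As a concrete illustration of the plug-in rule (2)-(4), the short sketch below computes the two thresholds from the four costs and applies the decision rule to estimated posterior probabilities. It is a minimal sketch with function names of our own choosing, not the authors' implementation.

```python
import numpy as np

def reject_thresholds(Cp, Cn, Rp, Rn):
    """Thresholds P+ and P- of Eqs. (2)-(3), derived from the risk (1)."""
    P_plus = (Cn - Rn) / (Cn - Rn + Rp)
    P_minus = Rn / (Cp - Rp + Rn)
    return P_plus, P_minus

def bayes_decision(P, Cp, Cn, Rp, Rn):
    """Bayes rule with rejection, Eq. (4): +1, -1 or 0 (reject)."""
    P_plus, P_minus = reject_thresholds(Cp, Cn, Rp, Rn)
    P = np.asarray(P, dtype=float)
    d = np.zeros_like(P)              # 0 encodes the reject decision
    d[P > P_plus] = 1.0
    d[P < P_minus] = -1.0
    return d

# Symmetric example: Cp = Cn = 1 and Rp = Rn = r = 0.2 give P- = 0.2, P+ = 0.8.
print(reject_thresholds(1.0, 1.0, 0.2, 0.2))
print(bayes_decision([0.1, 0.5, 0.9], 1.0, 1.0, 0.2, 0.2))
```

In the symmetric case the rejection band is centered on P(x) = 1/2, while asymmetric costs shift and widen it, as discussed in the following sections.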
4 SVM classifier with a reject option (SVMR)

Since minimizing the empirical counterpart of the risk (1) is computationally not feasible, one replaces it by a surrogate loss function. The most popular surrogates are the hinge loss, motivated by [1] and leading to sparse solutions [13, 12], and the logistic regression model, which offers the ability to estimate the posterior probability P(Y = 1|X = x) = 1/(1 + exp(-y f(x))) and hence a good choice of the thresholds. In this study, P(Y = 1|X = x) has to be accurate only in the neighborhood of P_+ and P_- (see Eq. (4)).

4.1 Double hinge loss

The generalized double hinge loss introduced in [13] is a convex, piecewise-linear loss function that is tangent to the negative log-likelihood loss at δ_+ = log(P_+/(1 - P_+)) and at δ_- = log(P_-/(1 - P_-)) (see Figure 2). This proposal retains the advantages of both loss functions mentioned above: the sparsity of the hinge loss and the ability of the negative log-likelihood loss to estimate the posterior probabilities P_+ and P_- at the tangency points δ_+ and δ_-, respectively. The decision is then expressed as:

+1 if f(x) > δ_+,   -1 if f(x) < δ_-,   reject otherwise.   (5)

These thresholds are symmetric in [10, 11, 12], δ_+ = -δ_- = δ_0, and the recommended value of δ_0 belongs to the interval [r, 1 - r].

Figure 2: Double hinge loss function for positive examples, with P_- = 0.35 and P_+ = 0.6 (solid: double hinge, dashed: likelihood).

To express the generalized double hinge function [13], we first consider the standard logistic regression procedure, where φ is the negative log-likelihood loss:

φ(y, f(x)) = log(1 + exp(-y f(x)))   (6)

that is, φ(+1, f(x)) = log(1 + exp(-f(x))) for positive examples and φ(-1, f(x)) = log(1 + exp(f(x))) for negative examples.

Figure 3: Double hinge loss function for positive examples, with P_- = 0.4 and P_+ = 0.7 (solid: double hinge, red dashed: likelihood).

Let us work on Figure 3, corresponding to positive examples (y = +1). W = a_1 f(x) + g_1 is the first linear piece (from right to left) of W(+1, f(x)), where

a_1 = d[φ(+1, f(x))]/d[f(x)] at f(x) = δ_+ = -(1 - P_+).

At the tangency point we have φ(+1, f(x)) = W(+1, f(x)), hence

g_1 = -P_+ log(P_+) - (1 - P_+) log(1 - P_+) = H(P_+).

The second linear piece of W(+1, f(x)) is W = a_2 f(x) + g_2, where

a_2 = d[φ(+1, f(x))]/d[f(x)] at f(x) = δ_- = -(1 - P_-)

and g_2 = -P_- log(P_-) - (1 - P_-) log(1 - P_-) = H(P_-).

For a_1 f(x) + g_1 = 0, we have f(x) = τ_+ = H(P_+)/(1 - P_+), and for a_1 f(x) + g_1 = a_2 f(x) + g_2, we have f(x) = p_+ = (H(P_-) - H(P_+))/(P_+ - P_-). The double hinge function for positive examples is then expressed as:

W(+1, f(x)) = -(1 - P_-) f(x) + H(P_-)   if f(x) < p_+,
              -(1 - P_+) f(x) + H(P_+)   if p_+ ≤ f(x) < τ_+,   (7)
              0                          otherwise.

Similarly, for negative examples:

W(-1, f(x)) = P_+ f(x) + H(P_+)   if f(x) > p_-,
              P_- f(x) + H(P_-)   if τ_- < f(x) ≤ p_-,   (8)
              0                   otherwise,

where τ_- = -H(P_-)/P_- and p_- = p_+ = p.
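As a numerical illustration of (7)-(8), the sketch below evaluates the generalized double hinge loss by taking the pointwise maximum of the two tangent lines and zero, which is equivalent to the piecewise form above. This is a minimal sketch under our own naming conventions, not the authors' implementation.

```python
import numpy as np

def H(p):
    """Binary entropy H(p) = -p log(p) - (1-p) log(1-p) (natural logarithm)."""
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def double_hinge(y, f, P_plus, P_minus):
    """Generalized double hinge W(y, f(x)) of Eqs. (7)-(8).

    The loss is the pointwise maximum of the two lines tangent to the
    negative log-likelihood at delta_+ and delta_-, and of zero."""
    f = np.asarray(f, dtype=float)
    if y == +1:
        line_minus = -(1.0 - P_minus) * f + H(P_minus)   # tangent at delta_-
        line_plus = -(1.0 - P_plus) * f + H(P_plus)      # tangent at delta_+
    else:
        line_minus = P_minus * f + H(P_minus)
        line_plus = P_plus * f + H(P_plus)
    return np.maximum(np.maximum(line_minus, line_plus), 0.0)

# Evaluate the loss of Figure 2 (P- = 0.35, P+ = 0.6) on a small grid.
f_grid = np.linspace(-4.0, 4.0, 9)
print(double_hinge(+1, f_grid, P_plus=0.6, P_minus=0.35))
```

Writing the loss as a maximum of affine functions makes its convexity explicit.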
The double hinge loss φ_r introduced in [10, 11, 12] is a scaled version of the loss W. It is given by:

φ_r(y f(x)) = 1 - ((1 - r)/r) y f(x)   if y f(x) < 0,
              1 - y f(x)               if 0 ≤ y f(x) < 1,   (9)
              0                        otherwise,

hence φ_r(y f(x)) = (1/H(r)) W(y, (H(r)/r) f(x)), where H(r) = H(P_-) and H(P_-) = H(P_+) in the symmetric case. Note that minimizing φ_r(y f(x)) or W leads to equivalent solutions for f. When minimizing φ_r(y f(x)), the decision rule recommended by [11] classifies an example when |f(x)| > δ_0 = 1/2, while in [13] an example is classified when |f(x)| > (r/H(r)) log((1 - r)/r). The latter decision rule rejects more examples when the loss incurred by rejection is small and fewer examples otherwise. The two rules are identical for r = 0.24.

4.2 Training criterion

As in standard SVMs, we consider the regularized empirical risk on the training sample. Introducing the double hinge loss (7)-(8) results in an optimization problem that is similar to the standard SVM problem.

4.2.1 Primal problem

Let C_0 be a constant to be tuned by cross-validation. We define D = C_0 (P_+ - P_-), B_i = C_0 (1 - P_+) for positive examples and B_i = C_0 P_- for negative examples, and τ_i = τ_+ for positive examples and τ_i = τ_- for negative examples. The primal optimization problem reads:

min_{f,b}  (1/2) ||f||²_H + Σ_{i=1}^n B_i |τ_i - y_i(f(x_i) + b)|_+ + D Σ_{i=1}^n |p - y_i(f(x_i) + b)|_+   (10)

where |·|_+ = max(·, 0) and the (squared) norm of f is a regularization functional in a suitable Hilbert space. The primal problem (10) is best seen with the introduction of the slack variables ξ and η shown in Figure 3:

min_{f,b,ξ,η}  (1/2) ||f||²_H + Σ_{i=1}^n B_i ξ_i + D Σ_{i=1}^n η_i
s.t.  y_i(f(x_i) + b) ≥ τ_i - ξ_i,  i = 1, ..., n   (11)
      y_i(f(x_i) + b) ≥ p - η_i,   i = 1, ..., n
      ξ_i ≥ 0,  η_i ≥ 0,  i = 1, ..., n.

4.2.2 Dual problem

The Lagrangian of (11) is given by:

L(f, b, ξ, η, α, λ, u, w) = (1/2) ||f||²_H + Σ_{i=1}^n B_i ξ_i + D Σ_{i=1}^n η_i
  - Σ_{i=1}^n α_i [y_i(f(x_i) + b) - τ_i + ξ_i] - Σ_{i=1}^n λ_i [y_i(f(x_i) + b) - p + η_i]
  - Σ_{i=1}^n u_i ξ_i - Σ_{i=1}^n w_i η_i   (12)

with u_i ≥ 0, w_i ≥ 0, α_i ≥ 0, λ_i ≥ 0 for i = 1, ..., n. The Kuhn-Tucker conditions imply:

∂L/∂b = 0  ⇒  Σ_{i=1}^n (α_i + λ_i) y_i = 0
∂L/∂f = 0  ⇒  f(·) = Σ_{i=1}^n (α_i + λ_i) y_i k(·, x_i)   (13)
∂L/∂ξ_i = 0  ⇒  B_i - u_i - α_i = 0  ⇒  0 ≤ α_i ≤ B_i
∂L/∂η_i = 0  ⇒  D - w_i - λ_i = 0  ⇒  0 ≤ λ_i ≤ D

Thanks to these conditions, we can eliminate f, ξ and η from the Lagrangian:

max_{α,λ}  -(1/2) (α + λ)^T G (α + λ) + τ^T α + p^T λ
s.t.  y^T (α + λ) = 0,   (14)
      0 ≤ α_i ≤ B_i,  i = 1, ..., n,
      0 ≤ λ_i ≤ D,  i = 1, ..., n,

where τ = (τ_1, ..., τ_n)^T and p = (p_1, ..., p_n)^T are the threshold vectors of R^n, G is the n × n influence matrix with general term G_ij = y_i y_j k(x_i, x_j), and k(·, ·) is the reproducing kernel of the Hilbert space H. Let γ = α + λ; the problem (14) can be rephrased as:

max_{α,γ}  -(1/2) γ^T G γ + (τ - p)^T α + p^T γ
s.t.  y^T γ = 0,   (15)
      0 ≤ α_i ≤ B_i,  i = 1, ..., n,
      0 ≤ γ_i - α_i ≤ D,  i = 1, ..., n.

The problem (15) is a quadratic problem under box constraints. Compared to the standard SVM dual problem, one has an additional vector to optimize, but we will show that α is easily recovered from γ.
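To make the parameterization of (10) and (14)-(15) concrete, the following sketch builds the per-example quantities B_i and τ_i, the hinge location p, the cost D and the influence matrix G for a toy dataset. The Gaussian kernel and all names are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np

def H(p):
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def rbf_kernel(X, sigma=1.0):
    """Gaussian kernel matrix k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svmr_parameters(X, y, C0, P_plus, P_minus, sigma=1.0):
    """Per-example parameters of problems (10) and (14)-(15)."""
    D = C0 * (P_plus - P_minus)
    B = np.where(y == +1, C0 * (1.0 - P_plus), C0 * P_minus)
    tau = np.where(y == +1,
                   H(P_plus) / (1.0 - P_plus),    # tau_+ for positive examples
                   -H(P_minus) / P_minus)         # tau_- for negative examples
    p = (H(P_minus) - H(P_plus)) / (P_plus - P_minus)
    G = (y[:, None] * y[None, :]) * rbf_kernel(X, sigma)   # G_ij = y_i y_j k(x_i, x_j)
    return B, tau, p, D, G

X = np.random.randn(6, 2)
y = np.array([1, 1, -1, -1, 1, -1])
B, tau, p, D, G = svmr_parameters(X, y, C0=1.0, P_plus=0.6, P_minus=0.35)
```

These quantities are exactly what the dual consumes: B and D bound α and γ - α, while τ, p and G define the linear and quadratic terms of (14)-(15).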
4.2.3 Solving the problem

To solve the dual (15), the strategy used in the active set method [17] is considered. Firstly, the training set is partitioned into support and non-support vectors and the training criterion is optimized for this partition. This optimization then results in an updated partition of the examples into support and non-support vectors. These two steps are iterated until a predefined level of accuracy is reached. Table 1 shows how the training set is partitioned into five subsets defined by the constraints in (15).

Table 1: Partitioning the training set.
I_0  saturated part of the loss   I_0 = {i | y_i(f(x_i) + b) > τ_i}
I_T  first hinge of the loss      I_T = {i | y_i(f(x_i) + b) = τ_i}
I_B  first slope of the loss      I_B = {i | p < y_i(f(x_i) + b) < τ_i}
I_P  second hinge of the loss     I_P = {i | y_i(f(x_i) + b) = p}
I_D  second slope of the loss     I_D = {i | y_i(f(x_i) + b) < p}

The membership of example i in one of these subsets has the following consequences on the dual variables (α, γ):

i ∈ I_0:  α_i = 0,          γ_i = 0
i ∈ I_T:  0 ≤ α_i ≤ B_i,    γ_i = α_i
i ∈ I_B:  α_i = B_i,        γ_i = B_i   (16)
i ∈ I_P:  α_i = B_i,        B_i ≤ γ_i ≤ B_i + D
i ∈ I_D:  α_i = B_i,        γ_i = B_i + D

Hence, provided that the partitioning is known, α has to be computed only for i ∈ I_T ∪ I_P; furthermore, α_i is either constant or equal to γ_i. Assuming that the examples are correctly partitioned, problem (15) can thus be solved by considering a considerably smaller problem, namely that of computing γ_i for i ∈ I_T ∪ I_P. Let I_c = {I_B, I_D} and I_h = {I_T, I_P}. The problem (15) becomes:

min_γ  L(γ) = (1/2) γ^T G γ - s^T γ
s.t.  y^T γ = 0,   (17)
      0 ≤ γ_i ≤ C_i,  i = 1, ..., n,  with C_i = B_i + D.

The relation between the parameters of this formulation and the initial parameters of problem (11) can be obtained by forming the Lagrangian of the dual (17):

L(γ, b, ν, μ) = (1/2) γ^T G γ - s^T γ + b y^T γ - ν^T γ + μ^T (γ - C 1_n)   (18)

where the Lagrange multipliers ν and μ must be positive or null and 1_n is a vector of ones. This Lagrangian can be compared with the Lagrangian of the primal (11), reformulated with γ = α + λ as:

L = (1/2) ||f||²_H - Σ_{i=1}^n γ_i y_i f(x_i) - b Σ_{i=1}^n γ_i y_i + Σ_{i=1}^n (τ_i - p) α_i + p Σ_{i=1}^n γ_i
    + Σ_{i=1}^n ξ_i (B_i - α_i - u_i) + Σ_{i=1}^n η_i (D - λ_i - w_i)   (19)

By replacing the variable f by γ through (13), the Lagrangian (19) becomes:

L(γ, b, ξ, η) = -(1/2) γ^T G γ - b γ^T y + (τ - p)^T α + p^T γ + ξ^T (B - α - u) + η^T (D 1_n - λ - w)   (20)

To reveal the relations between the primal and dual variables, we check the KKT conditions stipulating the cancellation of the gradient of the Lagrangian (20) with respect to the primal variables in the different subsets. Table 2 describes the properties of each set regarding the original variables and the Lagrange multipliers.

Table 2: Situation of the constraints for the five types of examples.
Set  | Initial constraints        | Primal constraints   | Dual constraints
I_0  | y_i[f(x_i) + b] > τ_i      | ξ_i = η_i = 0        | γ_i = 0,  ν_i = (Gγ + by - τ)_i
I_T  | y_i[f(x_i) + b] = τ_i      | ξ_i = η_i = 0        | ν_i = 0,  (Gγ + by - τ)_i = 0
I_B  | p < y_i[f(x_i) + b] < τ_i  | ξ_i > 0,  η_i = 0    | ν_i = 0,  ξ_i = (-Gγ - by + τ)_i
I_P  | y_i[f(x_i) + b] = p        | ξ_i > 0,  η_i = 0    | ν_i = 0,  (Gγ + by - p)_i = 0
I_D  | y_i[f(x_i) + b] < p        | ξ_i > 0,  η_i > 0    | ν_i = 0,  η_i = (-Gγ - by + p)_i

4.2.4 Algorithm

Let us assume the repartition into the sets I_0, I_h and I_c to be known. Only the values of γ belonging to I_h remain unknown; they are given by the solution of the following optimization problem, whose dimension is lower than the initial one. Slightly abusing notation, we define γ_h = γ(I_h), y_h = y(I_h), G_hh = G(I_h, I_h),

cc = Σ_{i ∈ I_B} B_i y_i + Σ_{i ∈ I_D} C_i y_i

and s_h = (τ(I_T)^T, p(I_P)^T)^T - G(I_h, I_c) (B(I_B)^T, C(I_D)^T)^T. The problem (17) becomes:

min_{γ_h}  L(γ_h) = (1/2) γ_h^T G_hh γ_h - s_h^T γ_h
s.t.  y_h^T γ_h + cc = 0.   (21)

The Kuhn-Tucker conditions give the system to be solved to find the values of γ that are still unknown:

G_hh γ_h = s_h - b y_h,   y_h^T γ_h = -cc.   (22)

After solving this system, any component of γ violating the primal or dual constraints must be moved to the suitable set. The process is iterated until all box constraints are satisfied. During the learning process, the time-consuming step is the resolution of the linear system (22).
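The following sketch shows the elementary step of the algorithm: solving the system (22) for the free variables γ_h and the offset b jointly, once the partition is assumed known. Reading (22) as the stationarity condition of (21) plus its equality constraint is our interpretation; the function and variable names are ours, not the authors'.

```python
import numpy as np

def solve_free_variables(G_hh, s_h, y_h, cc):
    """Solve the system (22): G_hh gamma_h + b y_h = s_h and y_h^T gamma_h = -cc.

    Both conditions are stacked into one augmented linear system in
    (gamma_h, b); G_hh is assumed non-singular on the current partition."""
    n_h = len(s_h)
    A = np.zeros((n_h + 1, n_h + 1))
    A[:n_h, :n_h] = G_hh
    A[:n_h, n_h] = y_h
    A[n_h, :n_h] = y_h
    rhs = np.concatenate([s_h, [-cc]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n_h], sol[n_h]        # gamma_h and the offset b

# Tiny illustration with an arbitrary positive-definite G_hh.
G_hh = np.array([[2.0, 0.5], [0.5, 1.0]])
gamma_h, b = solve_free_variables(G_hh, s_h=np.array([1.0, 0.5]),
                                  y_h=np.array([1.0, -1.0]), cc=0.0)
print(gamma_h, b)
```

A component of γ_h that leaves its box [0, C_i] after the solve is then clamped and moved to the suitable set, exactly as described above.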
For the resolution of (22), we use the incremental strategy outlined in [18], whose complexity is close to O(n²). The computational complexity of the presented SVMR is thus comparable to that of the standard SVM [18]; the only computational overhead is that the presented SVMR handles five categories of examples while the SVM uses three.

5 Results and discussion

Data: To evaluate the performance of the SVMR classifiers, three types of data have been used:

- Synthetic data generated as a classical dataset with two Gaussian classes with similar variances but different means, chosen to create many ambiguous examples.
- As medical decision making is an application domain for which rejection is of primary importance, data related to medical problems are considered. ElectroCardioGram (ECG) records from www.physionet.org/physiobank/database/mitdb are used. Each tape is accompanied by an annotation file in which the ECG beats have been labeled by expert cardiologists. Since this study evaluates the performance of a binary classifier with a reject option, we followed the AAMI recommended practice [19] to form two heartbeat classes: (i) the positive class representing the ventricular ectopic beats (V); (ii) the negative class representing the normal beats (N), including normal beats, Left Bundle Branch Block beats (LBBB) and Right Bundle Branch Block beats (RBBB). In agreement with [19], the records containing paced beats (102, 104, 107, 217) and 23 records with no V beats or fewer than 40 V beats were excluded, leaving 21 records of interest. Each beat is represented by a 7-feature vector; the feature extraction is described in [20].
- For experiments with large data, the forest CoverType database from the UCI repository was also used (http://kdd.ics.uci.edu/databases/covertype/). We consider the subproblem of discriminating the positive class Cottonwood (2747 examples) against the negative class Douglas-fir (17367 examples).

Tests: The first series of experiments is done with the ECG data to illustrate the effectiveness of classification with rejection. We selected records 214 and 221, containing together 3546 N beats and 652 V beats. As no cost matrix is provided with these data, we assume that R_p = R_n = r as in [10, 11, 12] and P_+ = 1 - θ P_-, where θ = C_p/C_n; θ = 1 in [10, 11, 12] and θ > 1 in [13]. Often in practice, especially in medical applications, FN errors are more costly than FP errors (θ > 1).

Figures 4 and 5 show an example of the reject region produced by the SVMR classifier for θ = 1 and for θ > 1, respectively. In Figure 5, the SVMR classifier encourages the rejection of more FN examples because they are more costly than FP examples.

Figure 4: Scatter plot showing the reject region induced by the reject thresholds corresponding to the costs of misclassifying and rejecting samples. Positive cases are represented by black asterisks and negative cases by red circles. The lines +0.5 and -0.5 correspond respectively to δ_0 and -δ_0, and the line 0 corresponds to f(x) = 0, i.e., P(Y = 1|X = x) = 0.5.

Figure 5: Scatter plot showing the reject region induced by the reject thresholds corresponding to the costs of misclassifying and rejecting samples. Positive cases are represented by black asterisks and negative cases by red circles. The lines +0.32277 and -0.8473 correspond respectively to δ_+ and δ_-, and the line 0 corresponds to f(x) = 0, i.e., P(Y = 1|X = x) = 0.5.
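The reject bands of Figures 4 and 5 follow directly from the thresholds δ_± = log(P_±/(1 - P_±)). Under the cost assumption above (R_p = R_n = r, C_n = 1, C_p = θ), Eqs. (2)-(3) give P_+ = 1 - r and P_- = r/θ, so the band can be computed as in the sketch below; the values of r and θ used in the calls are illustrative, not those of the experiments.

```python
import numpy as np

def reject_band(r, theta):
    """Reject-region boundaries (delta_-, delta_+) on f(x), assuming
    Rp = Rn = r, Cn = 1 and Cp = theta, so that P+ = 1 - r and P- = r / theta."""
    P_plus, P_minus = 1.0 - r, r / theta
    log_odds = lambda P: np.log(P / (1.0 - P))
    return log_odds(P_minus), log_odds(P_plus)

print(reject_band(r=0.2, theta=1.0))   # symmetric band (-log 4, +log 4)
print(reject_band(r=0.2, theta=1.5))   # delta_- moves further down: more would-be
                                       # negative decisions (potential FN) are rejected
```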
All previous comparative studies of classifiers have been based on the error rates obtained, but the error rate is not the only measure that can be used to judge a classifier's performance. In many applications, the classification cost is a parameter that must be considered, since Bayes classifiers with or without rejection aim to minimize the classification cost. For illustration, we compare the reject rates obtained with the SVMR classifiers proposed in [10, 11, 12], where the reject threshold δ_0 ∈ [r, 1 - r], and with the SVMR classifier proposed in [13], where the reject thresholds are δ_+ = log(P_+/(1 - P_+)) and δ_- = log(P_-/(1 - P_-)) for positive and negative examples, respectively. For this purpose, we consider symmetric classification, P_+ = 1 - P_-.

Figures 6 and 7, obtained with synthetic data and with ECG data (records 214 and 221), show that in all cases the decision rule of [13] rejects fewer examples when the loss incurred by rejection is high and more examples otherwise. The rule of [10, 11, 12] with δ_0 = 1 - r, the largest value of δ_0, rejects more examples for all reject costs. For δ_0 = r, this rule rejects less frequently; when r is close to zero, it almost never rejects. For the middle value δ_0 = 0.5, seen as a compromise between r and 1 - r, the rule of [10, 11, 12] and the one proposed in [13] are identical at r = 0.24.

Figure 6: Comparison of the reject rate versus the reject cost r obtained with the SVMR of [13] (black curves) and with the SVMR introduced in [10, 11, 12] (red curves), on synthetic data.

Figure 7: Comparison of the reject rate versus the reject cost r obtained with the SVMR of [13] (black curves) and with the SVMR introduced in [10, 11, 12] (red curves), on medical data.

As pointed out in [5], the advantage of classifying with rejection can be judged by the error-reject tradeoff. Since the error rate E and the reject rate R are monotonic functions of r, we compute the tradeoff of E versus R from E(r) and R(r) when r varies between 0.12 and 0.5, with the threshold δ_0 = 0.5 recommended in [11, 12]. Figure 8 shows the error-reject tradeoff for the rule proposed in [13] (black curves) and for the rule proposed in [10, 11, 12] (red curves). The obtained results differ because of the size of the rejection region induced by each rule.

Figure 8: Error versus reject tradeoff obtained using synthetic data (a), ECG data (b) and forest data (c), with [13] (black curves) and with [10, 11, 12] using δ_0 = 0.5 (red curves).

From these results, we can derive another interesting parameter, the error-reject ratio defined in [5], i.e., the slope ΔE/ΔR (dashed lines). For high reject costs (0.4 < r < 0.5), the rule of [13] yields an error-reject ratio of -0.58, -0.84 and -0.42 for the synthetic data, the ECG data and the forest data, respectively. This means that 58%, 84% and 42% of the rejected patterns, respectively, would have been erroneously classified. Using the rule proposed in [10, 11, 12] with δ_0 = 0.5, the error-reject ratios obtained are -0.15 for the synthetic data, -0.23 for the ECG data and -0.22 for the forest data. This means that only 15%, 23% and 22% of the rejected patterns, respectively, would have been erroneously classified.
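For completeness, the sketch below computes the empirical error rate E, the reject rate R and a finite-difference estimate of the error-reject ratio ΔE/ΔR from decisions taking values in {-1, 0, +1} (0 standing for rejection). It is a minimal illustration with our own naming, not the evaluation code used for the figures.

```python
import numpy as np

def error_reject_rates(y_true, decisions):
    """Error rate E and reject rate R; decisions are in {-1, 0, +1}, 0 = reject."""
    n = len(y_true)
    rejected = decisions == 0
    errors = (~rejected) & (decisions != y_true)
    return errors.sum() / n, rejected.sum() / n

def error_reject_ratio(E_values, R_values):
    """Finite-difference slope dE/dR along the tradeoff curve (R must vary)."""
    return np.diff(E_values) / np.diff(R_values)

y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
decisions = np.array([1, 0, -1, 1, 0, -1, 1, -1])
print(error_reject_rates(y_true, decisions))   # -> (0.125, 0.25)
```

Sweeping the reject cost r, collecting E(r) and R(r) and applying error_reject_ratio gives a discrete estimate of the dashed slope curves discussed above.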
Hence, it is clear that the rule of [13] should lead to a better classification cost.

The last series of tests was carried out using all the selected ECG records; the mean results obtained are reported in Figures 9 and 10. The error rate decreases with the reject rate until it reaches a quasi-constant level (Figure 9). Another interesting plot in the same figure represents the error-reject ratio; its inflection point is interesting since it indicates the largest variation of the error against the variation of the reject rate.

Figure 9: Top: error rate vs. reject rate. Bottom: variation of the error rate against the variation of the reject rate.

Two statistical parameters are also used to highlight the performance of the reject rule of [13]. The sensitivity and the positive predictivity are computed by

Se = TP/(TP + FN),   Pp = TP/(TP + FP)

where the True Positives (TP) are the samples labeled +1 that are categorized in the positive class. Figure 10 (top) indicates the variation of the classification cost, given by

C_c = [C_p FN + C_n FP + r R_rej] / N_tot   (23)

where R_rej is the number of rejected patterns and N_tot the total number of examples. The same figure shows that the optimal classification cost C_c corresponds to a good error-reject tradeoff (see Figure 9). Figure 10 (bottom) shows that the positive predictivity is close to 99.8%. In the same figure, it is shown that we obtained more than 98.2% sensitivity with no rejection and more than 99% sensitivity for the minimal classification cost with rejection, considering R_p = R_n = r, C_n = 1 and θ = 1.2. It is also clearly shown that the optimal classification cost is not obtained for r = 0.5 (simple Bayes rule) but for a rejection rate equal to 1.8%.

Figure 10: Top: classification cost against reject rate. Bottom: sensitivity (black curve) and positive predictivity (red curve) against reject rate.

In any application, one must choose the error rate and the rejection rate corresponding to the minimal classification cost; this is the goal of using a cost-sensitive classifier. For a better appreciation of such reject schemes, it would be desirable to perform tests on data accompanied by a real cost matrix. Even though the considered classifier is based on a sparse probabilistic interpretation of the SVM, providing an accurate estimation of the posterior probabilities, it would also be interesting to assign a confidence value to each classification. This can be done by introducing conformal prediction, whose relationship with rejection is clearly relevant, whether the rejection is related to the ambiguity of examples or to their atypical character.
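To recap the quantities used in this section, the sketch below computes the sensitivity, the positive predictivity and the classification cost of Eq. (23) from confusion counts. The counts and costs in the example call are purely illustrative, not results from the paper.

```python
def classification_metrics(TP, FP, FN, TN, n_reject, Cp, Cn, r):
    """Sensitivity Se, positive predictivity Pp and classification cost Cc (Eq. 23)."""
    Se = TP / (TP + FN)
    Pp = TP / (TP + FP)
    N_tot = TP + FP + FN + TN + n_reject      # all examples, rejected ones included
    Cc = (Cp * FN + Cn * FP + r * n_reject) / N_tot
    return Se, Pp, Cc

# Illustrative counts only (theta = Cp/Cn = 1.2, reject cost r = 0.2).
print(classification_metrics(TP=640, FP=2, FN=6, TN=3500, n_reject=50,
                             Cp=1.2, Cn=1.0, r=0.2))
```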
6 Conclusion

This paper presents a cost-sensitive reject rule for SVMs using a double hinge loss. The solution, inspired by the probabilistic interpretation of the SVM, combines the advantage of the hinge loss function, which leads to a consistent solution, with the advantage of the negative log-likelihood loss, which allows a good estimation of the posterior probabilities in the vicinity of the decision thresholds. Note that these dynamic reject thresholds follow the cost of rejecting a sample and the cost of misclassifying a sample. This viewpoint aims at minimizing the classification cost.

A possible improvement of this study is to estimate the level of confidence of the classifier by introducing conformal prediction. This would be a crucial advantage, especially for medical applications, where the risk of clinical errors may be controlled by an acceptable level of confidence for a given decision.

References

[1] V. N. Vapnik (1995) The Nature of Statistical Learning Theory, Springer Series in Statistics.
[2] N. Cristianini and J. Shawe-Taylor (2000) An Introduction to Support Vector Machines, Cambridge University Press.
[3] G. Shafer and V. Vovk (2008) A tutorial on conformal prediction, Journal of Machine Learning Research, 9, pp. 371-421.
[4] C. K. Chow (1957) An optimum character recognition system using decision functions, IRE Trans. on Electronic Computers, EC-6(4), pp. 247-254.
[5] C. K. Chow (1970) On optimum recognition error and reject tradeoff, IEEE Trans. on Information Theory, 16, pp. 41-46.
[6] G. Fumera and F. Roli (2002) Support vector machines with embedded reject option, in S.-W. Lee and A. Verri, editors, Pattern Recognition with Support Vector Machines: First International Workshop, volume 2388 of Lecture Notes in Computer Science, Springer, pp. 68-82.
[7] J. T. Kwok (1999) Moderating the outputs of support vector machine classifiers, IEEE Trans. on Neural Networks, 10(5), pp. 1018-1031.
[8] F. Tortorella (2004) Reducing the classification cost of support vector classifiers through an ROC-based reject rule, Pattern Analysis and Applications, 7(2), pp. 128-143.
[9] A. Bounsiar, E. Grall and P. Beauseroy (2007) A kernel based rejection method for supervised classification, International Journal of Computational Intelligence, 3(4), pp. 312-321.
[10] R. Herbei and M. H. Wegkamp (2006) Classification with reject option, The Canadian Journal of Statistics, 34(4), pp. 709-721.
[11] P. L. Bartlett and M. H. Wegkamp (2008) Classification with a reject option using a hinge loss, Journal of Machine Learning Research, 9, pp. 1823-1840.
[12] M. Wegkamp and M. Yuan (2010) Classification methods with reject option based on convex risk minimization, Journal of Machine Learning Research, 11, pp. 111-130.
[13] Y. Grandvalet, A. Rakotomamonjy, J. Keshet and S. Canu (2009) Support vector machines with a reject option, Advances in Neural Information Processing Systems, 21, pp. 537-544.
[14] Y. Grandvalet, J. Mariethoz and S. Bengio (2006) A probabilistic interpretation of SVMs with an application to unbalanced classification, in Y. Weiss, B. Scholkopf and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, MIT Press, pp. 467-474.
[15] P. L. Bartlett and A. Tewari (2007) Sparseness vs estimating conditional probabilities: some asymptotic results, Journal of Machine Learning Research, 8, pp. 775-790.
[16] Z. En-hui, Z. Chao, S. Jian and L. Chen (2011) Cost-sensitive SVM with error cost and class-dependent reject cost, International Journal of Computer Theory and Engineering, 3(1).
[17] S. V. N. Vishwanathan, A. Smola and N. Murty (2003) SimpleSVM, in T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, AAAI, pp. 68-82.
[18] G. Loosli, S. Canu, S. Vishwanathan and M. Chattopadhay (2005) Boîte à outils SVM simple et rapide, Revue d'Intelligence Artificielle, 19(4), pp. 741-767.
[19] R. Mark and R. Wallen (1987) AAMI-recommended practice: testing and reporting performance results of ventricular arrhythmia detection algorithms, Association for the Advancement of Medical Instrumentation, Tech. Rep. AAMI ECAR.
[20] Z. Zidelmal, A. Amirou and A. Belouchrani (2012) Heartbeat classification using Support Vector Machines (SVMs) with an embedded reject option, International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 26(1), DOI: 10.1142/S0218001412500012.