A Duality Gaps with Fenchel Duality
This section is taken from Appendix D of Julien Mairal's PhD thesis [19].
We will make intensive use of Fenchel duality (see [2]).
Let us consider the problem

  min_{w ∈ ℝ^{p}} [g(w) ≜ f(w) + λψ(w)].   (47)

We first notice that for all the formulations we have been
interested in, g(w) can be rewritten

  g(w) = f(X^{⊤}w) + λψ(w),   (48)

where X = [x^{1},…,x^{n}] is the matrix of training vectors, f is
(with a slight abuse of notation) an appropriate smooth real-valued function on ℝ^{n},
and ψ is one of the regularization functions we have introduced.
Given a primal variable w in ℝ^{p} and a dual variable κ in
ℝ^{n}, we obtain using classical Fenchel duality rules [2],
that the following quantity is a duality gap for problem (47):
  δ(w,κ) ≜ g(w) + f^{∗}(κ) + λψ^{∗}(−Xκ/λ),

where f^{∗} and ψ^{∗} are respectively the Fenchel conjugates
of f and ψ. Denoting by w^{⋆} the solution of
Eq. (47), the duality gap is interesting in the sense that
it upper-bounds the difference with the optimal value of the function:

  δ(w,κ) ≥ g(w) − g(w^{⋆}) ≥ 0.

Similarly, we will consider pairs of primaldual variables (W,K) when
dealing with matrices.
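The gap above is straightforward to evaluate numerically once f, f^{∗}, ψ and ψ^{∗} are available. As an illustration, here is a minimal Python sketch (the function and variable names are ours, and the square loss with Tikhonov regularization is only an assumed example); by weak duality, the printed gap is nonnegative for any primal-dual pair:

```python
import numpy as np

def duality_gap(w, kappa, X, lam, f, f_conj, psi, psi_conj):
    """Evaluate delta(w, kappa) = g(w) + f*(kappa) + lam * psi*(-X kappa / lam)."""
    g = f(X.T @ w) + lam * psi(w)
    return g + f_conj(kappa) + lam * psi_conj(-(X @ kappa) / lam)

# Assumed example: square loss f(z) = 0.5 ||y - z||^2, whose conjugate is
# 0.5 ||k||^2 + k^T y, with Tikhonov regularization psi(w) = 0.5 ||w||^2
# (self-conjugate).
rng = np.random.default_rng(0)
n, p, lam = 20, 5, 0.1
X = rng.standard_normal((p, n))   # columns are the training vectors x^i
y = rng.standard_normal(n)
w = rng.standard_normal(p)        # arbitrary primal variable
kappa = rng.standard_normal(n)    # arbitrary dual variable

f = lambda z: 0.5 * np.sum((y - z) ** 2)
f_conj = lambda k: 0.5 * np.sum(k ** 2) + k @ y
half_sq = lambda v: 0.5 * np.sum(v ** 2)

gap = duality_gap(w, kappa, X, lam, f, f_conj, half_sq, half_sq)
print(gap >= 0.0)   # weak duality: True for any (w, kappa)
```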
During the optimization, sequences of primal variables w are available,
and one wishes to exploit duality gaps for estimating the difference
g(w)−g(w^{⋆}). This requires the following components:

- being able to efficiently compute f^{∗} and ψ^{∗};
- being able to obtain a “good” dual variable κ given a primal
variable w, such that δ(w,κ) is close to g(w) − g(w^{⋆}).
We suppose that the first point is satisfied (we will detail these computations
for every loss and regularization function in the sequel), and explain how to
choose κ in general (details will also be given in the sequel).
Let us first consider the choice that associates with a primal variable w the
dual variable

  κ(w) ≜ ∇f(X^{⊤}w),   (49)

and let us compute δ(w,κ(w)).
First, easy computations show that for all vectors z in ℝ^{n},

  f^{∗}(∇f(z)) = z^{⊤}∇f(z) − f(z),
which gives
  δ(w,κ(w)) = f(X^{⊤}w) + λψ(w) + f^{∗}(∇f(X^{⊤}w)) + λψ^{∗}(−X∇f(X^{⊤}w)/λ)   (50)
            = λψ(w) + w^{⊤}X∇f(X^{⊤}w) + λψ^{∗}(−X∇f(X^{⊤}w)/λ).   (51)

We now use the classical Fenchel-Young inequality (see Proposition
3.3.4 of [2]) on the function ψ, which gives

  δ(w,κ(w)) ≥ w^{⊤}X∇f(X^{⊤}w) − w^{⊤}X∇f(X^{⊤}w) = 0,


with equality if and only if −X∇f(X^{⊤}w)/λ belongs to
∂ψ(w). Interestingly, we know that the first-order optimality
conditions for Eq. (48) give that
−X∇f(X^{⊤}w^{⋆})/λ ∈ ∂ψ(w^{⋆}).
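This derivation can be checked numerically. The sketch below (our own; ridge regression is only an assumed example, chosen because its optimum has a closed form) verifies that with the choice κ(w) = ∇f(X^{⊤}w) the gap vanishes at the optimum and is positive away from it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 4, 0.5
X = rng.standard_normal((p, n))
y = rng.standard_normal(n)

# Ridge regression: f(z) = 0.5||y - z||^2, psi(w) = 0.5||w||^2,
# whose optimum solves (X X^T + lam I) w = X y.
w_star = np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

def delta(w):
    z = X.T @ w
    kappa = z - y                                   # kappa(w) = grad f(X^T w)
    g = 0.5 * np.sum((y - z) ** 2) + 0.5 * lam * np.sum(w ** 2)
    f_conj = 0.5 * np.sum(kappa ** 2) + kappa @ y
    psi_conj = 0.5 * np.sum((X @ kappa / lam) ** 2)  # psi is self-conjugate
    return g + f_conj + lam * psi_conj

print(abs(delta(w_star)) < 1e-9)   # zero gap at the optimum
print(delta(w_star + 0.1) > 1e-9)  # strictly positive gap away from it
```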
We now have in hand a nonnegative function w ↦ δ(w,κ(w)) that
upper-bounds g(w) − g(w^{⋆}) and satisfies δ(w^{⋆},κ(w^{⋆})) = 0.
This is however not a sufficient property to make it a good measure
of the quality of the optimization, and further work, which will depend
on f and ψ, is required.
We have indeed proven that δ(w^{⋆},κ(w^{⋆})) is always 0. However,
for w different from w^{⋆}, δ(w,κ(w)) can be infinite,
making it a non-informative duality gap. The reasons for this can be one of the following:

- The term ψ^{∗}(−X∇f(X^{⊤}w)/λ) might have an infinite value.
- Intercepts make the problem more complicated. One can write the formulation with an intercept by
adding a row to X filled with the value 1, adding one dimension to the vector w, and considering
a regularization function ψ that does not regularize the last entry of
w. This further complicates the computation of ψ^{∗} and its
definition, as shown in the next section.
Let us now detail how we proceed to solve these problems, but first without considering the intercept.
The analysis is similar when working with matrices W instead of vectors w.
A.0.1 Duality Gaps without Intercepts
Let us show how to compute the Fenchel conjugates of the functions we have introduced.
We first present the conjugates of the loss functions f.

The square loss. For f(z) = (1/2)‖y − z‖_{2}^{2},

  f^{∗}(κ) = (1/2)‖κ‖_{2}^{2} + κ^{⊤}y.
The logistic loss. For f(z) = ∑_{i=1}^{n} log(1 + e^{−y_{i}z_{i}}),

  f^{∗}(κ) = { +∞ if ∃ i ∈ [1;n] s.t. y_{i}κ_{i} ∉ [−1,0],
             { ∑_{i=1}^{n} (1+y_{i}κ_{i})log(1+y_{i}κ_{i}) − y_{i}κ_{i}log(−y_{i}κ_{i}) otherwise,

with the convention 0 log 0 = 0.
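This conjugate can be checked numerically against the identity f^{∗}(∇f(z)) = z^{⊤}∇f(z) − f(z) used earlier. A small Python sketch (function names are ours; the binary labels y_{i} ∈ {−1,1} are an assumption of the standard logistic setting):

```python
import numpy as np

def f_logistic(z, y):
    # f(z) = sum_i log(1 + exp(-y_i z_i))
    return np.sum(np.log1p(np.exp(-y * z)))

def grad_f_logistic(z, y):
    # d/dz_i log(1 + exp(-y_i z_i)) = -y_i / (1 + exp(y_i z_i))
    return -y / (1.0 + np.exp(y * z))

def f_conj_logistic(kappa, y):
    t = y * kappa
    if np.any(t < -1.0) or np.any(t > 0.0):
        return np.inf
    # convention 0 log 0 = 0; here t lies strictly inside (-1, 0)
    return np.sum((1.0 + t) * np.log(1.0 + t) - t * np.log(-t))

rng = np.random.default_rng(2)
z = rng.standard_normal(8)
y = np.sign(rng.standard_normal(8))

kappa = grad_f_logistic(z, y)      # y_i kappa_i lies in (-1, 0): finite conjugate
lhs = f_conj_logistic(kappa, y)
rhs = z @ kappa - f_logistic(z, y)
print(abs(lhs - rhs) < 1e-10)      # Fenchel-Young holds with equality
```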






The multi-class logistic loss (or softmax). The primal variable is now a matrix Z in ℝ^{n×r}, which represents the product X^{⊤}W. We denote by K the dual variable in ℝ^{n×r}.

  f(Z) = ∑_{i=1}^{n} log( ∑_{j=1}^{r} e^{Z_{ij} − Z_{iy_{i}}} ),

  f^{∗}(K) = { +∞ if ∃ i ∈ [1;n] s.t. ∑_{j=1}^{r} K_{ij} ≠ 0, or { K_{ij} < 0 and j ≠ y_{i} }, or K_{iy_{i}} < −1,
             { ∑_{i=1}^{n} [ ∑_{j ≠ y_{i}} K_{ij}log(K_{ij}) + (1+K_{iy_{i}})log(1+K_{iy_{i}}) ] otherwise,

again with the convention 0 log 0 = 0.
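For the softmax, the gradient of f with respect to Z has rows P_{i} − e_{y_{i}}, where P_{i} is the softmax of row Z_{i} and e_{y_{i}} is the y_{i}-th canonical basis vector, so its entries automatically satisfy the finiteness conditions above. A sketch checking this (function names are ours; the gradient formula is the standard one for this loss):

```python
import numpy as np

def grad_softmax_loss(Z, labels):
    # f(Z) = sum_i log(sum_j exp(Z_ij - Z_iy_i)); gradient row i is
    # softmax(Z_i) - e_{y_i}.
    Zs = Z - Z.max(axis=1, keepdims=True)       # numerically stable softmax
    P = np.exp(Zs)
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(Z.shape[0]), labels] -= 1.0
    return P

rng = np.random.default_rng(3)
n, r = 6, 4
Z = rng.standard_normal((n, r))
labels = rng.integers(0, r, size=n)

K = grad_softmax_loss(Z, labels)
off = np.ones((n, r), dtype=bool)
off[np.arange(n), labels] = False
print(np.all(K[off] >= 0))                     # K_ij >= 0 for j != y_i
print(np.all(K[np.arange(n), labels] >= -1))   # K_iy_i >= -1
print(np.allclose(K.sum(axis=1), 0))           # rows sum to zero
```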






Our first remark is that the choice of Eq. (49) ensures
that f^{∗}(κ(w)) is not infinite.
As for the regularization function, except for the Tikhonov regularization,
which is self-conjugate (it is equal to its Fenchel conjugate), we have
considered functions that are norms. There exists therefore a norm ‖.‖ such
that ψ(w) = ‖w‖, and we denote by ‖.‖_{∗} its dual norm. In such a
case, the Fenchel conjugate of ψ for a vector γ in ℝ^{p} takes the
form

  ψ^{∗}(γ) = { 0 if ‖γ‖_{∗} ≤ 1,
             { +∞ otherwise.
It turns out that for almost all the norms we have presented, there exists
either (i) a closed form for the dual norm or (ii) an
efficient algorithm for evaluating it. The only one which does not conform to this
statement is the tree-structured sum of ℓ_{2}-norms, for which we do not know
how to evaluate the dual norm efficiently.
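For the simplest cases the dual norm is explicit: ℓ_{1} and ℓ_{∞} are dual to each other, and ℓ_{2} is self-dual. A minimal sketch (our own naming) of the resulting conjugate ψ^{∗}, the indicator of the unit dual-norm ball:

```python
import numpy as np

def dual_norm(gamma, norm):
    """Dual norms of common regularizers: l1 <-> linf, l2 <-> l2."""
    if norm == "l1":
        return np.max(np.abs(gamma))     # dual of l1 is linf
    if norm == "l2":
        return np.linalg.norm(gamma)     # l2 is self-dual
    if norm == "linf":
        return np.sum(np.abs(gamma))     # dual of linf is l1
    raise ValueError(f"unknown norm: {norm}")

def psi_conj(gamma, norm):
    # Conjugate of a norm: 0 inside the unit ball of the dual norm, +inf outside.
    return 0.0 if dual_norm(gamma, norm) <= 1.0 else np.inf

gamma = np.array([0.3, -0.4, 0.2])
print(psi_conj(gamma, "l1"))         # max|gamma_i| = 0.4 <= 1, so 0.0
print(psi_conj(3.0 * gamma, "l2"))   # ||3 gamma||_2 > 1, so inf
```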
One can now slightly modify the definition of κ
to ensure that ψ^{∗}(−Xκ/λ) ≠ +∞. A natural choice is

  κ(w) ≜ min( 1, λ/‖X∇f(X^{⊤}w)‖_{∗} ) ∇f(X^{⊤}w),

which is the one we have implemented. With this new choice, it is easy to see
that for all vectors w in ℝ^{p}, we still have f^{∗}(κ(w)) ≠ +∞, and
finally, we also have δ(w,κ(w)) < +∞ and
δ(w^{⋆},κ(w^{⋆})) = 0, making it potentially a good
duality gap.
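For instance, for the square loss with ℓ_{1} regularization (the Lasso, where ‖.‖_{∗} is the ℓ_{∞} norm), this rescaling makes the term ψ^{∗}(−Xκ/λ) vanish, so the gap is finite and, by weak duality, nonnegative. A sketch (our own; the Lasso is only an assumed example):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 50, 10, 0.5
X = rng.standard_normal((p, n))
y = rng.standard_normal(n)
w = rng.standard_normal(p)          # any primal iterate

z = X.T @ w
grad = z - y                        # gradient of the square loss at X^T w
# Rescale so that ||X kappa||_inf <= lam (ell_inf is the dual of ell_1).
scale = min(1.0, lam / np.max(np.abs(X @ grad)))
kappa = scale * grad

# psi*(-X kappa / lam) = 0 by construction, so the gap reduces to
# g(w) + f*(kappa).
gap = (0.5 * np.sum((y - z) ** 2) + lam * np.sum(np.abs(w))
       + 0.5 * np.sum(kappa ** 2) + kappa @ y)
print(np.isfinite(gap) and gap >= 0.0)
```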
A.0.2 Duality Gaps with Intercepts
Even though adding an intercept seems to be a simple modification of the original
problem, it induces difficulties for finding good dual variables.
We recall that having an intercept is equivalent to having a problem of the
type (48), by adding a row to X filled with the value
1, adding one dimension to the vector w (or one row for matrices W),
and by using a regularization function that does not depend on the last entry
of w (or the last row of W).
Suppose that we are considering a problem of
type (48) of dimension p+1, but that we are using a
regularization function ψ̃: ℝ^{p+1} → ℝ such that, for a
vector w in ℝ^{p+1}, ψ̃(w) ≜ ψ(w_{[1;p]}),
where ψ: ℝ^{p} → ℝ is one of the regularization functions we have
introduced. Then, considering a primal variable w, a dual variable κ,
and writing γ ≜ −Xκ/λ, we are interested in computing

  ψ̃^{∗}(γ) = { +∞ if γ_{p+1} ≠ 0,
             { ψ^{∗}(γ_{[1;p]}) otherwise,

which means that in order for the duality gap not to be infinite, one needs in addition to ensure
that γ_{p+1} be zero. Since the last row of X is filled with ones, this constraint
can be written as ∑_{i=1}^{n} κ_{i} = 0.
For the formulation with matrices W and K, the constraint is similar but for every
column of K.
Let us now detail how we proceed, for every loss function, to find a “good”
dual variable κ satisfying this additional constraint. Given a primal
variable w in ℝ^{p+1}, we first define the auxiliary function

  κ′(w) ≜ ∇f(X^{⊤}w)

(which becomes K′(W) ≜ ∇f(X^{⊤}W) for matrices), and then define a second
auxiliary function κ″(w) that modifies κ′(w) so as to satisfy the additional
constraint ∑_{i=1}^{n} κ_{i} = 0; the precise modification depends on the loss function.
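For the square loss, one natural construction for κ″ is to subtract the mean of κ′, which is the Euclidean projection onto the constraint set (this centering is our illustrative assumption, not the only possible choice):

```python
import numpy as np

def kappa_second(kappa_prime):
    # Euclidean projection of kappa' onto {kappa : sum_i kappa_i = 0},
    # the constraint induced by the intercept row of ones in X.
    return kappa_prime - np.mean(kappa_prime)

k = np.array([0.5, -0.2, 0.9, 0.1])
k2 = kappa_second(k)
print(abs(k2.sum()) < 1e-12)   # the constraint now holds
```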
When the function ψ is the Tikhonov regularization function, we end the process by setting κ(w) = κ″(w).
When it is a norm, we choose, as before, in order to take into account the constraint ‖Xκ‖_{∗} ≤ λ,

  κ(w) ≜ min( 1, λ/‖Xκ″(w)‖_{∗} ) κ″(w),

with a similar formulation for matrices W and K.
Even though finding dual variables while taking into account the intercept
requires quite a lot of engineering, notably implementing a quadratic knapsack
solver, it can be done efficiently.