Knowledge-Guided Wasserstein Distributionally Robust Optimization (2025)

Zitao Wang (Department of Statistics, Columbia University), Ziyuan Wang (Department of Industrial Engineering and Management Sciences, Northwestern University), Molei Liu (Department of Biostatistics, Columbia Mailman School of Public Health), and Nian Si (Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology).
(These authors contributed equally to this work.)

Abstract

Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that adaptively incorporates multiple sources of external knowledge to overcome the conservativeness of vanilla WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.

Keywords: Wasserstein distributionally robust optimization; Knowledge-guided learning; Difference-of-convex optimization; Shrinkage-based transfer learning.

1 Introduction

Traditional machine learning methods based on empirical risk minimization often suffer from overfitting and a lack of generalization power, particularly in high-dimensional, small-sample-size settings. In recent years, distributionally robust optimization (DRO) has emerged as a powerful framework for mitigating the effects of model misspecification and enhancing the generalization of machine learning models. Among the various DRO formulations, Wasserstein DRO (WDRO) has gained particular attention due to its tractability and generalizability. Specifically, in WDRO, one optimizes against the worst-case distribution within an ambiguity set defined by a Wasserstein ball centered at an empirical measure.

However, one persistent challenge with WDRO is its tendency to be overly conservative, which can lead to suboptimal performance in practice, as found in Liu et al. (2024). In many real-world scenarios, prior knowledge can be leveraged to improve model performance and robustness. For example, with electronic health record data, prior knowledge might come from predictive models trained on existing large, population-wide datasets. In such contexts, transfer learning has proven to be a versatile approach for improving performance on a target task. Despite its successes, the integration of prior knowledge into WDRO frameworks has remained an open question.

In this work, we introduce Knowledge-Guided Wasserstein Distributionally Robust Optimization (KG-WDRO), a novel framework that adapts the Wasserstein ambiguity set using external knowledge (parameters). We assume access to prior predictors from pre-trained models, which can guide the predictive model on the target dataset. By constraining the transport cost along directions informed by prior knowledge, our approach addresses the conservativeness of vanilla WDRO while preserving robustness. Intuitively, this strategy allows the model to focus its uncertainty on regions where prior knowledge is less reliable, effectively robustifying knowledge-guided generalization.

1.1 Related Works

1.1.1 Wasserstein DRO

Wasserstein DRO has recently garnered significant attention due to its tractability (Blanchet and Murthy, 2019; Mohajerin Esfahani and Kuhn, 2018; Gao and Kleywegt, 2023) and generalizability (Blanchet et al., 2019a; Gao et al., 2022). Notably, Blanchet et al. (2019a) and Gao et al. (2022) demonstrate that Wasserstein DRO with mean squared loss is equivalent to the square-root lasso (Belloni et al., 2011). Similarly, Shafieezadeh-Abadeh et al. (2015, 2019), Blanchet et al. (2019a), and Gao et al. (2022) establish that Wasserstein DRO with logistic loss and hinge loss corresponds to their regularized counterparts. Moreover, the statistical properties of the WDRO estimator have been investigated in Blanchet et al. (2021, 2022) and Gao (2023). However, leveraging external knowledge in Wasserstein DRO has remained an open problem.

1.1.2 Transfer Learning

Improving prediction accuracy for target populations by integrating diverse source datasets has driven methodological advances in transfer learning. Contemporary approaches aim to address challenges including distributional heterogeneity and limited labeled target data. A common assumption is that the target outcome model aligns partially with the source models, enabling knowledge transfer. For example, recent frameworks employ selective parameter reduction to identify transferable sources and sparse or ridge shrinkage to leverage their knowledge (Bastani, 2020; Li et al., 2021; Tian and Feng, 2023). Subsequent works tackle covariate distribution mismatches and semi-supervised scenarios, enhancing robustness when labeled target data is scarce (Cai et al., 2024; He et al., 2024; Zhou et al., 2024). Further innovations include geometric or profile-based adaptations, where the target model is represented as a weighted combination of source coefficients (Gu et al., 2024; Lin et al., 2024).

Table 1: Overview of transfer learning methods and their capabilities (✓ = supported). Capabilities considered: Ridge-type, Lasso-type, Scale adjustment, Continuous outcome, Binary outcome, Partial transfer, Multi-source ensemble.

KG-WDRO: ✓✓✓✓✓✓✓ (all seven)
Bastani (2020): ✓✓
Li et al. (2021): ✓✓
Tian and Feng (2023): ✓✓✓
Gu et al. (2024): ✓✓✓✓
Lin et al. (2024): ✓✓✓✓

1.2 Our Contribution

Our contributions are fourfold. Framework: We introduce KG-WDRO, a principled and flexible framework that integrates prior knowledge into WDRO for linear regression and binary classification. This framework mitigates the conservativeness of standard WDRO, enables automated covariate scaling adjustments, and prevents negative transfer. Theory: We establish the equivalence between KG-WDRO and shrinkage-based estimation methods, offering a novel perspective that unifies and interprets a broad range of knowledge transfer learning approaches through the lens of distributional robustness. Table 1 provides an overview of these methods, highlighting their key capabilities and comparing them with our framework. Technicalities: Leveraging Toland's duality (Theorem A6), we reformulate the innermost maximization in WDRO's strong duality (Proposition 1) into a univariate optimization problem (B). This reformulation enhances tractability while accommodating more general cost functions. Empirical Validation: Through extensive experiments, we demonstrate the effectiveness of KG-WDRO in improving small-sample transfer learning.

Below is an overview of our main results for the linear regression case.

Example 1.

Suppose $\theta$ is an accessible prior predictor for a linear model parameterized by $\beta$. We show that the shrinkage-based transfer-learning regression problem, which estimates a target predictor $\beta$ by solving

$$\inf_{\beta,\kappa}\ \|\mathbf{y}-\mathbf{X}\beta\|_{2}+\sqrt{\delta}\,\|\beta-\kappa\theta\|_{p},$$

can be interpreted as a Wasserstein distributionally robust optimization (WDRO) problem of the form (WDRO), where the loss function is least squares, $\ell(X,Y;\beta)=(Y-\beta^{\sf T}X)^{2}$, and the ambiguity set $\mathcal{B}_{\delta}(\mathbb{P}_{N};c_{2,\infty})$ is a ball around the empirical measure. The cost function $c_{2,\infty}$ augments the standard transport cost with the constraint $x^{\sf T}\theta=u^{\sf T}\theta$, so that

$$c_{2,\infty}\big((x,y),(u,v)\big)=\|x-u\|_{q}^{2}+\infty\cdot|y-v|+\infty\cdot|(x-u)^{\sf T}\theta|.$$

This establishes a distributionally robust optimization (DRO) perspective on a broad class of transfer-learning methods, as discussed in Section 3.
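As a concrete illustration, the objective above is jointly convex in $(\beta,\kappa)$ and can be solved directly with an off-the-shelf convex solver. Below is a minimal sketch using cvxpy; the synthetic data, the prior predictor theta, the radius delta, and the choice of p are illustrative placeholders rather than the paper's experimental setup.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, d = 30, 5
theta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])        # prior predictor from a source model
beta_true = 1.2 * theta + np.array([0.0, 0.0, 0.3, 0.0, 0.0])
X = rng.normal(size=(N, d))
y = X @ beta_true + 0.1 * rng.normal(size=N)

delta, p = 0.05, 2                                  # ambiguity radius and penalty norm (p = 1 or 2)
beta = cp.Variable(d)
kappa = cp.Variable()                               # free rescaling of the prior direction

# fit term ||y - X beta||_2 plus the knowledge-guided shrinkage penalty
objective = cp.norm(y - X @ beta, 2) + np.sqrt(delta) * cp.norm(beta - kappa * theta, p)
cp.Problem(cp.Minimize(objective)).solve()
print("beta_hat:", np.round(beta.value, 3), "kappa_hat:", np.round(kappa.value, 3))
```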

1.3 Notations & Organizations

We summarize the mathematical notation used in this work. The positive integers $N$, $M$, and $d$ denote, respectively, the target sample size, the number of sources, and the dimension of the covariate $X$. The integers $p$ and $q\in[1,\infty]$ are reserved for pairs of Hölder conjugates, satisfying $p^{-1}+q^{-1}=1$ for $p,q\in(1,\infty)$, together with the pair $1$ and $\infty$. For a distribution $\mathbb{P}$ supported on the Euclidean space $\mathbb{R}^{d}$, we use $\mathbb{P}_{N}$ to denote the empirical measure of $\mathbb{P}$ with sample size $N$. In modeling the target-covariate relationship, the distribution is often factorized as $\mathbb{P}=\mathbb{P}^{Y|X}\times\mathbb{P}^{X}$. For a vector $v\in\mathbb{R}^{d}$, $\|v\|_{p}$ denotes its $p$-norm, $p\in[1,\infty]$, and $v^{\sf T}$ denotes its transpose. For any two vectors $u,v\in\mathbb{R}^{d}$, $\cos(u,v)$ denotes the cosine of the angle between $u$ and $v$, defined by $\cos(u,v)\,\|u\|_{2}\|v\|_{2}=u^{\sf T}v$. All vectors are column vectors. Other specialized notation is defined in context as needed.

The remainder of the paper is organized as follows. Section 2 provides a review of the WDRO framework, including the strong duality result. In Section 3, we introduce our KG-WDRO framework and demonstrate its equivalence to shrinkage-based estimations in both linear regression and binary classification. Section 4 presents comprehensive results from our numerical simulations. All proofs and detailed descriptions of the numerical simulation setups are provided in the appendix.

2 Preliminaries

We begin with a short overview of the distributionally robust framework for statistical learning.

2.1 Optimal Transport Cost

Let $\mathbb{P}$ and $\mathbb{Q}$ denote two probability distributions supported on $\mathbb{R}^{d}$, and let $\mathcal{P}(\mathbb{R}^{d}\times\mathbb{R}^{d})$ denote the set of all probability measures on the product space $\mathbb{R}^{d}\times\mathbb{R}^{d}$. We say that an element $\pi\in\mathcal{P}(\mathbb{R}^{d}\times\mathbb{R}^{d})$ has first marginal $\mathbb{P}$ and second marginal $\mathbb{Q}$ if

$$\pi(A\times\mathbb{R}^{d})=\mathbb{P}(A),\qquad\pi(\mathbb{R}^{d}\times B)=\mathbb{Q}(B),$$

for all Borel measurable sets $A,B\subseteq\mathbb{R}^{d}$. The class of all such measures $\pi$ is denoted $\Pi(\mathbb{P},\mathbb{Q})$ and is called the set of transport plans; it is always non-empty. Choose a non-negative, lower semi-continuous function $c:\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,\infty]$ such that $c(u,v)=0$ whenever $u=v$. Kantorovich's formulation of optimal transport is then defined as

$$\mathcal{D}_{c}(\mathbb{P},\mathbb{Q})\coloneqq\inf_{\pi\in\Pi(\mathbb{P},\mathbb{Q})}\mathbb{E}_{\pi}\left[c(U,V)\right].$$

It is well known (Villani, 2009, Theorem 4.1) that there exists an optimal coupling $\pi^{\dagger}$ that solves Kantorovich's problem $\inf_{\pi\in\Pi(\mathbb{P},\mathbb{Q})}\mathbb{E}_{\pi}[c(U,V)]$. Intuitively, if we think of $c(u,v)$ as the cost of transferring one unit of mass from $u\in\mathbb{R}^{d}$ to $v\in\mathbb{R}^{d}$, then $\mathbb{E}_{\pi}[c(U,V)]$ gives the average cost of transfer under the plan $\pi$. The optimal transport cost $\mathcal{D}_{c}(\mathbb{P},\mathbb{Q})$ thus provides a measure of discrepancy between probability distributions on $\mathbb{R}^{d}$.

If $c(u,v)$ defines a metric on $\mathbb{R}^{d}$, then for any $p\in[1,\infty)$ the optimal transport cost

$$\mathcal{D}_{c}^{1/p}(\mathbb{P},\mathbb{Q})\coloneqq\left(\inf_{\pi\in\Pi(\mathbb{P},\mathbb{Q})}\mathbb{E}_{\pi}\left[c(U,V)^{p}\right]\right)^{1/p},$$

defines a metric between probability distributions and metrizes weak convergence under moment assumptions. It is called the $p$-Wasserstein distance; we direct the interested reader to (Villani, 2009, Chapter 6) for more details. It is worth mentioning that none of our judiciously chosen cost functions qualify as metrics on the support of the data.
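For finitely supported measures, Kantorovich's problem is a linear program over the transport plan $\pi$. The following sketch, using scipy's linear-programming routine, computes $\mathcal{D}_{c}(\mathbb{P},\mathbb{Q})$ for two small discrete measures on $\mathbb{R}$ with the illustrative cost $c(u,v)=|u-v|^{2}$; the supports and weights are arbitrary choices for demonstration.

```python
import numpy as np
from scipy.optimize import linprog

u = np.array([0.0, 1.0, 2.0]); p_w = np.array([0.5, 0.3, 0.2])   # support / weights of P
v = np.array([0.5, 1.5]);      q_w = np.array([0.6, 0.4])        # support / weights of Q

C = (u[:, None] - v[None, :]) ** 2        # cost matrix c(u_i, v_j) = |u_i - v_j|^2
m, n = C.shape

# marginal constraints: row sums of pi equal p_w, column sums equal q_w
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0      # sum_j pi_ij = p_w[i]
for j in range(n):
    A_eq[m + j, j::n] = 1.0               # sum_i pi_ij = q_w[j]
b_eq = np.concatenate([p_w, q_w])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("optimal transport cost D_c(P, Q) =", res.fun)
```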

2.2 Distributionally Robust Optimization

In the standard statistical learning framework, one generally assumes that the target-covariate pair $(X,Y)\in\mathbb{R}^{d}\times\mathbb{R}\cong\mathbb{R}^{d+1}$ follows a data-generating distribution $\mathbb{P}\coloneqq\mathbb{P}_{X,Y}$ supported on $\mathbb{R}^{d+1}$. One then seeks a 'best' parameter $\beta$ that relates $Y$ to $X$ through a parameterized model by solving the stochastic optimization

$$\inf_{\beta}\mathbb{E}_{\mathbb{P}}\left[\ell(X,Y;\beta)\right].$$

The loss function $\ell(x,y;\beta)$ quantifies the goodness-of-fit of the parameter $\beta$ given the realized observation $(x,y)$. Since only samples $\{(x_{i},y_{i})\}_{i=1,\ldots,N}$ are observed, we can typically only solve the empirical objective

$$\inf_{\beta}\mathbb{E}_{\mathbb{P}_{N}}[\ell(X,Y;\beta)]=\inf_{\beta}\frac{1}{N}\sum_{i=1}^{N}\ell(x_{i},y_{i};\beta).$$

The distribution $\mathbb{P}$ underlying the data-generating mechanism is therefore uncertain to the decision-maker. This motivates the distributionally robust optimization (DRO) framework, which entails solving the following minimax stochastic program:

$$\inf_{\beta}\sup_{\mathbb{P}\in\mathcal{P}_{\rm amb}}\mathbb{E}_{\mathbb{P}}[\ell(X,Y;\beta)],$$

where the ambiguity set $\mathcal{P}_{\rm amb}$ is a class of probability measures supported on $\mathbb{R}^{d+1}$ that are candidates for the true data-generating distribution. In Wasserstein DRO, the ambiguity set is constructed as a '$\delta$-ball' around the canonical empirical measure $\mathbb{P}_{N}$ with respect to a decision-maker-defined transport cost $c$, i.e., we choose the ambiguity set $\mathcal{P}_{\rm amb}$ as

$$\mathcal{B}_{\delta}(\mathbb{P}_{N};c)\coloneqq\{\mathbb{P}\in\mathcal{P}(\mathbb{R}^{d+1}):\mathcal{D}_{c}(\mathbb{P},\mathbb{P}_{N})\leq\delta\}.\tag{WDRO}$$

This ambiguity set captures probability measures that are close to the observed empirical measure in the transport cost $\mathcal{D}_{c}$, which may be viewed as a class of candidate measures perturbed from $\mathbb{P}_{N}$. The solution $\beta_{\rm DRO}$ of the worst-case expected loss should perform well over the entire set of perturbations in the ambiguity set, in contrast to $\beta_{\rm ERM}$, which solves the empirical objective and is only guaranteed to perform well on the training samples. This adds a layer of robustness to the WDRO problem (WDRO). For a comprehensive overview of different constructions of ambiguity sets, we direct the interested reader to (Kuhn et al., 2024, Section 2).

2.3 Strong Duality of Wasserstein DRO

The Wasserstein DRO problem involves an inner maximization over an infinite-dimensional set, which appears computationally intractable. However, because the distribution $\mathbb{P}_{N}$ is discrete, strong duality reformulates the Wasserstein DRO problem as a simple univariate optimization.

Proposition 1 (Strong Duality; Blanchet et al., 2019a, Proposition 1).

Let $c:\mathbb{R}^{d+1}\times\mathbb{R}^{d+1}\to[0,\infty]$ be a lower semi-continuous cost function satisfying $c\big((x,y),(u,v)\big)=0$ whenever $(x,y)=(u,v)$. Then the distributionally robust regression problem

$$\inf_{\beta\in\mathbb{R}^{d}}\sup_{\mathbb{P}\in\mathcal{B}_{\delta}(\mathbb{P}_{N};c)}\mathbb{E}_{\mathbb{P}}\left[\ell(X,Y;\beta)\right],$$

is equivalent to,

$$\inf_{\beta\in\mathbb{R}^{d}}\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{N}\sum_{i=1}^{N}\phi_{\gamma}(x_{i},y_{i};\beta)\right\},$$

where $\phi_{\gamma}(x_{i},y_{i};\beta)$ is given by

$$\phi_{\gamma}(x_{i},y_{i};\beta)=\sup_{(u,v)\in\mathbb{R}^{d+1}}\big\{\ell(u,v;\beta)-\gamma\,c\big((u,v),(x_{i},y_{i})\big)\big\}.$$

For more general results, see (Blanchet and Murthy, 2019, Theorem 1) and (Gao et al., 2022, Section 2). The exchangeability of $\sup$ and $\inf$ in Wasserstein DRO is also established by (Blanchet et al., 2019a, Lemma 1).
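To make the dual concrete, consider the squared loss with the covariate-only cost $c_{2}$ of (1) with $q=2$ (introduced in Section 3.1). Writing $e_{i}=y_{i}-\beta^{\sf T}x_{i}$, a direct calculation gives $\phi_{\gamma}(x_{i},y_{i};\beta)=\gamma e_{i}^{2}/(\gamma-\|\beta\|_{2}^{2})$ whenever $\gamma>\|\beta\|_{2}^{2}$ (and $+\infty$ otherwise). The following sketch, with arbitrary synthetic inputs, checks this closed form against a numerical maximization of the inner supremum.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d = 4
beta = rng.normal(size=d)
x_i, y_i = rng.normal(size=d), 0.7
gamma = beta @ beta + 2.0                      # the supremum is finite only if gamma > ||beta||_2^2

def neg_inner(u):
    # negative of  loss(u, y_i; beta) - gamma * c_2((x_i, y_i), (u, y_i))  with the target fixed
    return -((y_i - beta @ u) ** 2 - gamma * np.sum((x_i - u) ** 2))

res = minimize(neg_inner, x_i, method="BFGS")  # the inner problem is concave, so this is a global max
e_i = y_i - beta @ x_i
print("numerical sup :", -res.fun)
print("closed form   :", gamma * e_i ** 2 / (gamma - beta @ beta))
```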

3 Knowledge-Guided Wasserstein DRO

In this section, we propose new cost functions for the Wasserstein DRO framework that leverage prior knowledge for transfer learning. For linear regression and binary classification, these cost functions act as regularizers, encouraging collinearity with prior knowledge.

3.1 Knowledge-Guided Transport Cost

It is shown in (Blanchet et al., 2019a, Theorem 1) that using the squared $q$-norm on the covariates as the cost function

$$c_{2}\big((x,y),(u,v)\big)=\|x-u\|_{q}^{2}+\infty\cdot|y-v|,\tag{1}$$

equates Wasserstein distributionally robust linear regression with $p$-norm regularization of the root mean squared error (RMSE). The cost function $c_{2}$ perturbs only the observed covariates $\{x_{i}\}_{i=1}^{N}$ while keeping the observed targets $\{y_{i}\}_{i=1}^{N}$ fixed. Keeping the observed target $Y$ fixed often leads to a more mathematically tractable reformulation; another intuition is that we trust the mechanism by which the target $Y$ is generated once $X$ is known.

In the presence of prior knowledge $\theta$ that may aid in inferring $\beta$, we aim to control the extent of perturbation along the direction of $\theta$.

Specifically, we constrain the size of the prediction discrepancy $\theta^{\sf T}x-\theta^{\sf T}u=\theta^{\sf T}\Delta$, where $\Delta\coloneqq x-u$. To achieve this, we augment the cost function $c_{2}$ with an additional penalty term that accounts for the size of the perturbation in the direction of $\theta$:

$$c_{2,\lambda}\big((x,y),(u,v)\big)=\|\Delta\|_{q}^{2}+\infty\cdot|y-v|+\lambda\,h(|\theta^{\sf T}\Delta|),\tag{2}$$

where $\lambda>0$ and $h:\mathbb{R}\to\mathbb{R}^{+}\cup\{0\}$ is a non-negative, monotone increasing function of $|x|$ such that $h(0)=0$. Recall that under the cost function $c_{2}(\cdot)$ the targets $y$ remain fixed. Intuitively, the new cost function (2) encourages the Wasserstein ambiguity set to include distributions whose marginals in $X$ generate predictions that align with the data under the prior predictor $\theta$. The parameter $\lambda$ controls the level of confidence in the prior knowledge. We call such cost functions knowledge-guided. Since $c_{2,\lambda}$ upper bounds the cost function $c_{2}$, we have $\mathcal{B}_{\delta}(\mathbb{P}^{X}_{N};c_{2,\lambda_{2}})\subseteq\mathcal{B}_{\delta}(\mathbb{P}^{X}_{N};c_{2,\lambda_{1}})\subseteq\mathcal{B}_{\delta}(\mathbb{P}^{X}_{N};c_{2})$ whenever $\lambda_{2}>\lambda_{1}$.

The corresponding optimal transport problem, given by

$$\inf_{\pi\in\Pi(\mathbb{Q}^{X},\mathbb{P}_{N}^{X})}\mathbb{E}_{\pi}[c_{2,\lambda}(X,U)],$$

can also be expressed as:

$$\inf_{\pi\in\Pi(\mathbb{Q}^{X},\mathbb{P}_{N}^{X})}\mathbb{E}_{\pi}[c_{2}(X,U)]+\lambda\,\mathbb{E}_{\pi}[h(|\theta^{\sf T}\Delta|)].$$

This formulation regularizes the original optimal transport problem by penalizing large values of the expectation $\mathbb{E}_{\pi}[h(|\theta^{\sf T}\Delta|)]$.

For a user-defined function $h$ that measures the discrepancy in generalization with respect to the prior knowledge $\theta$, we refer to weak-transferring of knowledge when $\lambda<+\infty$ and strong-transferring of knowledge when $\lambda=+\infty$. In the case of strong-transferring, to ensure the finiteness of the optimal transport problem, the minimizing transport plan $\pi^{\dagger}$ must satisfy the orthogonality condition $\theta^{\sf T}\Delta=0$, $\pi^{\dagger}$-almost surely. Consequently, the value of $\theta^{\sf T}X$ remains unchanged after perturbing $\mathbb{P}_{N}^{X}$ within $\mathcal{B}_{\delta}(\mathbb{P}_{N}^{X};c_{2,\infty})$. As a result, this should promote $\beta_{\rm DRO}\to\theta$ as $\delta\to\infty$.
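As a pointwise illustration, a minimal sketch of the knowledge-guided cost (2) with the specific choice $h(t)=t^{2}$ and a single prior direction is given below; the function and its arguments are illustrative, and passing lam = np.inf recovers the strong-transferring cost $c_{2,\infty}$.

```python
import numpy as np

def c2_lambda(x, y, u, v, theta, lam, q=2):
    """Knowledge-guided transport cost c_{2,lambda} of (2) with h(t) = t**2."""
    if not np.isclose(y, v):                           # the target is never perturbed
        return np.inf
    delta = x - u
    along_theta = float(theta @ delta) ** 2            # h(|theta^T delta|)
    base = np.linalg.norm(delta, q) ** 2
    if np.isinf(lam):                                   # strong transferring: theta^T delta must vanish
        return base if np.isclose(along_theta, 0.0) else np.inf
    return base + lam * along_theta                     # weak transferring
```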

Remark 1.

The above framework extends to incorporate multi-site prior knowledge: instead of a single prior knowledge coefficient $\theta_{1}$, we consider a set of coefficients $\{\theta_{1},\theta_{2},\ldots,\theta_{M}\}$. Let $\Theta\coloneqq\operatorname{span}\{\theta_{1},\theta_{2},\ldots,\theta_{M}\}$ represent the linear span of these prior knowledge coefficients. In the case of strong-transferring, we must ensure that $\operatorname{rank}(\Theta)<d$; otherwise, the set of orthogonality conditions $\{\theta_{m}^{\sf T}\Delta=0;\ m\in[M]\}$ would imply that the perturbation is identically zero ($\Delta=\mathbf{0}$). This would render the ambiguity set redundant and reduce the WDRO problem (WDRO) to the ERM problem. This observation is confirmed by the statements of Theorems 1 and 3.

3.2 Linear Regression

We begin by examining the WDRO problem (WDRO) for linear regression in the strong-transferring regime, and then present a specific case in the weak-transferring regime. Let $\Theta\coloneqq\operatorname{span}\{\theta_{1},\ldots,\theta_{M}\}$ represent the linear span of the prior knowledge.

3.2.1 Strong-Transferring

Define the cost function $c_{2,\infty}\big((x,y),(u,v)\big)\coloneqq\|x-u\|_{q}^{2}+\infty\cdot|y-v|+\infty\cdot|\theta_{1}^{\sf T}x-\theta_{1}^{\sf T}u|+\ldots+\infty\cdot|\theta_{M}^{\sf T}x-\theta_{M}^{\sf T}u|$, and for a set of observed samples $\{(x_{i},y_{i})\}_{i\in[N]}$ write ${\rm MSE}_{N}(\beta)\coloneqq N^{-1}\sum_{i=1}^{N}(y_{i}-\beta^{\sf T}x_{i})^{2}$. Without making any additional distributional assumptions on $(X,Y)$, we obtain the following finite-dimensional representation.

Theorem 1 (Linear Regression with Strong-Transferring).

Consider the least-squares loss $\ell(X,Y;\beta)=(Y-\beta^{\sf T}X)^{2}$. Then for any $q\in[1,\infty]$ we have

$$\inf_{\beta\in\mathbb{R}^{d}}\sup_{\mathbb{P}\in\mathcal{B}_{\delta}(\mathbb{P}_{N};c_{2,\infty})}\mathbb{E}_{\mathbb{P}}\left[(Y-\beta^{\sf T}X)^{2}\right]=\inf_{\beta\in\mathbb{R}^{d},\,\vartheta\in\Theta}\left(\sqrt{{\rm MSE}_{N}(\beta)}+\sqrt{\delta}\,\|\beta-\vartheta\|_{p}\right)^{2},$$

where $p$ is such that $p^{-1}+q^{-1}=1$.

From the above result, we observe that the knowledge-guided WDRO problem for linear regression is equivalent to regularizing the RMSE with a $p$-norm distance to the linear span $\Theta$. The regularization parameter is entirely determined by the size (or radius) of the Wasserstein ambiguity set. Importantly, the penalty term targets collinearity with the prior knowledge rather than the algebraic difference or angular proximity.

Consider the case of a single prior knowledge vector $\theta_{1}$: the penalty term does not constrain the solution $\beta_{\rm DRO}$ to be close to $\theta_{1}$, but rather to $\kappa\cdot\theta_{1}$ for some $\kappa\in\mathbb{R}$ to be optimized. Consequently, this knowledge transfer automatically robustifies the solution against scaling of the covariates. Furthermore, it can prevent negative transfer by adapting the sign of $\kappa$ so that $\kappa\theta_{1}$ is positively correlated with $\beta^{*}$, the solution to the population objective. When $\delta\to\infty$, the penalty term becomes dominant, forcing $\beta$ to lie in $\Theta$ for any $p\geq 1$. This reduces the WDRO problem to the simple constrained regression problem

$$\inf_{\beta\in\Theta}{\rm MSE}_{N}(\beta),$$

reflecting complete reliance on the prior knowledge and preventing excessive shrinkage towards the null estimator.
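The right-hand side of Theorem 1 is a finite-dimensional convex program once $\vartheta$ is parameterized as a linear combination of the source coefficients, $\vartheta=\sum_{m}w_{m}\theta_{m}$. Below is a minimal sketch in cvxpy with synthetic data and two illustrative sources; since the objective is the square of a non-negative quantity, it suffices to minimize $\sqrt{{\rm MSE}_{N}(\beta)}+\sqrt{\delta}\,\|\beta-\vartheta\|_{p}$ itself.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
N, d, M = 40, 6, 2
Theta_mat = rng.normal(size=(d, M))                 # columns theta_1, ..., theta_M
beta_true = Theta_mat @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ beta_true + 0.1 * rng.normal(size=N)

delta, p = 0.1, 1                                   # ambiguity radius and penalty norm
beta, w = cp.Variable(d), cp.Variable(M)            # vartheta = Theta_mat @ w ranges over span(Theta)
rmse = cp.norm(y - X @ beta, 2) / np.sqrt(N)        # sqrt(MSE_N(beta))
penalty = np.sqrt(delta) * cp.norm(beta - Theta_mat @ w, p)
cp.Problem(cp.Minimize(rmse + penalty)).solve()
print("distance of beta_hat to span(Theta):", round(penalty.value / np.sqrt(delta), 4))
```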

Remark 2.

We now discuss two special cases of the penalty term: $p=2$ (ridge-type regularization) and $p=1$ (lasso-type regularization). For simplicity, we consider the case of a single prior knowledge vector $\theta$.

Ridge-type. The penalty term can be explicitly calculated as

$$\min_{\kappa\in\mathbb{R}}\|\beta-\kappa\theta\|_{2}=\left\|\beta-\frac{\beta^{\sf T}\theta}{\|\theta\|_{2}^{2}}\theta\right\|_{2}=\|\beta^{\perp\theta}\|_{2},$$

where $\beta^{\perp\theta}$ is the component of $\beta$ orthogonal to $\theta$. This penalty term shrinks the distance of $\beta$ to the line spanned by $\theta$. Furthermore, note that

$$\|\beta^{\perp\theta}\|_{2}=\|\beta\|_{2}\sin(\beta,\theta)=\|\beta\|_{2}\sqrt{1-\cos^{2}(\beta,\theta)},$$

which represents a trade-off between the magnitude of $\beta$ and its angular proximity to the prior knowledge $\theta$. This trade-off is illustrated in the leftmost panel of Fig. 1, which draws the feasible set induced by the regularization viewed as a constraint. This regularization is closely related to, but different from, the one proposed in Gu et al. (2024), which penalizes large values of a computational relaxation of $\sin(\beta,\theta)$.

Lasso-type. When the prior knowledge $\theta$ is sparse, the penalty term $\min_{\kappa}\|\beta-\kappa\theta\|_{1}$ promotes sparse representation learning. Consider a simple example where the dimension is $d=3$ and $\theta=(1,0,0)^{\sf T}$. In this case, we have

$$\min_{\kappa}\|\beta-\kappa\theta\|_{1}=\min_{\kappa}\big(|\beta_{1}-\kappa|+|\beta_{2}|+|\beta_{3}|\big)=|\beta_{2}|+|\beta_{3}|\eqqcolon\|\beta_{-1}\|_{1},$$

where $\beta_{-1}=(\beta_{2},\beta_{3})^{\sf T}$. This formulation enforces sparsity only on the last two components of $\beta$, reflecting the sparsity pattern of $\theta$.
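A quick numerical check of both special cases is sketched below, with arbitrary illustrative vectors: the ridge-type penalty $\min_{\kappa}\|\beta-\kappa\theta\|_{2}$ coincides with the norm of the component of $\beta$ orthogonal to $\theta$, and for the sparse $\theta=(1,0,0)^{\sf T}$ the lasso-type penalty reduces to $|\beta_{2}|+|\beta_{3}|$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

beta = np.array([2.0, 1.0, 0.5])

# ridge-type: min_kappa ||beta - kappa*theta||_2 equals ||beta_perp||_2
theta_r = np.array([1.0, 2.0, 0.0])
ridge = minimize_scalar(lambda k: np.linalg.norm(beta - k * theta_r, 2)).fun
proj = (beta @ theta_r) / (theta_r @ theta_r) * theta_r
print(ridge, np.linalg.norm(beta - proj))            # both ~ 1.4318

# lasso-type with sparse theta = (1, 0, 0): penalty reduces to |beta_2| + |beta_3|
theta_s = np.array([1.0, 0.0, 0.0])
lasso = minimize_scalar(lambda k: np.linalg.norm(beta - k * theta_s, 1)).fun
print(lasso, abs(beta[1]) + abs(beta[2]))            # both = 1.5
```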

[Figure 1: Feasible sets induced by the knowledge-guided regularizers (three panels).]

3.2.2 Weak Transferring

For the special case $q=p=2$, we define the weak-transferring cost function $c_{2,\lambda}\big((x,y),(u,v)\big)=\|x-u\|_{2}^{2}+\lambda(\theta^{\sf T}x-\theta^{\sf T}u)^{2}+\infty\cdot|y-v|$ with $0<\lambda<+\infty$. Here we select $h(x)=x^{2}$ as the user-defined function controlling the size of the perturbation along $\theta$. For simplicity, we consider a single prior knowledge vector $\theta$ in this setup; the result extends straightforwardly to a multi-source setup with different values of $\lambda$.

Theorem 2 (Linear Regression with Weak Transferring).

Consider the least-squares loss $\ell(X,Y;\beta)=(Y-\beta^{\sf T}X)^{2}$. Then for $p=q=2$ we have

$$\inf_{\beta\in\mathbb{R}^{d}}\sup_{\mathbb{P}\in\mathcal{B}_{\delta}(\mathbb{P}_{N};c_{2,\lambda})}\mathbb{E}_{\mathbb{P}}\left[(Y-\beta^{\sf T}X)^{2}\right]=\inf_{\beta\in\mathbb{R}^{d}}\left(\sqrt{{\rm MSE}_{N}(\beta)}+\sqrt{\delta}\,\|\beta\|_{\Psi_{\lambda^{-1}}}\right)^{2},$$

where $\Psi_{\lambda}=I_{d}-\frac{1}{\|\theta\|_{2}^{2}+\lambda}\theta\theta^{\sf T}$ and $\|\beta\|_{\Psi_{\lambda}}^{2}=\beta^{\sf T}\Psi_{\lambda}\beta$.

Write $P_{\lambda}=\theta\theta^{\sf T}/(\|\theta\|_{2}^{2}+\lambda)$. As $\lambda\to\infty$, we have $P_{\lambda^{-1}}\to P_{0}=\theta\theta^{\sf T}/\|\theta\|_{2}^{2}$, recovering the projection matrix onto the prior knowledge $\theta$. Consequently, $\|\beta\|_{\Psi_{\lambda^{-1}}}\to\|\beta^{\perp\theta}\|_{2}$.
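A small numerical illustration of this limiting behavior, with arbitrary $\beta$ and $\theta$, is sketched below: as $\lambda$ grows the penalty $\|\beta\|_{\Psi_{\lambda^{-1}}}$ decreases to $\|\beta^{\perp\theta}\|_{2}$, while as $\lambda\to 0$ it increases to the plain ridge penalty $\|\beta\|_{2}$.

```python
import numpy as np

beta = np.array([1.5, 0.5, -0.2])
theta = np.array([1.0, 1.0, 0.0])

def psi_norm(beta, theta, lam):
    # ||beta||_{Psi_{1/lambda}} with Psi_mu = I_d - theta theta^T / (||theta||_2^2 + mu)
    Psi = np.eye(len(beta)) - np.outer(theta, theta) / (theta @ theta + 1.0 / lam)
    return np.sqrt(beta @ Psi @ beta)

beta_perp = beta - (beta @ theta) / (theta @ theta) * theta
for lam in [0.01, 1.0, 100.0, 1e6]:
    print(f"lambda = {lam:8g}:  ||beta||_Psi = {psi_norm(beta, theta, lam):.4f}")
print("||beta_perp||_2 =", round(np.linalg.norm(beta_perp), 4),
      "  ||beta||_2 =", round(np.linalg.norm(beta), 4))
```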

We observe that the action
\[
P_{\lambda^{-1}}\beta=\frac{\beta^{\sf T}\theta}{\|\theta\|_{2}^{2}+\lambda^{-1}}\,\theta
\]
is exactly the ridge regression of $\beta$ onto $\theta$ with regularization parameter $\lambda^{-1}$. Thus, the finiteness of $\lambda$, which reflects caution about the prior knowledge $\theta$, induces a shrinkage effect on the component of $\beta$ explainable by $\theta$ in the dot-product geometry. Since $\Psi_{\lambda^{-1}}\succeq I_{d}-P_{0}$, we have $\|\beta\|_{\Psi_{\lambda^{-1}}}\geq\|\beta^{\perp\theta}\|_{2}$ for any finite $\lambda>0$, with strict inequality whenever $\beta^{\sf T}\theta\neq 0$. This implies the inclusion of feasible sets

\[
\{\beta:\|\beta\|_{\Psi_{\lambda^{-1}}}\leq K\}\subset\{\beta:\|\beta^{\perp\theta}\|_{2}\leq K\},
\]

as plotted in Fig. 1 for an illustration on $\mathbb{R}^{2}$. The contour $\{\beta\in\mathbb{R}^{2}:\|\beta\|_{\Psi_{\lambda^{-1}}}=K\}$ forms an ellipse centered at the origin $\boldsymbol{0}$. The ellipse has a major axis of half-length $K\sqrt{(\|\theta\|_{2}^{2}+\lambda^{-1})/\lambda^{-1}}$ aligned with the direction of $\theta$, and a minor axis of half-length $K$ aligned with the direction of $\theta^{\perp}$. As $\lambda\to 0$, representing no confidence in $\theta$, the half-length of the major axis converges to $K$, and the contour becomes a circle, as in ridge regression.

The two-dimensional hyper-parameter pair $(\delta,\lambda^{-1})$ enables the use of data-driven methods, such as grid-search cross-validation, for hyper-parameter tuning. Unlike in the strong-transferring setting, the inclusion of $\lambda^{-1}$ allows the data to determine the informativeness of the source knowledge.
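As one concrete (and assumed) way to carry out such tuning, the sketch below fits the weak-transferring estimator for each pair on a $(\delta,\lambda^{-1})$ grid by direct numerical minimization of $\sqrt{{\rm MSE}_{N}(\beta)}+\sqrt{\delta}\,\|\beta\|_{\Psi_{\lambda^{-1}}}$ (which shares its minimizer with the squared objective above) and keeps the pair with the best validation $R^{2}$; a generic quasi-Newton solver is used only for simplicity and is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_kg_weak(X, y, theta, delta, inv_lam):
    """Weak-transferring fit: minimize sqrt(MSE_N(beta)) + sqrt(delta)*||beta||_{Psi_{1/lambda}}."""
    d = X.shape[1]
    Psi = np.eye(d) - np.outer(theta, theta) / (theta @ theta + inv_lam)
    def obj(beta):
        mse = np.mean((y - X @ beta) ** 2)
        return np.sqrt(mse) + np.sqrt(delta) * np.sqrt(beta @ Psi @ beta + 1e-12)
    return minimize(obj, np.zeros(d), method="L-BFGS-B").x

def tune_kg_weak(X_tr, y_tr, X_val, y_val, theta, deltas, inv_lams):
    """Grid search over (delta, 1/lambda); select by out-of-sample R^2 on the validation set."""
    best, best_r2 = None, -np.inf
    for delta in deltas:
        for inv_lam in inv_lams:
            beta = fit_kg_weak(X_tr, y_tr, theta, delta, inv_lam)
            resid = y_val - X_val @ beta
            r2 = 1.0 - resid @ resid / np.sum((y_val - y_val.mean()) ** 2)
            if r2 > best_r2:
                best, best_r2 = (delta, inv_lam, beta), r2
    return best
```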

3.3 Binary Classification

In this section, we focus on the context of binary classification, where the goal is to predict the discrete label $Y\in\{-1,1\}$ based on the covariates $X\in\mathbb{R}^{d}$. Unlike the previous section, we use the $q$-norm, rather than its square, to account for distributional ambiguity in the covariate distribution. Define the strong-transferring cost function
\[
c_{1,\infty}\big((x,y),(u,v)\big)\coloneqq\|x-u\|_{q}+\infty\cdot|y-v|+\infty\cdot|\theta_{1}^{\sf T}x-\theta_{1}^{\sf T}u|+\ldots+\infty\cdot|\theta_{M}^{\sf T}x-\theta_{M}^{\sf T}u|.
\]
We consider two loss functions here. The logistic loss function is given by

\[
\ell(X,Y;\beta)=\log\big(1+e^{-Y\beta^{\sf T}X}\big),
\]

which is the negative log-likelihood of the model that postulates

\[
\log\frac{\mathbb{P}(Y=1\,|\,X=x)}{\mathbb{P}(Y=-1\,|\,X=x)}=\beta^{*{\sf T}}x.
\]

The hinge loss is given by

\[
\ell(X,Y;\beta)=(1-Y\beta^{\sf T}X)^{+},
\]

which is typically used for training classifiers that seek maximum-margin class boundaries, most notably support vector machines.

Suppose $Y\in\{-1,1\}$ is binary. Without any distributional assumptions on $X$, we have the following result, which recovers regularized logistic regression and support vector machines.

Theorem 3 (Binary Classification with Strong Transferring).

Suppose the loss function $\ell(X,Y;\beta)$ is either the logistic loss $\log\big(1+e^{-Y\beta^{\sf T}X}\big)$ or the hinge loss $(1-Y\beta^{\sf T}X)^{+}$. Then for any $q\in[1,\infty]$ we have

\[
\inf_{\beta\in\mathbb{R}^{d}}\ \sup_{\mathbb{P}\in\mathcal{B}_{\delta}(\mathbb{P}_{N};c_{1,\infty})}\ \mathbb{E}_{\mathbb{P}}\big[\ell(X,Y;\beta)\big]
=\inf_{\beta\in\mathbb{R}^{d},\,\vartheta\in\Theta}\ \frac{1}{N}\sum_{i=1}^{N}\ell(x_{i},y_{i};\beta)+\delta\|\beta-\vartheta\|_{p},
\]

where $p$ is such that $p^{-1}+q^{-1}=1$.
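For intuition, here is a minimal sketch (ours, not the authors' implementation) of the right-hand side of Theorem 3 with $p=1$ and a single source, where we take $\Theta=\{\kappa\theta:\kappa\in\mathbb{R}\}$ to absorb a possible scaling difference; a generic derivative-free solver is used only to convey the objective, and a proximal or coordinate-descent method would be preferable in practice.

```python
import numpy as np
from scipy.optimize import minimize

def kg_logistic(X, y, theta, delta):
    """Knowledge-guided logistic regression; labels y in {-1, +1}; returns (beta, kappa)."""
    N, d = X.shape
    def objective(z):
        beta, kappa = z[:d], z[d]
        loss = np.mean(np.logaddexp(0.0, -y * (X @ beta)))          # logistic loss log(1 + exp(-y x^T beta))
        return loss + delta * np.sum(np.abs(beta - kappa * theta))  # l1 deviation from the scaled prior
    z0 = np.concatenate([theta, [1.0]])                             # warm start at the prior
    sol = minimize(objective, z0, method="Powell")
    return sol.x[:d], sol.x[d]
```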

3.4 Sub-Coefficient-Vector Transferring

In this subsection, we generalize the statements of Theorems 1 and 3 for $p=2$ to arbitrary norms induced by positive-definite quadratic forms. Let $\Lambda\in\mathbb{R}^{d\times d}$ be a positive-definite symmetric matrix. The norm $\|x\|_{\Lambda}=\sqrt{x^{\sf T}\Lambda x}$ induces a metric on $\mathbb{R}^{d}$, defined as $d_{\Lambda}(x,u)=\|x-u\|_{\Lambda}$ and known as the Mahalanobis distance. Since $\Lambda$ is positive definite, it admits a decomposition $\Lambda=\Gamma^{\sf T}\Gamma$ with $\Gamma$ invertible, and the norm $\|x\|_{\Lambda}=\|\Gamma x\|_{2}$ measures length in the geometry distorted by $\Gamma$. By (Blanchet et al., 2019b, Lemma 1), the dual norm of $\|\cdot\|_{\Lambda}$ is $\|\cdot\|_{\Lambda^{-1}}$. Using Proposition A4, the statements of Theorems 1 and 3 can be easily generalized. Let $\mathbb{S}_{+}^{d\times d}$ denote the space of positive-definite symmetric matrices, and define the cost function
\[
c_{2,\infty}^{\Lambda}\big((x,y),(u,v)\big)\coloneqq\|x-u\|_{\Lambda}^{2}+\infty\cdot|y-v|+\infty\cdot\sum_{m=1}^{M}|\theta_{m}^{\sf T}x-\theta_{m}^{\sf T}u|.
\]

Corollary 1 (Theorem 1).

For the least-squares loss $\ell(X,Y;\beta)=(Y-\beta^{\sf T}X)^{2}$ and any $\Lambda\in\mathbb{S}_{+}^{d\times d}$:

\[
\inf_{\beta\in\mathbb{R}^{d}}\ \sup_{\mathbb{P}\in\mathcal{B}_{\delta}(\mathbb{P}_{N};c_{2,\infty}^{\Lambda})}\ \mathbb{E}_{\mathbb{P}}\big[(Y-\beta^{\sf T}X)^{2}\big]
=\inf_{\beta\in\mathbb{R}^{d},\,\vartheta\in\Theta}\left(\sqrt{{\rm MSE}_{N}(\beta)}+\sqrt{\delta}\,\|\beta-\vartheta\|_{\Lambda^{-1}}\right)^{2}.
\]

This formulation enables the use of metric learning methods to determine $\Lambda$ directly from the data, as detailed in Blanchet et al. (2019b). For example, if the two-dimensional prior $\theta=[\theta_{1},\theta_{2}]$ is known to primarily inform the first component of the truth $\beta=[\theta_{1}+\epsilon,\beta_{2}]$, we can select $\Lambda=\mathrm{diag}(d_{1},d_{2})$ with $d_{1}\ll d_{2}$. This imposes a weaker transport penalty along the first direction and yields the weighted penalty term $\min_{\kappa}\sqrt{(\beta_{1}-\kappa\theta_{1})^{2}/d_{1}+(\beta_{2}-\kappa\theta_{2})^{2}/d_{2}}$, which prioritizes aligning $\beta_{1}$ with $\theta_{1}$ while $\beta_{2}$ is determined more flexibly by the data. We call this sub-coefficient-vector transferring, that is, the ability to partially transfer prior knowledge. A similar corollary applies to Theorem 3, as stated in Corollary 2.
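As a small numerical check (our own sketch; the values $d_{1}=0.1$ and $d_{2}=10$ are chosen only for illustration), the inner minimization over $\kappa$ of the Mahalanobis penalty admits the closed form $\kappa^{*}=(\theta^{\sf T}\Lambda^{-1}\beta)/(\theta^{\sf T}\Lambda^{-1}\theta)$, a generalized-least-squares projection of $\beta$ onto $\theta$.

```python
import numpy as np

# Minimal sketch: closed-form minimizer of kappa -> ||beta - kappa*theta||_{Lambda^{-1}}
# for a single source direction theta and a diagonal Lambda = diag(d1, d2) with d1 << d2.
rng = np.random.default_rng(1)
beta, theta = rng.normal(size=2), rng.normal(size=2)
Lam_inv = np.diag([1 / 0.1, 1 / 10.0])          # Lambda^{-1} with d1 = 0.1, d2 = 10 (illustrative)
kappa = (theta @ Lam_inv @ beta) / (theta @ Lam_inv @ theta)
resid = beta - kappa * theta
print(kappa, np.sqrt(resid @ Lam_inv @ resid))  # optimal scaling and the penalty value at the optimum
```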

Finally, we again draw the reader’s attention to Table 1, which compares several transfer learning methods discussed in Section 1.1.2. Notably, our proposed KG-WDRO framework brings together a broad range of desirable capabilities within a single, unified approach to transfer learning.

4 Numerical Results


In this section, we present numerical simulations to validate the effectiveness of the proposed KG-WDRO method. We compare learners across different settings, including high-dimensional sparse models, correlated covariates, and multi-source prior knowledge, for either linear regression or binary classification tasks. Performance is evaluated using out-of-sample classification error for binary classifiers and out-of-sample $R^{2}$ for linear regressors.

For the single-source experiments, target-source coefficient pairs $(\beta,\theta)$ are generated from a multivariate normal distribution:

\[
(\beta_{j},\theta_{j})\sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}\sigma^{2}&\rho\sigma^{2}\\ \rho\sigma^{2}&\sigma^{2}\end{pmatrix}\right),\tag{4}
\]

where $\rho$ is the correlation between $\beta$ and $\theta$, and the expected length of $\theta$ is approximately $\sigma\sqrt{d-0.5}$. We scale $\beta$ as $\beta\leftarrow s\beta$ with $s\in(0,1]$ to study the stabilizing effect of strong prior knowledge in small-sample settings. The dimension-to-sample ratio $d/N$ is varied by fixing $d$ and increasing $N$. Performance is averaged over 100 simulations. Each dataset consists of three parts: data = (train, val, test). The train and validation sets have the same size, and hyperparameters are selected based on validation performance. The source data contain 800 samples, from which the source truth $\theta$ is estimated. Out-of-sample performance is measured on a test set of 5000 data points.
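A minimal sketch of this coefficient-generation step (our own code; the parameter values are illustrative) is given below.

```python
import numpy as np

# Minimal sketch (assumed setup) of scheme (4): each coordinate pair (beta_j, theta_j)
# is drawn from a bivariate normal with common variance sigma^2 and correlation rho,
# and beta is then rescaled by the signal strength s.
def draw_coefficients(d, sigma2, rho, s, rng):
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    pairs = rng.multivariate_normal(np.zeros(2), cov, size=d)   # shape (d, 2)
    beta, theta = pairs[:, 0], pairs[:, 1]
    return s * beta, theta

rng = np.random.default_rng(0)
beta, theta = draw_coefficients(d=50, sigma2=0.4, rho=0.8, s=1.0, rng=rng)
print(np.corrcoef(beta, theta)[0, 1])   # empirical correlation, close to rho
```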

4.1 Simulation 1: Logistic with $\ell_{1}$-Strong Transferring

In the first experiment, we compare two learners for binary classification tasks with high-dimensional sparse coefficients against our proposed KG-WDRO learner, $\beta_{\rm KG}$, derived using Theorem 3 with $p=1$. The competing learners are the target-only vanilla WDRO learner $\beta_{\rm WDRO}$ (Blanchet et al., 2019a, Theorem 2) and $\beta_{\rm TransG}$, obtained via the $\mathcal{A}$-Trans-GLM algorithm (Tian and Feng, 2023, Algorithm 1). The target-source pair $(\beta,\theta)$ is generated using (4) with a dimension of 50 and augmented with 100 zeros for sparsity, resulting in a total dimension of 150. We test six settings, varying the sample size $N\in\{20,50,80\}$, signal strength $s\in\{0.5,1\}$, and truth-prior correlation $\rho\in\{0.3,0.5,0.7,0.8,0.9,0.95\}$.

The comparison between $\beta_{\rm KG}$ and $\beta_{\rm TransG}$ is highly competitive, with $\beta_{\rm KG}$ consistently outperforming $\beta_{\rm TransG}$ by up to $2\%$ in accuracy when the sample size is small ($N=20$) across all values of $\rho$, as shown in the upper-left plot of Figure 2. In larger-sample scenarios, both learners perform similarly (see Table A1 for detailed results). Both transfer learning methods, $\beta_{\rm KG}$ and $\beta_{\rm TransG}$, significantly outperform the target-only learner $\beta_{\rm WDRO}$.

4.2 Simulation 2: Linear Regression with $\ell_{2}$-Weak Transferring

In this simulation, we compare two learners on high-dimensional linear regression with correlated covariates against our proposed learners, $\beta_{\rm KGweak}$ (Theorem 2) and $\beta_{\rm KGstrong}$ (Theorem 1), both using $p=2$. There is no sparsity in the regression coefficients. The competing learners are the target-only vanilla WDRO learner $\beta_{\rm WDRO}$ (Blanchet et al., 2019a, Theorem 1) and the Trans-Ridge algorithm adapted from (Li et al., 2021, Algorithm 1), denoted $\beta_{\rm TransR}$. The covariates are fixed at dimension 100, with a pairwise correlation of 0.3. The experiment is conducted across six settings, varying the sample size $N\in\{50,70,90\}$, signal strength $s\in\{0.8,1\}$, and truth-prior correlation $\rho\in\{0.3,0.5,0.7,0.8,0.9,0.95\}$.

As shown in the upper-right plot of Figure 2, the performance of $\beta_{\rm TransR}$ lags significantly behind both $\beta_{\rm KGstrong}$ and $\beta_{\rm KGweak}$ until the correlation $\rho$ becomes sufficiently high. Across all settings, $\beta_{\rm KGstrong}$ and $\beta_{\rm KGweak}$ consistently outperform $\beta_{\rm TransR}$ when $\rho$ is moderate or low, as documented in Table A2. Furthermore, all three transfer learning methods demonstrate superior performance compared to the target-only learner $\beta_{\rm WDRO}$.

4.3 Simulation 3: Transfer Learning with Multiple Sites

In the final set of experiments, we validate our methods in a multi-source transfer learning setting with high-dimensional sparse linear regression. The significant components of the three source coefficients, denoted $\{\theta_{1},\theta_{2},\theta_{3}\}$, are generated using (4) with correlation $\varrho$ and dimension 50. We construct a linear combination $\theta_{S}=a\theta_{1}+b\theta_{2}+c\theta_{3}$ and generate $\beta=\rho\theta_{S}+\varepsilon$, where $\varepsilon\sim N(0,(1-\rho^{2})\mathrm{Var}(\theta_{S}))$, ensuring $\mathrm{Corr}(\beta,\theta_{S})=\rho$. The vector $\beta$ is then scaled to match the magnitude of $\theta_{S}$, and all vectors are augmented with 100 zeros, yielding a total dimension of 150. Our proposed method, $\beta_{\rm KG}$ (Theorem 1, $p=1$), is compared against the oracle Trans-Lasso algorithm (Li et al., 2021, Algorithm 1), denoted $\beta_{\rm TransL}$, and the vanilla WDRO learner $\beta_{\rm WDRO}$. The experiment spans six settings: $[a,b,c]=[1,-0.5,0.2]$ and $[1,1,1]$, with $\varrho=0.9$ and $0.6$, respectively, and sample sizes $N\in\{50,60,70\}$. The truth-prior correlation ranges over $\rho\in\{0.7,0.75,0.8,0.85,0.9,0.95\}$.
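The sketch below (our own illustration of the construction just described; the function and variable names are ours) builds $\theta_{S}$ and the target $\beta$ from three source coefficient vectors and zero-pads them for sparsity.

```python
import numpy as np

# Minimal sketch (assumed setup) of the multi-source target construction:
# beta = rho * theta_S + eps with eps variance (1 - rho^2) * Var(theta_S),
# then rescaled so that ||beta||_2 matches ||theta_S||_2, and zero-padded.
def make_target(thetas, weights, rho, n_zeros, rng):
    theta_S = sum(w * t for w, t in zip(weights, thetas))
    eps = rng.normal(0.0, np.sqrt((1 - rho**2) * np.var(theta_S)), size=theta_S.size)
    beta = rho * theta_S + eps
    beta *= np.linalg.norm(theta_S) / np.linalg.norm(beta)     # match magnitude
    pad = np.zeros(n_zeros)
    return np.concatenate([beta, pad]), np.concatenate([theta_S, pad])
```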

When $[a,b,c]=[1,-0.5,0.2]$, the contributions of the $\theta$'s to the generation of $\theta_{S}$ are unequal. In this case, it is not surprising that $\beta_{\rm KG}$ outperforms $\beta_{\rm TransL}$, as shown in the bottom-left plot of Figure 2. When $\theta_{S}$ is an equal-weighted average of the $\theta$'s ($[a,b,c]=[1,1,1]$), the performance of $\beta_{\rm KG}$ and $\beta_{\rm TransL}$ becomes similar. However, $\beta_{\rm KG}$ still demonstrates superior performance at larger sample sizes and higher correlations, as documented in Table A3.

5 Conclusion

We propose the knowledge-guided Wasserstein distributionally robust optimization (KG-WDRO) framework, which utilizes prior knowledge of predictors to mitigate the over-conservativeness of conventional DRO methods. We establish tractable reformulations and demonstrate their superior performance compared to other methods. For future work, we aim to provide statistical guarantees for our proposed estimators. Furthermore, based on these statistical properties, we plan to develop a principled approach for selecting hyperparameters such as $\delta$ and $\lambda$.

References

  • Bastani, H. (2020). Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984.
  • Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.
  • Blanchet, J., Kang, Y., and Murthy, K. (2019a). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
  • Blanchet, J., Kang, Y., Murthy, K., and Zhang, F. (2019b). Data-driven optimal transport cost selection for distributionally robust optimization. In 2019 Winter Simulation Conference (WSC), pages 3740–3751.
  • Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
  • Blanchet, J., Murthy, K., and Nguyen, V. A. (2021). Statistical analysis of Wasserstein distributionally robust estimators. In Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pages 227–254. INFORMS.
  • Blanchet, J., Murthy, K., and Si, N. (2022). Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2):295–315.
  • Cai, T., Li, M., and Liu, M. (2024). Semi-supervised triply robust inductive transfer learning. Journal of the American Statistical Association, pages 1–14.
  • Gao, R. (2023). Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Operations Research, 71(6):2291–2306.
  • Gao, R., Chen, X., and Kleywegt, A. J. (2022). Wasserstein distributionally robust optimization and variation regularization. Operations Research, 72(3):1177–1191.
  • Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655.
  • Gu, T., Han, Y., and Duan, R. (2024). Robust angle-based transfer learning in high dimensions. Journal of the Royal Statistical Society Series B: Statistical Methodology.
  • He, Z., Sun, Y., and Li, R. (2024). Transfusion: Covariate-shift robust transfer learning for high-dimensional regression. In International Conference on Artificial Intelligence and Statistics, pages 703–711. PMLR.
  • Kuhn, D., Shafiee, S., and Wiesemann, W. (2024). Distributionally robust optimization. arXiv preprint arXiv:2411.02549.
  • Li, S., Cai, T. T., and Li, H. (2021). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):149–173.
  • Lin, Z., Zhao, J., Wang, F., and Wang, H. (2024). Profiled transfer learning for high dimensional linear model. arXiv preprint arXiv:2406.00701.
  • Liu, J., Wang, T., Cui, P., and Namkoong, H. (2024). Rethinking distribution shifts: Empirical analysis and inductive modeling for tabular data. arXiv preprint arXiv:2307.05284.
  • Luenberger, D. G. and Ye, Y. (2008). Linear and Nonlinear Programming. International Series in Operations Research & Management Science. Springer, New York, NY.
  • Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166.
  • Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ.
  • Shafieezadeh-Abadeh, S., Esfahani, P. M., and Kuhn, D. (2015). Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, volume 28.
  • Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. (2019). Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68.
  • Tian, Y. and Feng, Y. (2023). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, 118(544):2684–2697.
  • Toland, J. F. (1978). Duality in nonconvex optimization. Journal of Mathematical Analysis and Applications, 66(2):399–415.
  • Toland, J. F. (1979). A duality principle for non-convex optimisation and the calculus of variations. Archive for Rational Mechanics and Analysis, 71:41–61.
  • Villani, C. (2009). Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 1st edition.
  • Zhou, D., Li, M., Cai, T., and Liu, M. (2024). Model-assisted and knowledge-guided transfer regression for the underrepresented population. arXiv preprint arXiv:2410.06484.

Appendix A Additional Details in Numerical Results

This section provides details to supplement Section 4. We outline the data-generating distributions for all three sets of experiments, the hyperparameter grids, and the learners used to identify prior knowledge, and we present the exact numerical results. Recall that $s\in(0,1]$ represents the signal strength of the true parameter $\beta$ and acts by rescaling the magnitude of $\beta$ such that $\beta\leftarrow s\beta$. The notation $d$ is the dimension of the covariates, and $N$ is the sample size. Finally, $\rho$ denotes the correlation between the truth $\beta$ and the prior $\theta$.

A.1 Simulation Results

A.1.1 Simulation 1: Logistic Regression

Table A1: Simulation 1 (logistic regression), out-of-sample classification accuracy. The last column reports the target-only WDRO learner, which does not depend on ρ.

Setting        Method     ρ=0.3  ρ=0.5  ρ=0.7  ρ=0.8  ρ=0.9  ρ=0.95  WDRO
s=1, N=20      KG-WDRO    0.587  0.647  0.714  0.748  0.794  0.817   0.565
               Trans-GLM  0.585  0.641  0.702  0.735  0.778  0.800   -
s=1, N=50      KG-WDRO    0.586  0.647  0.713  0.751  0.797  0.823   0.619
               Trans-GLM  0.586  0.645  0.710  0.752  0.792  0.823   -
s=1, N=80      KG-WDRO    0.583  0.646  0.713  0.751  0.798  0.823   0.654
               Trans-GLM  0.584  0.646  0.714  0.755  0.800  0.826   -
s=0.5, N=20    KG-WDRO    0.581  0.634  0.690  0.721  0.762  0.787   0.549
               Trans-GLM  0.579  0.626  0.674  0.708  0.748  0.760   -
s=0.5, N=50    KG-WDRO    0.580  0.635  0.689  0.728  0.768  0.794   0.588
               Trans-GLM  0.579  0.633  0.693  0.723  0.769  0.789   -
s=0.5, N=80    KG-WDRO    0.581  0.637  0.700  0.732  0.775  0.790   0.617
               Trans-GLM  0.581  0.638  0.702  0.737  0.779  0.799   -

A.1.2 Simulation 2: Linear Regression

Table A2: Simulation 2 (linear regression), out-of-sample R². The last column reports the target-only WDRO learner.

Setting        Method            ρ=0.3  ρ=0.5  ρ=0.7  ρ=0.8  ρ=0.9  ρ=0.95  WDRO
s=1, N=50      KG-WDRO (Strong)  0.585  0.645  0.740  0.801  0.870  0.912   0.108
               KG-WDRO (Weak)    0.583  0.646  0.741  0.800  0.871  0.910   -
               Trans-Ridge       0.391  0.548  0.706  0.786  0.870  0.915   -
s=1, N=70      KG-WDRO (Strong)  0.707  0.745  0.803  0.843  0.894  0.924   0.513
               KG-WDRO (Weak)    0.704  0.743  0.803  0.842  0.892  0.923   -
               Trans-Ridge       0.599  0.692  0.788  0.838  0.893  0.925   -
s=1, N=90      KG-WDRO (Strong)  0.806  0.827  0.859  0.881  0.911  0.932   0.758
               KG-WDRO (Weak)    0.804  0.825  0.857  0.880  0.910  0.930   -
               Trans-Ridge       0.762  0.802  0.849  0.877  0.910  0.932   -
s=0.8, N=50    KG-WDRO (Strong)  0.563  0.621  0.716  0.777  0.850  0.894   0.030
               KG-WDRO (Weak)    0.561  0.622  0.716  0.777  0.849  0.892   -
               Trans-Ridge       0.213  0.405  0.600  0.700  0.803  0.858   -
s=0.8, N=70    KG-WDRO (Strong)  0.673  0.713  0.774  0.818  0.872  0.905   0.361
               KG-WDRO (Weak)    0.670  0.710  0.774  0.816  0.869  0.903   -
               Trans-Ridge       0.470  0.585  0.704  0.768  0.837  0.875   -
s=0.8, N=90    KG-WDRO (Strong)  0.768  0.791  0.826  0.851  0.886  0.911   0.703
               KG-WDRO (Weak)    0.765  0.788  0.825  0.851  0.885  0.909   -
               Trans-Ridge       0.671  0.724  0.785  0.821  0.863  0.890   -

A.1.3 Simulation 3: Multi-Sites

Here, recall that $\varrho$ denotes the correlation used in scheme (4) to generate the three sources of prior knowledge.

Table A3: Simulation 3 (multi-source linear regression), out-of-sample R². The last column reports the target-only WDRO learner.

Setting                      Method       ρ=0.7  ρ=0.75  ρ=0.8  ρ=0.85  ρ=0.9  ρ=0.95  WDRO
[1,-0.5,0.2], ϱ=0.9, N=50    KG-WDRO      0.560  0.640   0.713  0.783   0.850  0.916   -0.584
                             Trans-Lasso  0.578  0.625   0.673  0.723   0.767  0.815   -
[1,-0.5,0.2], ϱ=0.9, N=60    KG-WDRO      0.674  0.728   0.776  0.825   0.875  0.926   0.027
                             Trans-Lasso  0.666  0.697   0.732  0.770   0.808  0.850   -
[1,-0.5,0.2], ϱ=0.9, N=70    KG-WDRO      0.793  0.820   0.848  0.878   0.907  0.939   0.375
                             Trans-Lasso  0.756  0.779   0.805  0.832   0.857  0.882   -
[1,1,1], ϱ=0.6, N=50         KG-WDRO      0.565  0.642   0.715  0.785   0.852  0.916   -2.837
                             Trans-Lasso  0.628  0.680   0.735  0.790   0.838  0.889   -
[1,1,1], ϱ=0.6, N=60         KG-WDRO      0.673  0.729   0.778  0.829   0.877  0.928   -0.015
                             Trans-Lasso  0.708  0.744   0.786  0.826   0.863  0.902   -
[1,1,1], ϱ=0.6, N=70         KG-WDRO      0.797  0.825   0.852  0.880   0.911  0.942   0.354
                             Trans-Lasso  0.794  0.820   0.844  0.868   0.894  0.919   -

A.2 Simulation Setup

Let ${\rm Ber}(p)$ denote a Bernoulli distribution with success probability $p$, $\mathcal{U}[a,b]$ a uniform distribution supported on $[a,b]$, and $\mathcal{N}(\mu,\sigma^{2})$ a univariate normal distribution with mean $\mu$ and variance $\sigma^{2}$.

A.2.1 Simulation 1: Logistic Regression

In this simulation, the coefficients are generated in a high-dimensional sparse setting. The dimension of the nonzero components is set to 50, which is then augmented with 100 zero components to introduce sparsity. The nonzero components of the true coefficient-prior pair $(\beta,\theta)$ are generated using the multivariate normal scheme in (4), with component variance $\sigma^{2}=0.4$ and $\rho\in\{0.3,0.5,0.7,0.8,0.9,0.95\}$. The target labels are generated as $Y_{\rm target}\sim{\rm Ber}\big(1/(1+\exp(-\beta^{\sf T}X))\big)$ and the source labels as $Y_{\rm source}\sim{\rm Ber}\big(1/(1+\exp(-\theta^{\sf T}X))\big)$, where $X\sim\mathcal{U}[-2,2]^{150}$. The sample size $N$ for $(X_{\rm target},Y_{\rm target})$ is varied across $\{20,50,80\}$, while the sample size for the source data $(X_{\rm source},Y_{\rm source})$ is fixed at 800. Each dataset is paired with a validation set of the same size for hyperparameter selection.
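A minimal sketch of this label-generating mechanism (our own code; labels are coded as $\{-1,+1\}$ to match the losses in Section 3.3) is shown below.

```python
import numpy as np

# Minimal sketch of the Simulation-1 label mechanism: covariates uniform on [-2, 2]^d
# and labels drawn from a logistic model with coefficient vector coef.
def draw_logistic_data(coef, n, rng):
    d = coef.size
    X = rng.uniform(-2.0, 2.0, size=(n, d))
    p = 1.0 / (1.0 + np.exp(-X @ coef))
    Y = np.where(rng.uniform(size=n) < p, 1, -1)   # labels coded as {-1, +1}
    return X, Y
```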

Let ${\rm grid}_{1}$ denote a hyperparameter grid ranging from 0.0001 to 1 with 10 log-spaced values, and let ${\rm grid}_{2}$ denote a hyperparameter grid ranging from 0.0001 to 2 with 20 log-spaced values. The $\beta_{\rm WDRO}$ estimator is learned by selecting the best-performing hyperparameter on ${\rm grid}_{1}$ using validation data. For the $\mathcal{A}$-Trans-GLM learner (Tian and Feng, 2023, Algorithm 1), the transferring step is optimized using ${\rm grid}_{1}$ and the debiasing step using ${\rm grid}_{2}$. For the KG-WDRO learner $\beta_{\rm KG}$ proposed in Theorem 3 with $p=1$, the prior $\theta$ is first learned from the source data using the vanilla WDRO method on ${\rm grid}_{1}$, followed by learning $\beta_{\rm KG}$ on ${\rm grid}_{2}$ with the learned $\theta_{\rm WDRO}$ as input.

The simulations are conducted on the parameter grid
\[
N\in\{20,50,80\}\times\rho\in\{0.3,0.5,0.7,0.8,0.9,0.95\}\times s\in\{0.5,1\},
\]
with each configuration repeated 100 times. The average results are reported.

A.2.2 Simulation 2: Linear Regression

In this simulation, the coefficients are generated in a high-dimensional correlated setting. The dimension of the coefficients is set to 100, and the components of the true coefficient-prior pair $(\beta,\theta)$ are generated using the multivariate normal scheme in (4), with component variance $\sigma^{2}=0.1$ and $\rho\in\{0.3,0.5,0.7,0.8,0.9,0.95\}$. The target labels are generated as $Y_{\rm target}\sim\mathcal{N}(\beta^{\sf T}X,\sqrt{0.5})$ and the source labels as $Y_{\rm source}\sim\mathcal{N}(\theta^{\sf T}X,\sqrt{0.5})$, where $X\sim\mathcal{N}(\mathbf{0},\Sigma)$ with

\[
\Sigma_{i,j}=\begin{cases}1&\text{if }i=j,\\ 0.3&\text{if }i\neq j,\end{cases}\qquad\text{for all }i,j=1,2,\dots,100.
\]

The sample size $N$ for $(X_{\rm target},Y_{\rm target})$ is varied across $\{50,70,90\}$, while the sample size for the source data $(X_{\rm source},Y_{\rm source})$ is fixed at 800. Each dataset is paired with a validation set of the same size for hyperparameter selection.

Let ${\rm grid}_{1}$ denote a hyperparameter grid ranging from 0.0001 to 1 with 10 log-spaced values, and let ${\rm grid}_{2}$ denote a hyperparameter grid ranging from 0.0001 to 1.5 with 20 log-spaced values. The $\beta_{\rm WDRO}$ estimator is learned by selecting the best-performing hyperparameter on ${\rm grid}_{1}$ using validation data. For the Trans-Ridge learner adapted from (Li et al., 2021, Algorithm 1), the transferring step is optimized using ${\rm grid}_{1}$ and the debiasing step using ${\rm grid}_{2}$. For the KG-WDRO learners $\beta_{\rm KGstrong}$ (Theorem 1 with $p=2$) and $\beta_{\rm KGweak}$ (Theorem 2), the prior $\theta$ is first learned from the source data using the vanilla WDRO method on ${\rm grid}_{1}$, followed by learning the estimators on ${\rm grid}_{2}$ with the learned $\theta_{\rm WDRO}$ as input. The $\lambda^{-1}$ grid for $\beta_{\rm KGweak}$ ranges from 0.0001 to 8 with 20 log-spaced values.

The simulations are conducted on the parameter grid
\[
N\in\{50,70,90\}\times\rho\in\{0.3,0.5,0.7,0.8,0.9,0.95\}\times s\in\{0.8,1\},
\]
with each configuration repeated 100 times. The average results are reported.

A.2.3 Simulation 3: Multiple Sites

In this simulation, the coefficients are generated in a high-dimensional sparse setting. The dimension of the nonzero components is set to 50, which is then augmented with 100 zero components to introduce sparsity. The number of external sources is three; we generate their coefficients $\theta_{1},\theta_{2},\theta_{3}$ using scheme (4). We construct a linear combination $\theta_{S}=a\theta_{1}+b\theta_{2}+c\theta_{3}$ and generate the target coefficient $\beta=\rho\theta_{S}+\varepsilon$, where $\varepsilon\sim N(0,(1-\rho^{2})\mathrm{Var}(\theta_{S}))$, ensuring $\mathrm{Corr}(\beta,\theta_{S})=\rho$. The target coefficient $\beta$ is then scaled to match the magnitude of $\theta_{S}$.

The target labels are generated as $Y_{\rm target}\sim\mathcal{N}(\beta^{\sf T}X,\sqrt{0.5})$ and the source labels as $Y_{{\rm source},m}\sim\mathcal{N}(\theta_{m}^{\sf T}X,\sqrt{0.5})$ for $m\in[3]$, where $X\sim\mathcal{N}(\mathbf{0},\Sigma)$ with

\[
\Sigma_{i,j}=\begin{cases}1&\text{if }i=j,\\ 0.1&\text{if }i\neq j,\end{cases}\qquad\text{for all }i,j=1,2,\dots,150.
\]

The sample size for the target data ranges over $\{50,60,70\}$.

Let ${\rm grid}_{1}$ denote a hyperparameter grid ranging from 0.0001 to 1 with 15 log-spaced values, and let ${\rm grid}_{2}$ denote a hyperparameter grid ranging from 0.0001 to 3 with 20 log-spaced values. The $\beta_{\rm WDRO}$ estimator is learned by selecting the best-performing hyperparameter on ${\rm grid}_{1}$ using validation data. For the oracle Trans-Lasso learner (Li et al., 2021, Algorithm 1), the transferring step is optimized using ${\rm grid}_{1}$ and the debiasing step using ${\rm grid}_{2}$, using all three source datasets. For the KG-WDRO learner $\beta_{\rm KG}$ proposed in Theorem 1 with $p=1$, the priors $\theta_{1},\theta_{2},\theta_{3}$ are first learned from the three source datasets using the vanilla WDRO method on ${\rm grid}_{1}$, followed by learning $\beta_{\rm KG}$ on ${\rm grid}_{2}$ with the learned $\theta_{1,\rm WDRO},\theta_{2,\rm WDRO},\theta_{3,\rm WDRO}$ as input.

The simulations are conducted on the parameter grid
\[
N\in\{50,60,70\}\times\rho\in\{0.7,0.75,0.8,0.85,0.9,0.95\}\times[a,b,c]\in\{[1,-0.5,0.2],[1,1,1]\},
\]
with each configuration repeated 100 times. The average results are reported.

Appendix B Proof of Results in Regression.

Lemma A1.

Let $f_{\beta}:\mathbb{R}^{d}\to\mathbb{R}$ be defined as $\Delta\in\mathbb{R}^{d}\mapsto(\beta^{\sf T}\Delta)^{2}-2r(\beta)\beta^{\sf T}\Delta$, depending on some $\beta\in\mathbb{R}^{d}$, and let $r(\beta)$ be a non-negative real-valued function of $\beta$. Then the convex conjugate $f_{\beta}^{*}$ of $f_{\beta}$ is given by

fβ(Δ)={(β𝖳Δ+2r(β)β22)24β24ifΔspanβ,+otherwise.superscriptsubscript𝑓𝛽superscriptΔcasessuperscriptsuperscript𝛽𝖳superscriptΔ2𝑟𝛽superscriptsubscriptdelimited-∥∥𝛽2224superscriptsubscriptdelimited-∥∥𝛽24ifsuperscriptΔspan𝛽otherwisef_{\beta}^{*}(\Delta^{*})=\begin{cases}\dfrac{(\beta^{\scriptscriptstyle\sf T}%\Delta^{*}+2r(\beta)\lVert\beta\rVert_{2}^{2})^{2}}{4\lVert\beta\rVert_{2}^{4}%}&\text{if }\Delta^{*}\in\operatorname{span}{\beta},\\+\infty&\text{otherwise}.\end{cases}italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( roman_Δ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) = { start_ROW start_CELL divide start_ARG ( italic_β start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT roman_Δ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + 2 italic_r ( italic_β ) ∥ italic_β ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 4 ∥ italic_β ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL if roman_Δ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∈ roman_span italic_β , end_CELL end_ROW start_ROW start_CELL + ∞ end_CELL start_CELL otherwise . end_CELL end_ROW

Therefore the biconjugate fβ(Δ):d:superscriptsubscript𝑓𝛽absentΔsuperscript𝑑f_{\beta}^{**}(\Delta):\mathbb{R}^{d}\to\mathbb{R}italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT ( roman_Δ ) : blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_R of fβ(Δ)subscript𝑓𝛽Δf_{\beta}(\Delta)italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ( roman_Δ ) has representation:

fβ(Δ)=supα(α(β𝖳Δ)(α+2r(β))24).superscriptsubscript𝑓𝛽absentΔsubscriptsupremum𝛼𝛼superscript𝛽𝖳Δsuperscript𝛼2𝑟𝛽24f_{\beta}^{**}(\Delta)=\sup_{\alpha\in\mathbb{R}}\left(\alpha(\beta^{%\scriptscriptstyle\sf T}\Delta)-\dfrac{(\alpha+2r(\beta))^{2}}{4}\right).italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT ( roman_Δ ) = roman_sup start_POSTSUBSCRIPT italic_α ∈ blackboard_R end_POSTSUBSCRIPT ( italic_α ( italic_β start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT roman_Δ ) - divide start_ARG ( italic_α + 2 italic_r ( italic_β ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 4 end_ARG ) .
Proof.

The convex conjugate $f_{\beta}^{*}(\Delta^{*})$ is defined as

\[
f_{\beta}^{*}(\Delta^{*})\coloneqq\sup_{\Delta\in\mathbb{R}^{d}}\big(\Delta^{*{\sf T}}\Delta-(\beta^{\sf T}\Delta)^{2}+2r(\beta)(\beta^{\sf T}\Delta)\big),
\]

where $\Delta^{*},\beta\in\mathbb{R}^{d}$ and $r(\beta)\in\mathbb{R}$ are taken as fixed values. Orthogonalize $\Delta=a\beta+\omega$ in the direction of $\beta$, with $a\in\mathbb{R}$ and $\omega\in\mathbb{R}^{d}$ such that $\beta^{\sf T}\omega=0$. Then we have $\Delta^{*{\sf T}}\Delta=a\Delta^{*{\sf T}}\beta+\Delta^{*{\sf T}}\omega$, and the convex conjugate becomes

\begin{align*}
f^{*}(\Delta^{*})&=\sup_{a,\omega}\big(a(\Delta^{*{\sf T}}\beta)+\Delta^{*{\sf T}}\omega-a^{2}\|\beta\|_{2}^{4}+2ar(\beta)\|\beta\|_{2}^{2}\big)\\
&\quad\text{s.t.}\quad\beta^{\sf T}\omega=0.
\end{align*}

Fixing $\omega$, the objective is a concave quadratic in $a$ and hence bounded from above by a finite value. Now, if $\Delta^{*}$ is not orthogonal to every feasible $\omega$, the term $\sup_{\omega}\Delta^{*{\sf T}}\omega$ is unbounded and the convex conjugate is $f^{*}(\Delta^{*})=+\infty$. If $\Delta^{*}$ is orthogonal to every such $\omega$, the convex conjugate attains a finite value. Note that $\Delta^{*{\sf T}}\omega=0$ for all $\omega$ with $\beta^{\sf T}\omega=0$ if and only if $\Delta^{*}\in\operatorname{span}\{\beta\}$. Hence, conditioning on $\{\Delta^{*}=\alpha\beta\,;\,\alpha\in\mathbb{R}\}$, we have

\begin{align*}
f^{*}(\Delta^{*})&=\sup_{a}\big(a(\Delta^{*{\sf T}}\beta)-a^{2}\|\beta\|_{2}^{4}+2r(\beta)a\|\beta\|_{2}^{2}\big)\\
&=\dfrac{\big(\Delta^{*{\sf T}}\beta+2r(\beta)\|\beta\|_{2}^{2}\big)^{2}}{4\|\beta\|_{2}^{4}},
\end{align*}

where the supremum is attained at $a^{*}=\dfrac{\alpha+2r(\beta)}{2\|\beta\|_{2}^{2}}$, and the coefficient $\alpha$ is the projection scalar $\alpha=\dfrac{\Delta^{*{\sf T}}\beta}{\|\beta\|_{2}^{2}}$.

The biconjugate

\[
f^{**}(\Delta)=\sup_{\Delta^{*}}\big(\Delta^{\sf T}\Delta^{*}-f^{*}(\Delta^{*})\big)
\]

has a finite inner term only when $\Delta^{*}\in\operatorname{span}\{\beta\}$, so the supremum may be restricted to this set. Let $\Delta^{*}=\alpha\beta$ for some $\alpha\in\mathbb{R}$; substituting, we get the representation

\begin{align*}
f^{**}(\Delta)&=\sup_{\alpha}\left(\Delta^{\sf T}(\alpha\beta)-\dfrac{\big(\beta^{\sf T}(\alpha\beta)+2r(\beta)\|\beta\|_{2}^{2}\big)^{2}}{4\|\beta\|_{2}^{4}}\right)\\
&=\sup_{\alpha}\left(\alpha(\Delta^{\sf T}\beta)-\dfrac{(\alpha+2r(\beta))^{2}}{4}\right).
\end{align*}

It can be readily verified that $f^{**}(\Delta)=f_{\beta}(\Delta)$, as promised by the Fenchel-Moreau Theorem (Theorem A4). ∎
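As a quick numerical sanity check of this representation (not part of the proof), one can compare $f_{\beta}(\Delta)$ with the supremum over $\alpha$ evaluated on a fine grid; all inputs below are arbitrary test values.

import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=5)
Delta = rng.normal(size=5)
r = 0.7                               # plays the role of r(beta) >= 0

t = beta @ Delta
f_direct = t ** 2 - 2 * r * t         # f_beta(Delta)

alphas = np.linspace(-50.0, 50.0, 2_000_001)
f_biconjugate = np.max(alphas * t - (alphas + 2 * r) ** 2 / 4)

# The two values agree up to the grid resolution; the maximizer is alpha = 2 * (t - r).
print(f_direct, f_biconjugate)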

Lemma A2.

Let $g_{\theta}:\mathbb{R}^{d}\to\mathbb{R}$ be defined as $\Delta\in\mathbb{R}^{d}\mapsto|\theta^{\sf T}\Delta|$ for some $\theta\in\mathbb{R}^{d}$. Then the convex conjugate $g_{\theta}^{*}(\Delta^{*})$ is given by

\[
g_{\theta}^{*}(\Delta^{*})=\begin{cases}0&\text{if }\Delta^{*}=\alpha\theta\text{ and }|\alpha|\leq 1,\\
+\infty&\text{otherwise}.\end{cases}
\]

Therefore the convex conjugate of the function $g(\Delta)\coloneqq\gamma\sum_{m=1}^{M}g_{\theta_{m}}(\Delta)$ for some $\gamma>0$ is given by

\[
g^{*}(\Delta^{*})=\begin{cases}0&\text{if }\Delta^{*}=\sum_{m=1}^{M}\alpha_{m}\theta_{m}\text{ and }|\alpha_{m}|\leq\gamma\text{ for each }m,\\
+\infty&\text{otherwise}.\end{cases}
\]
Proof.

The convex conjugate is defined as

\[
g_{\theta}^{*}(\Delta^{*})=\sup_{\Delta}\big(\Delta^{*{\sf T}}\Delta-|\theta^{\sf T}\Delta|\big).
\]

Again, orthogonalize $\Delta=a\theta+\omega$, where $a=\dfrac{\theta^{\sf T}\Delta}{\|\theta\|_{2}^{2}}$ and $\theta^{\sf T}\omega=0$. With the change of variable $u\coloneqq\theta^{\sf T}\Delta$, the convex conjugate becomes

\begin{align*}
g^{*}_{\theta}(\Delta^{*})&=\sup_{u,\omega}\left(\dfrac{u}{\|\theta\|_{2}^{2}}(\Delta^{*{\sf T}}\theta)+\Delta^{*{\sf T}}\omega-|u|\right)\\
&\quad\text{s.t.}\quad\theta^{\sf T}\omega=0.
\end{align*}

Thus the convex conjugate $g_{\theta}^{*}(\Delta^{*})=+\infty$ if $\Delta^{*}\notin\operatorname{span}\{\theta\}$. If $\Delta^{*}=\alpha\theta$ for some $\alpha\in\mathbb{R}$, then

\begin{align*}
g^{*}_{\theta}(\Delta^{*})=g^{*}_{\theta}(\alpha\theta)&=\sup_{u}\left(\dfrac{u}{\|\theta\|_{2}^{2}}\,\alpha\|\theta\|_{2}^{2}-|u|\right)\\
&=\sup_{u}\big(\alpha u-|u|\big)\\
&=\begin{cases}0&\text{if }|\alpha|\leq 1,\\
+\infty&\text{otherwise},\end{cases}
\end{align*}

where the last equality holds by noting that $\sup_{u}\big(\alpha u-|u|\big)=|\cdot|^{*}(\alpha)$ is the convex conjugate of the absolute value function (Proposition A2). This proves the expression for $g_{\theta}^{*}(\Delta^{*})$. Now write $g(\Delta)=\gamma\sum_{m=1}^{M}g_{\theta_{m}}(\Delta)=\gamma\bar{g}(\Delta)$; the convex conjugate of $\bar{g}(\Delta)$ is

\begin{align*}
\bar{g}^{*}(\Delta^{*})&=(g_{\theta_{1}}+\ldots+g_{\theta_{M}})^{*}(\Delta^{*})\\
&=\inf\big\{g^{*}_{\theta_{1}}(\Delta^{*}_{1})+\ldots+g^{*}_{\theta_{M}}(\Delta^{*}_{M})\;:\;\Delta^{*}_{1}+\ldots+\Delta^{*}_{M}=\Delta^{*}\big\},
\end{align*}

where the second line follows from the infimal-convolution property of sums of convex functions (Theorem A5). The value $\bar{g}^{*}(\Delta^{*})$ is finite if and only if $g^{*}_{\theta_{m}}(\Delta_{m}^{*})=0$ for all $m\in[M]$, that is, $\Delta_{m}^{*}=\alpha_{m}\theta_{m}$ for some $\alpha_{m}\in[-1,1]$, for all $m\in[M]$. Hence $\bar{g}^{*}(\Delta^{*})=0$ if and only if $\Delta^{*}=\sum_{m=1}^{M}\alpha_{m}\theta_{m}$ with $\alpha_{m}\in[-1,1]$ for all $m\in[M]$. Finally, we can compute $g^{*}(\Delta^{*})=(\gamma\bar{g})^{*}(\Delta^{*})=\gamma\bar{g}^{*}\!\left(\Delta^{*}/\gamma\right)$ by the scaling law of convex conjugates (Proposition A3), since $\gamma>0$. This concludes the proof. ∎
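As an optional numerical illustration of this dichotomy (assuming nothing beyond the lemma itself), one can approximate $\sup_{\Delta}\big(\Delta^{*{\sf T}}\Delta-|\theta^{\sf T}\Delta|\big)$ by a random search over balls of growing radius: for $\Delta^{*}=\alpha\theta$ with $|\alpha|\leq 1$ the approximation stays at zero, while for a generic $\Delta^{*}$ it grows with the radius.

import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=4)

def approx_conjugate(delta_star, radius, n_dirs=20000):
    # Approximate the sup over a ball of the given radius by random directions (Delta = 0 gives 0).
    dirs = rng.normal(size=(n_dirs, theta.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    Delta = radius * dirs
    vals = Delta @ delta_star - np.abs(Delta @ theta)
    return max(vals.max(), 0.0)

inside = 0.5 * theta                          # alpha = 0.5, |alpha| <= 1: conjugate equals 0
outside = 0.5 * theta + rng.normal(size=4)    # generic point off span{theta}: conjugate is +infinity
for radius in (1.0, 10.0, 100.0):
    print(radius, approx_conjugate(inside, radius), approx_conjugate(outside, radius))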

We now give the proof of Theorem 1.

Proof of Theorem 1.

Let $r(\beta)\coloneqq y-\beta^{\sf T}x$. First consider the cost function

\[
c_{2}\big((x,y),(u,v)\big)\coloneqq\|x-u\|_{q}^{2}+\infty\cdot|y-v|+d(\theta_{1}^{\sf T}x-\theta_{1}^{\sf T}u)+\ldots+d(\theta_{M}^{\sf T}x-\theta_{M}^{\sf T}u),
\]

where the infinite transferring strength is replaced by a finite-valued distance function $d:\mathbb{R}\to\mathbb{R}$ that is monotone in $|x|$ and satisfies $d(0)=0$; we will later let $d(x)\to\infty$ for all $x\neq 0$. Then the supremum function

\[
\phi_{\gamma}(x,y;\beta)=\sup_{(u,v)\in\mathbb{R}^{d+1}}\big\{\ell(u,v;\beta)-\gamma c\big((u,v),(x,y)\big)\big\}
\]

forces $v=y$, since the inner term equals $-\infty$ whenever $v\neq y$. Setting $v=y$, we have

\[
\ell(u,v;\beta)-\gamma c\big((u,v),(x,y)\big)=(y-\beta^{\sf T}u)^{2}-\gamma\|x-u\|_{q}^{2}-\gamma d(\theta_{1}^{\sf T}x-\theta_{1}^{\sf T}u)-\ldots-\gamma d(\theta_{M}^{\sf T}x-\theta_{M}^{\sf T}u).
\]

Denoting $\Delta\coloneqq u-x$, we get

\[
\ell(u,v;\beta)-\gamma c\big((u,v),(x,y)\big)=r(\beta)^{2}+\big\{(\beta^{\sf T}\Delta)^{2}-2r(\beta)\beta^{\sf T}\Delta-\gamma\|\Delta\|_{q}^{2}-\gamma d(\theta_{1}^{\sf T}\Delta)-\ldots-\gamma d(\theta_{M}^{\sf T}\Delta)\big\}.
\]

Consider the objective in $\Delta$:

\begin{align*}
&\sup_{\Delta}\big\{(\beta^{\sf T}\Delta)^{2}-2r(\beta)\beta^{\sf T}\Delta-\gamma\|\Delta\|_{q}^{2}-\gamma d(\theta_{1}^{\sf T}\Delta)-\ldots-\gamma d(\theta_{M}^{\sf T}\Delta)\big\}\\
\eqqcolon\;&\sup_{\Delta}\big\{f_{\beta}(\Delta)-g(\Delta)\big\},
\end{align*}

where we let $f_{\beta}(\Delta)\coloneqq(\beta^{\sf T}\Delta)^{2}-2r(\beta)\beta^{\sf T}\Delta$ and $g(\Delta)\coloneqq\gamma\|\Delta\|_{q}^{2}+\gamma d(\theta_{1}^{\sf T}\Delta)+\ldots+\gamma d(\theta_{M}^{\sf T}\Delta)$. This is a difference-of-convex optimization, so we express the convex function $f_{\beta}(\Delta)$ as a supremum of infinitely many affine functions. By Lemma A1, $f_{\beta}(\Delta)=f_{\beta}^{**}(\Delta)=\sup_{\alpha\in\mathbb{R}}\big(\alpha(\beta^{\sf T}\Delta)-\frac{(\alpha+2r(\beta))^{2}}{4}\big)$, so we may write

\begin{align*}
\sup_{\Delta}\big\{f_{\beta}(\Delta)-g(\Delta)\big\}
&=\sup_{\Delta}\left\{\sup_{\alpha\in\mathbb{R}}\left(\alpha(\beta^{\sf T}\Delta)-\dfrac{(\alpha+2r(\beta))^{2}}{4}\right)-g(\Delta)\right\}\\
&=\sup_{\alpha}\left\{\sup_{\Delta}\big(\Delta^{\sf T}(\alpha\beta)-g(\Delta)\big)-\dfrac{\big(\alpha+2r(\beta)\big)^{2}}{4}\right\}\\
&=\sup_{\alpha}\left\{g^{*}(\alpha\beta)-\dfrac{\big(\alpha+2r(\beta)\big)^{2}}{4}\right\}\qquad\text{(Toland's duality)},
\end{align*}

where $g^{*}$ is the convex conjugate of $g$. Write $g(\Delta)=g_{1}(\Delta)+g_{\theta}(\Delta)$, with $g_{1}(\Delta)\coloneqq\gamma\|\Delta\|_{q}^{2}$ and $g_{\theta}(\Delta)\coloneqq\gamma\sum_{m=1}^{M}d(\theta_{m}^{\sf T}\Delta)$. We can then compute the convex conjugate of $g$ using the infimal-convolution property (Theorem A5):

\[
g^{*}(\Delta^{*})=\inf_{\Delta^{*}_{1}+\Delta^{*}_{2}=\Delta^{*}}\big(g_{1}^{*}(\Delta_{1}^{*})+g_{\theta}^{*}(\Delta_{2}^{*})\big).
\]

We know that $g_{1}^{*}(\Delta_{1}^{*})=\dfrac{1}{4\gamma}\|\Delta_{1}^{*}\|_{p}^{2}$, where $p^{-1}+q^{-1}=1$ (Proposition A2). Now suppose $d(x)=\lambda|x|$ for some $\lambda>0$; by Lemma A2, we have

\[
g_{\theta}^{*}(\Delta_{2}^{*})=\begin{cases}0&\text{if }\Delta_{2}^{*}=\sum_{m=1}^{M}\alpha_{m}\theta_{m}\text{ and }|\alpha_{m}|\leq\gamma\lambda\text{ for each }m,\\
+\infty&\text{otherwise}.\end{cases}
\]

Then the convex conjugate $g^{*}(\Delta^{*})$ is

\begin{align*}
g^{*}(\Delta^{*})&=\inf_{\Delta_{2}^{*}}\;g_{1}^{*}(\Delta^{*}-\Delta_{2}^{*})\\
&\quad\text{s.t.}\quad\Delta_{2}^{*}=\sum_{m=1}^{M}\alpha_{m}\theta_{m}\text{ and }|\alpha_{m}|\leq\gamma\lambda\text{ for each }m,
\end{align*}

or, equivalently,

\begin{align*}
g^{*}(\Delta^{*})&=\dfrac{1}{4\gamma}\inf_{\boldsymbol{\alpha}}\left\|\Delta^{*}-\sum_{m=1}^{M}\alpha_{m}\theta_{m}\right\|_{p}^{2}\\
&\quad\text{s.t.}\quad|\alpha_{m}|\leq\gamma\lambda\text{ for each }m.
\end{align*}

Letting $\lambda\to\infty$, we recover the cost function $c_{2,\infty}$, and each $\alpha_{m}$ becomes free in $\mathbb{R}$. Then $g^{*}(\Delta^{*})=\dfrac{1}{4\gamma}\inf_{\vartheta\in\Theta}\|\Delta^{*}-\vartheta\|_{p}^{2}$, with $\Theta\coloneqq\operatorname{span}\{\theta_{1},\ldots,\theta_{M}\}$; the validity of this limiting argument follows from (Luenberger and Ye, 2008, Theorem 1, Section 13.1). In particular, $g^{*}(\alpha\beta)=\dfrac{1}{4\gamma}\inf_{\vartheta\in\Theta}\|\alpha\beta-\vartheta\|_{p}^{2}$. Suppose $\alpha\neq 0$; since $\Theta$ is a linear subspace, substituting $\vartheta=\alpha\vartheta'$ gives

\[
g^{*}(\alpha\beta)=\dfrac{\alpha^{2}}{4\gamma}\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{p}^{2}.
\]

If $\alpha=0$, then $g^{*}(\alpha\beta)=g^{*}(\boldsymbol{0})=\dfrac{1}{4\gamma}\inf_{\vartheta\in\Theta}\|\vartheta\|_{p}^{2}=0$, so the representation $g^{*}(\alpha\beta)=\dfrac{\alpha^{2}}{4\gamma}\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{p}^{2}$ is valid for all $\alpha\in\mathbb{R}$. Therefore, following the proof of (Blanchet et al., 2019a, Theorem 1),

\begin{align*}
\phi_{\gamma}(x,y;\beta)&=r(\beta)^{2}+\dfrac{1}{4}\sup_{\alpha}\left\{\dfrac{\alpha^{2}}{\gamma}\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{p}^{2}-\big(\alpha+2r(\beta)\big)^{2}\right\}\\
&=\dfrac{1}{4}\sup_{\alpha}\left\{\left(\dfrac{\inf_{\vartheta}\|\beta-\vartheta\|_{p}^{2}}{\gamma}-1\right)\alpha^{2}-4r(\beta)\alpha\right\}\\
&=\begin{cases}\dfrac{r(\beta)^{2}\,\gamma}{\gamma-\inf_{\vartheta}\|\beta-\vartheta\|_{p}^{2}}&\text{if }\inf_{\vartheta}\|\beta-\vartheta\|_{p}^{2}\leq\gamma,\\
+\infty&\text{otherwise}.\end{cases}
\end{align*}

Then the minimization objective can be simplified as

\begin{align*}
&\inf_{\beta\in\mathbb{R}^{d}}\min_{\gamma\geq 0}\left\{\gamma\delta+\dfrac{1}{n}\sum_{i=1}^{n}\phi_{\gamma}(x_{i},y_{i};\beta)\right\}\\
=\;&\inf_{\beta}\inf_{\gamma\geq\inf_{\vartheta}\lVert\beta-\vartheta\rVert_{p}^{2}}\left\{\gamma\delta+\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{r_{i}(\beta)^{2}\gamma}{\gamma-\inf_{\vartheta}\lVert\beta-\vartheta\rVert_{p}^{2}}\right\}\\
=\;&\inf_{\beta}\inf_{\gamma\geq\inf_{\vartheta}\lVert\beta-\vartheta\rVert_{p}^{2}}\left\{\gamma\delta+\text{MSE}(\beta)\,\dfrac{\gamma}{\gamma-\inf_{\vartheta}\lVert\beta-\vartheta\rVert_{p}^{2}}\right\}\\
=\;&\inf_{\beta}\left(\sqrt{\text{MSE}(\beta)}+\sqrt{\delta}\,\inf_{\vartheta}\|\beta-\vartheta\|_{p}\right)^{2},
\end{align*}

where the last equality follows because $\gamma\delta+\text{MSE}(\beta)\,\dfrac{\gamma}{\gamma-\inf_{\vartheta}\lVert\beta-\vartheta\rVert_{p}^{2}}$ is a convex function of $\gamma$ that tends to $+\infty$ at both boundaries $\inf_{\vartheta}\|\beta-\vartheta\|_{p}^{2}$ and $+\infty$, so the optimization over $\gamma$ is solved by the first-order condition: writing $c\coloneqq\inf_{\vartheta}\|\beta-\vartheta\|_{p}^{2}$, setting the derivative $\delta-\text{MSE}(\beta)\,c/(\gamma-c)^{2}$ to zero gives $\gamma^{*}=c+\sqrt{\text{MSE}(\beta)\,c/\delta}$, and substituting back yields $\text{MSE}(\beta)+2\sqrt{\delta c\,\text{MSE}(\beta)}+\delta c=\big(\sqrt{\text{MSE}(\beta)}+\sqrt{\delta}\inf_{\vartheta}\|\beta-\vartheta\|_{p}\big)^{2}$. Then, by Proposition 1, strong duality holds and

infβsup:𝒟c2(,n)δ𝔼[(Yβ𝖳X)2]=infβ,ϑ(MSE(β)+δβϑp)2.subscriptinfimum𝛽subscriptsupremum:subscript𝒟subscript𝑐2subscript𝑛𝛿subscript𝔼delimited-[]superscript𝑌superscript𝛽𝖳𝑋2subscriptinfimum𝛽italic-ϑsuperscriptMSE𝛽𝛿subscriptnorm𝛽italic-ϑ𝑝2\inf_{\beta}\sup_{\mathbb{P}:\mathcal{D}_{c_{2}}(\mathbb{P},\mathbb{P}_{n})%\leq\delta}\mathbb{E}_{\mathbb{P}}\left[(Y-\beta^{\scriptscriptstyle\sf T}X)^{%2}\right]=\inf_{\beta,\vartheta}\left(\sqrt{\text{MSE}(\beta)}+\sqrt{\delta}\|%\beta-\vartheta\|_{p}\right)^{2}.roman_inf start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT roman_sup start_POSTSUBSCRIPT blackboard_P : caligraphic_D start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( blackboard_P , blackboard_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ≤ italic_δ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT blackboard_P end_POSTSUBSCRIPT [ ( italic_Y - italic_β start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] = roman_inf start_POSTSUBSCRIPT italic_β , italic_ϑ end_POSTSUBSCRIPT ( square-root start_ARG MSE ( italic_β ) end_ARG + square-root start_ARG italic_δ end_ARG ∥ italic_β - italic_ϑ ∥ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

This reduces the infinite-dimensional optimization to a finite-dimensional problem; here we interchanged $\inf_{\vartheta}$ with the outer square, which is valid because the square function is monotonically increasing on the nonnegative reals.∎
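As an illustrative numerical check of the one-dimensional minimization over $\gamma$ used above (not part of the formal argument), the snippet below treats $M=\text{MSE}(\beta)$ and $c=\inf_{\vartheta}\lVert\beta-\vartheta\rVert_{p}$ as fixed positive numbers and verifies that $\min_{\gamma\geq c^{2}}\{\gamma\delta+M\gamma/(\gamma-c^{2})\}=(\sqrt{M}+\sqrt{\delta}\,c)^{2}$, with the minimizer given by the first-order condition.

```python
import numpy as np

# Sanity check (illustrative only): for M = MSE(beta) and c = inf_vartheta ||beta - vartheta||_p,
# the dual objective f(gamma) = gamma*delta + M*gamma/(gamma - c^2) over gamma > c^2
# attains its minimum (sqrt(M) + sqrt(delta)*c)^2 at gamma* = c^2 + c*sqrt(M/delta).
rng = np.random.default_rng(0)
for _ in range(5):
    M, delta, c = rng.uniform(0.1, 5.0, size=3)
    f = lambda g: g * delta + M * g / (g - c**2)
    gamma_star = c**2 + c * np.sqrt(M / delta)        # stationary point from the first-order condition
    grid = np.linspace(c**2 + 1e-6, c**2 + 50.0, 200_000)
    assert np.isclose(f(gamma_star), (np.sqrt(M) + np.sqrt(delta) * c) ** 2, rtol=1e-8)
    assert f(grid).min() >= f(gamma_star) - 1e-6      # no grid point beats the stationary point
print("closed form for the gamma-minimization verified")
```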

The next proof is of Theorem 2, which uses the weak transferring cost function $c_{2,\lambda}\big((x,y),(u,v)\big)=\|x-u\|_{2}^{2}+\lambda(\theta^{\sf T}x-\theta^{\sf T}u)^{2}+\infty\cdot|y-v|$ for some $\lambda>0$. The statement generalizes to multiple sites by first orthogonalizing the prior knowledge $\{\theta_{1},\ldots,\theta_{M}\}$.
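Before the proof, note that the covariate part of $c_{2,\lambda}$ is the squared Mahalanobis distance $(x-u)^{\sf T}(I_{d}+\lambda\theta\theta^{\sf T})(x-u)$, and the inverse of $I_{d}+\lambda\theta\theta^{\sf T}$ is exactly the matrix $\Psi_{\lambda}$ that appears below. The following snippet is a small numerical illustration of these two identities; the dimension, the value of $\lambda$, and the random vectors are arbitrary choices.

```python
import numpy as np

# Illustrative check: the covariate part of c_{2,lambda} is the Mahalanobis distance induced by
# A = I + lambda * theta theta^T, whose inverse is the matrix Psi_lambda used in the proof of Theorem 2.
rng = np.random.default_rng(1)
d, lam = 6, 2.5
theta, x, u = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
A = np.eye(d) + lam * np.outer(theta, theta)
lhs = np.sum((x - u) ** 2) + lam * (theta @ (x - u)) ** 2
rhs = (x - u) @ A @ (x - u)
Psi = np.eye(d) - np.outer(theta, theta) / (theta @ theta + 1.0 / lam)
assert np.isclose(lhs, rhs)
assert np.allclose(np.linalg.inv(A), Psi)
print("covariate cost of c_{2,lambda} = Mahalanobis distance with A = Psi_lambda^{-1}")
```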

Proof of Theorem 2.

Following the proof of Theorem 1, we solve the optimization problem

\[
\sup_{\Delta\in\mathbb{R}^{d}}\big((\beta^{\sf T}\Delta)^{2}-2r(\beta)\,\beta^{\sf T}\Delta-\gamma\|\Delta\|_{2}^{2}-\gamma\lambda(\theta^{\sf T}\Delta)^{2}\big),
\]

where we recall that $\gamma$ is the dual variable in the statement of Proposition 1, $\lambda>0$ is the transferring strength, $\theta\in\mathbb{R}^{d}$ is the prior knowledge, and $r(\beta)=y-\beta^{\sf T}x$ is the residual at $\beta$.

Now let $\mathbb{O}$ be an orthogonal matrix whose first column is $\theta/\|\theta\|_{2}$, and apply the change of variables $\widetilde{\Delta}\coloneqq\mathbb{O}^{-1}\Delta$. The objective function becomes

\[
(\beta^{\sf T}\mathbb{O}\widetilde{\Delta})^{2}-2r(\beta)\,\beta^{\sf T}\mathbb{O}\widetilde{\Delta}-\gamma\|\widetilde{\Delta}\|_{2}^{2}-\gamma\lambda\|\theta\|_{2}^{2}\widetilde{\Delta}_{1}^{2},
\]

where the last term follows because $\theta^{\sf T}\mathbb{O}=(\|\theta\|_{2},0,\ldots,0)$, and $\widetilde{\Delta}_{1}$ denotes the first component of $\widetilde{\Delta}$. Now define

\[
D=\operatorname{diag}\left\{\sqrt{\lambda\|\theta\|_{2}^{2}+1},\,1,\ldots,1\right\},
\]

and consider the change of variables $\bar{\Delta}=D\widetilde{\Delta}$; then the last two terms become

\[
\|\widetilde{\Delta}\|_{2}^{2}+\lambda\|\theta\|_{2}^{2}\widetilde{\Delta}_{1}^{2}
=\|D^{-1}\bar{\Delta}\|_{2}^{2}+\lambda\|\theta\|_{2}^{2}\,\frac{\bar{\Delta}_{1}^{2}}{\lambda\|\theta\|_{2}^{2}+1}
=\sum_{i=1}^{d}\bar{\Delta}_{i}^{2}=\|\bar{\Delta}\|_{2}^{2}.
\]

Therefore, the objective becomes

\begin{align*}
&\sup_{\bar{\Delta}}\big((\beta^{\sf T}\mathbb{O}D^{-1}\bar{\Delta})^{2}-2r(\beta)\,\beta^{\sf T}\mathbb{O}D^{-1}\bar{\Delta}-\gamma\|\bar{\Delta}\|_{2}^{2}\big)\\
={}&\sup_{\bar{\Delta}}\big(\|\beta^{\sf T}\mathbb{O}D^{-1}\|_{2}^{2}\|\bar{\Delta}\|_{2}^{2}-2r(\beta)\|\beta^{\sf T}\mathbb{O}D^{-1}\|_{2}\|\bar{\Delta}\|_{2}-\gamma\|\bar{\Delta}\|_{2}^{2}\big)\\
={}&\sup_{\bar{\Delta}}\big((\|\beta\|_{\Psi_{\lambda}}^{2}-\gamma)\|\bar{\Delta}\|_{2}^{2}-2r(\beta)\|\beta\|_{\Psi_{\lambda}}\|\bar{\Delta}\|_{2}\big),
\end{align*}

which has finite optimal value $\dfrac{r(\beta)^{2}\|\beta\|_{\Psi_{\lambda}}^{2}}{\gamma-\|\beta\|_{\Psi_{\lambda}}^{2}}$ whenever $\gamma\geq\|\beta\|_{\Psi_{\lambda}}^{2}$, with $\Psi_{\lambda}$ denoting the positive-definite symmetric matrix

\[
\Psi_{\lambda}=I_{d}-\frac{1}{\|\theta\|_{2}^{2}+\lambda^{-1}}\theta\theta^{\sf T},
\]

which is independent of the choice of $\mathbb{O}$. The first equality follows from the Cauchy–Schwarz inequality; since $\bar{\Delta}\in\mathbb{R}^{d}$ is unconstrained, some $\bar{\Delta}$ attains equality. The rest of the proof follows exactly the proof of Theorem 1, completing the optimization over the dual variable using Proposition 1.∎
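The following snippet is an illustrative numerical check (random inputs, not part of the proof) of the two computations above: it completes $\theta/\|\theta\|_{2}$ to an orthogonal matrix $\mathbb{O}$ via a QR factorization, confirms that $\mathbb{O}D^{-2}\mathbb{O}^{\sf T}$ matches the closed form of $\Psi_{\lambda}$ regardless of how $\mathbb{O}$ is completed, and evaluates the quadratic supremum through its first-order condition to recover $r(\beta)^{2}\|\beta\|_{\Psi_{\lambda}}^{2}/(\gamma-\|\beta\|_{\Psi_{\lambda}}^{2})$. The dimension, $\lambda$, and $r$ are arbitrary choices.

```python
import numpy as np

# Illustrative check of the key computation in the proof of Theorem 2 (random inputs):
#  (i)  Psi_lambda = O D^{-2} O^T equals I_d - theta theta^T / (||theta||_2^2 + 1/lambda),
#       independently of how the orthogonal matrix O is completed;
#  (ii) sup_Delta (beta^T Delta)^2 - 2 r beta^T Delta - gamma ||Delta||_2^2 - gamma*lam*(theta^T Delta)^2
#       equals r^2 ||beta||_{Psi}^2 / (gamma - ||beta||_{Psi}^2) when gamma > ||beta||_{Psi}^2.
rng = np.random.default_rng(2)
d, lam, r = 5, 3.0, 0.7
theta, beta = rng.normal(size=d), rng.normal(size=d)

O, _ = np.linalg.qr(np.column_stack([theta, rng.normal(size=(d, d - 1))]))
O *= np.sign(O[:, 0] @ theta)                      # make the first column +theta/||theta||_2
D2_inv = np.diag([1.0 / (lam * theta @ theta + 1.0)] + [1.0] * (d - 1))
Psi = O @ D2_inv @ O.T
assert np.allclose(Psi, np.eye(d) - np.outer(theta, theta) / (theta @ theta + 1.0 / lam))

beta_Psi_sq = beta @ Psi @ beta
gamma = beta_Psi_sq + 1.0                          # ensures strict concavity of the quadratic in Delta
A = gamma * (np.eye(d) + lam * np.outer(theta, theta))
Delta_star = np.linalg.solve(np.outer(beta, beta) - A, r * beta)   # first-order condition
sup_val = (beta @ Delta_star) ** 2 - 2 * r * (beta @ Delta_star) - Delta_star @ A @ Delta_star
assert np.isclose(sup_val, r**2 * beta_Psi_sq / (gamma - beta_Psi_sq))
print("Psi_lambda closed form and quadratic supremum verified")
```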

Appendix C Proof of Results in Classification.

Lemma A3.

Consider the convex function $h_{\beta}:\mathbb{R}^{d}\to\mathbb{R}$ given by $x\mapsto\log\big(1+\exp(-\beta^{\sf T}x)\big)$, and fix $x^{\prime}\in\mathbb{R}^{d}$ and $q\geq 1$. Then for every $\gamma>0$, the constrained optimization problem $H_{\beta}(x^{\prime})$ defined as

\begin{align*}
\sup_{x\in\mathbb{R}^{d}}\quad& h_{\beta}(x)-\gamma\lVert x^{\prime}-x\rVert_{q},\\
\text{s.t.}\quad& \theta^{\sf T}(x^{\prime}-x)=0,
\end{align*}

has optimal objective value,

\[
H_{\beta}(x^{\prime})=\begin{cases}h_{\beta}(x^{\prime})&\text{if }\inf_{\kappa\in\mathbb{R}}\lVert\beta-\kappa\theta\rVert_{p}\leq\gamma,\\+\infty&\text{otherwise,}\end{cases}
\]

where $p,q\in[1,\infty)$ with $p^{-1}+q^{-1}=1$.

Proof.

This lemma is a simple extension of (Shafieezadeh-Abadeh et al., 2015, Lemma 1). Following their proof, it is shown that

\[
h_{\beta}(x)=h_{\beta}^{**}(x)=\sup_{0\leq\alpha\leq 1}\big((\alpha\beta)^{\sf T}x-\bar{h}^{*}(\alpha)\big),
\]

where

\[
\bar{h}^{*}(\alpha)=\begin{cases}\alpha\log(\alpha)+(1-\alpha)\log(1-\alpha)&\text{if }\alpha\in[0,1],\\+\infty&\text{otherwise},\end{cases}
\]

is the convex conjugate of the function $\log\big(1+e^{-x}\big)$ (Proposition A2). It is then shown that the objective $H_{\beta}$ admits the representation

\begin{align*}
\sup_{0\leq\alpha\leq 1}\,\inf_{\|q\|_{p}\leq\gamma}\,\sup_{x}\quad&\big((\alpha\beta+q)^{\sf T}x-\bar{h}^{*}(\alpha)-q^{\sf T}x^{\prime}\big),\\
\text{s.t.}\quad&\theta^{\sf T}(x-x^{\prime})=0.
\end{align*}

Fixing $\alpha$ and $q$, the inner maximization in $x$,

\begin{align*}
\sup_{x}\quad&\big((\alpha\beta+q)^{\sf T}x-q^{\sf T}x^{\prime}\big),\\
\text{s.t.}\quad&\theta^{\sf T}(x-x^{\prime})=0,
\end{align*}

has optimal value $(\alpha\beta)^{\sf T}x^{\prime}$ when $\alpha\beta+q=\mu\theta$ for some $\mu\in\mathbb{R}$ (obtained from the first-order condition of the Lagrangian), and $+\infty$ otherwise. Then, conditioning on $\{\alpha\beta+q=\mu\theta\,|\,\mu\in\mathbb{R}\}$, the objective has the representation

\begin{align*}
H_{\beta}(x^{\prime})&=\sup_{0\leq\alpha\leq 1}\,\inf_{\|q\|_{p}\leq\gamma}\big((\alpha\beta)^{\sf T}x^{\prime}-\bar{h}^{*}(\alpha)\big)\quad\text{s.t. }q=\mu\theta-\alpha\beta\\
&=\sup_{0\leq\alpha\leq 1}\,\inf_{\mu:\,\|\mu\theta-\alpha\beta\|_{p}\leq\gamma}\big((\alpha\beta)^{\sf T}x^{\prime}-\bar{h}^{*}(\alpha)\big).
\end{align*}

Consider the constraint $\|\mu\theta-\alpha\beta\|_{p}\leq\gamma$ over $\mu$. Suppose $\alpha>0$; factoring $\alpha$ out of the norm gives the equivalent constraint $|\alpha|\,\big\|\beta-\tfrac{\mu}{\alpha}\theta\big\|_{p}\leq\gamma$. With the change of variable $\kappa\coloneqq\tfrac{\mu}{\alpha}$, and since the Lagrange multiplier $\mu\in\mathbb{R}$ is free, $\kappa$ is also free, so feasibility in $\mu$ is equivalent to $\inf_{\kappa\in\mathbb{R}}|\alpha|\,\|\beta-\kappa\theta\|_{p}\leq\gamma$. If $\alpha=0$, then $\inf_{\mu}\|\mu\theta-0\|_{p}=0=0\cdot\inf_{\kappa}\|\beta-\kappa\theta\|_{p}$, so the equivalent constraint $\inf_{\kappa}|\alpha|\,\|\beta-\kappa\theta\|_{p}\leq\gamma$ is valid for all $\alpha\in[0,1]$. Then, conditioning on $\{\alpha\beta+q=\mu\theta\,|\,\mu\in\mathbb{R}\}$, the objective becomes

\begin{align*}
H_{\beta}(x^{\prime})&=\sup_{0\leq\alpha\leq 1}\big((\alpha\beta)^{\sf T}x^{\prime}-\bar{h}^{*}(\alpha)\big)\quad\text{s.t. }\sup_{0\leq\alpha\leq 1}|\alpha|\inf_{\kappa}\|\beta-\kappa\theta\|_{p}\leq\gamma\\
&=\sup_{0\leq\alpha\leq 1}\big((\alpha\beta)^{\sf T}x^{\prime}-\bar{h}^{*}(\alpha)\big)\quad\text{s.t. }\inf_{\kappa}\|\beta-\kappa\theta\|_{p}\leq\gamma.
\end{align*}

Recognizing that

\[
\sup_{0\leq\alpha\leq 1}\big((\alpha\beta)^{\sf T}x^{\prime}-\bar{h}^{*}(\alpha)\big)=\sup_{0\leq\alpha\leq 1}\alpha(\beta^{\sf T}x^{\prime})-\bar{h}^{*}(\alpha)=\bar{h}^{**}(\beta^{\sf T}x^{\prime})=h_{\beta}(x^{\prime}),
\]

we get

\[
H_{\beta}(x^{\prime})=\begin{cases}h_{\beta}(x^{\prime})&\text{if }\inf_{\kappa}\|\beta-\kappa\theta\|_{p}\leq\gamma,\\+\infty&\text{otherwise.}\end{cases}
\]

The above Lemma A3 is easily generalized to incorporate multiple orthogonality constraints $\{\theta_{m}^{\sf T}(x^{\prime}-x)=0;\,m\in[M]\}$ using the exact same Lagrangian formulation. Again, recall $\Theta=\operatorname{span}\{\theta_{1},\ldots,\theta_{M}\}$. Thus the optimal objective value under multiple constraints becomes

\[
H_{\beta}(x^{\prime})=\begin{cases}h_{\beta}(x^{\prime})&\text{if }\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{p}\leq\gamma,\\+\infty&\text{otherwise.}\end{cases}
\]
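For the Euclidean case $p=2$, the quantity $\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{2}$ is simply the norm of the residual of $\beta$ after projecting onto $\Theta$. The snippet below is an illustrative computation ($p=2$ only, with randomly generated $\theta_{1},\ldots,\theta_{M}$ and $\beta$) using a least-squares projection, cross-checked against random elements of $\Theta$.

```python
import numpy as np

# Illustrative computation of inf_{vartheta in Theta} ||beta - vartheta||_2 for p = 2:
# it equals the Euclidean norm of the component of beta orthogonal to Theta = span{theta_1, ..., theta_M}.
rng = np.random.default_rng(3)
d, M = 8, 3
Theta_basis = rng.normal(size=(d, M))          # columns theta_1, ..., theta_M (hypothetical priors)
beta = rng.normal(size=d)
coef, *_ = np.linalg.lstsq(Theta_basis, beta, rcond=None)
residual = beta - Theta_basis @ coef           # component of beta orthogonal to Theta
dist = np.linalg.norm(residual)
# brute-force cross-check against random elements of Theta
rand_dists = [np.linalg.norm(beta - Theta_basis @ rng.normal(size=M)) for _ in range(10_000)]
assert dist <= min(rand_dists) + 1e-12
print(f"inf over Theta of ||beta - vartheta||_2 = {dist:.4f}")
```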

We now give the proof of Theorem 3.

Proof of Theorem 3 for Logistic Loss.

Using Proposition 1, we apply strong duality and consider the inner optimization problem

\begin{align*}
&\sup_{\mathbb{P}:\,\mathcal{D}_{c_{1},\infty}(\mathbb{P},\mathbb{P}_{n})\leq\delta}\mathbb{E}_{\mathbb{P}}\left[\log\left(1+e^{-Y\beta^{\sf T}X}\right)\right]\\
&=\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{N}\sup_{u\in\mathbb{R}^{d}}\left(\log\left(1+e^{-y_{i}\beta^{\sf T}u}\right)-\gamma\|x_{i}-u\|_{q}\right)\right\},\\
&\qquad\text{s.t. }\theta_{m}^{\sf T}(x_{i}-u)=0,\text{ for all }m\in[M]\text{ and }i\in[N].
\end{align*}

For each $i\in[N]$, we apply Lemma A3 to the maximization problem,

\[
H_{\beta}(x_{i})=\sup_{u\in\mathbb{R}^{d}}\left(\log\left(1+e^{-y_{i}\beta^{\sf T}u}\right)-\gamma\|x_{i}-u\|_{q}\right),\quad\text{s.t. }\theta_{m}^{\sf T}(x_{i}-u)=0\text{ for all }m\in[M],
\]

which has optimal value

\[
\begin{cases}\log\left(1+e^{-y_{i}\beta^{\sf T}x_{i}}\right)&\text{if }\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{p}\leq\gamma,\\+\infty&\text{otherwise.}\end{cases}
\]

Therefore, the maximization problem $\sup_{\mathbb{P}:\,\mathcal{D}_{c_{1},\infty}(\mathbb{P},\mathbb{P}_{n})\leq\delta}\mathbb{E}_{\mathbb{P}}\big[\log\big(1+e^{-Y\beta^{\sf T}X}\big)\big]$ is bounded from above if and only if $\gamma\geq\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{p}$. Under this condition, the inner optimization problem reduces to

\begin{align*}
\sup_{\mathbb{P}:\,\mathcal{D}_{c_{1},\infty}(\mathbb{P},\mathbb{P}_{n})\leq\delta}\mathbb{E}_{\mathbb{P}}\left[\log\left(1+e^{-Y\beta^{\sf T}X}\right)\right]&=\inf_{\gamma\geq\inf_{\vartheta}\|\beta-\vartheta\|_{p}}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{N}\log\left(1+e^{-y_{i}\beta^{\sf T}x_{i}}\right)\right\}\\
&=\frac{1}{n}\sum_{i=1}^{N}\log\left(1+e^{-y_{i}\beta^{\sf T}x_{i}}\right)+\delta\inf_{\vartheta}\|\beta-\vartheta\|_{p}.
\end{align*}

This concludes the proof.∎
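For concreteness, the reduced objective above can be implemented directly when $p=2$ and $\Theta=\operatorname{span}\{\theta\}$, since $\inf_{\vartheta\in\Theta}\|\beta-\vartheta\|_{2}=\|(I_{d}-P_{\Theta})\beta\|_{2}$ with $P_{\Theta}$ the orthogonal projector onto $\Theta$. The sketch below fits this knowledge-guided logistic estimator on synthetic data; the data-generating process, the single prior direction, and the choice of BFGS are illustrative assumptions, not part of the theorem.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (p = 2, Theta = span{theta}) of the reduced KG-WDRO logistic objective:
#   (1/n) * sum_i log(1 + exp(-y_i beta^T x_i)) + delta * ||(I - P_Theta) beta||_2 .
# Synthetic data and the prior direction theta below are illustrative assumptions.
rng = np.random.default_rng(4)
n, d, delta = 50, 10, 0.3
theta = rng.normal(size=d)                          # prior knowledge from a source model (hypothetical)
X = rng.normal(size=(n, d))
y = np.sign(X @ theta + 0.5 * rng.normal(size=n))   # labels loosely aligned with the prior direction
P_perp = np.eye(d) - np.outer(theta, theta) / (theta @ theta)   # projector onto the complement of Theta

def kg_wdro_logistic(beta):
    margins = y * (X @ beta)
    return np.mean(np.logaddexp(0.0, -margins)) + delta * np.linalg.norm(P_perp @ beta)

# BFGS on a mildly nonsmooth objective is adequate for illustration; start away from the kink at 0.
beta_hat = minimize(kg_wdro_logistic, x0=0.01 * rng.normal(size=d), method="BFGS").x
print("fitted coefficients:", np.round(beta_hat, 3))
```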

We now give the proof for the maximum-margin classifier using the hinge loss.

Proof of Theorem 3 for Hinge Loss.

As in the proof of Theorem 1, we first consider the relaxed cost function

\[
c_{1}\big((x,y),(u,v)\big)=\|x-u\|_{q}+\infty\cdot|y-v|+\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}x-\theta_{m}^{\sf T}u|,
\]

where the transferring strength is relaxed from $+\infty$ to some finite value $\lambda>0$; we will later let $\lambda\to+\infty$. Again, by strong duality, we can evaluate the worst-case hinge loss by solving the dual problem

\[
\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{N}\sup_{u}\left((1-y_{i}\beta^{\sf T}u)^{+}-\gamma\|u-x_{i}\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}(x_{i}-u)|\right)\right\}.
\]

Let $\Delta\coloneqq u-x$; then we have

\begin{align*}
&\sup_{u}\left((1-y\beta^{\sf T}u)^{+}-\gamma\|u-x\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}(x-u)|\right)\\
={}&\sup_{\Delta}\left((1-y\beta^{\sf T}(\Delta+x))^{+}-\gamma\|\Delta\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta|\right)\\
={}&\sup_{\Delta}\sup_{0\leq\alpha\leq 1}\left(\alpha(1-y\beta^{\sf T}(\Delta+x))-\gamma\|\Delta\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta|\right)\\
={}&\sup_{0\leq\alpha\leq 1}\sup_{\Delta}\left(-\alpha y\beta^{\sf T}\Delta-\gamma\|\Delta\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta|+\alpha(1-y\beta^{\sf T}x)\right),
\end{align*}

where in the second equality we used $x^{+}=\sup_{0\leq\alpha\leq 1}\alpha x$. Fixing $\alpha$, consider the inner maximization in $\Delta$,

\[
\sup_{\Delta}\left(-\alpha y\beta^{\sf T}\Delta-\gamma\|\Delta\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta|\right)=g^{*}(-\alpha y\beta),
\]

where $g^{*}(\Delta^{*})$ is the convex conjugate of $g(\Delta)\coloneqq\gamma\|\Delta\|_{q}+\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta|$. Set $g_{1}(\Delta_{1})\coloneqq\gamma\|\Delta_{1}\|_{q}$ and $g_{2}(\Delta_{2})\coloneqq\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta_{2}|$; then, by the infimal convolution property of convex conjugates (Theorem A5), we know that

\[
g^{*}(\Delta^{*})=\inf_{\Delta^{*}_{1}+\Delta^{*}_{2}=\Delta^{*}}\big(g_{1}^{*}(\Delta^{*}_{1})+g^{*}_{2}(\Delta^{*}_{2})\big).
\]

From Lemma A2, we know that $g_{2}^{*}(\Delta_{2}^{*})=0$ if $\Delta_{2}^{*}=\sum_{m=1}^{M}\alpha_{m}\theta_{m}$ with $|\alpha_{m}|\leq\lambda\gamma$ for all $m\in[M]$, and $g_{2}^{*}(\Delta_{2}^{*})=+\infty$ otherwise; in particular, $g^{*}(\Delta^{*})$ can be finite only if such a decomposition exists. It is also well known (Proposition A2) that

\[
g_{1}^{*}(\Delta_{1}^{*})=\big(\gamma\|\cdot\|_{q}\big)^{*}(\Delta_{1}^{*})=I_{\{\|\Delta_{1}^{*}\|_{p}\leq\gamma\}}(\Delta_{1}^{*}),
\]

where $I_{C}(x)$ denotes the convex indicator of the set $C$. Therefore, letting $\lambda\to\infty$, the constraints $\{|\alpha_{m}|\leq\lambda\gamma\,|\,m\in[M]\}$ become redundant, and we have

\[
g^{*}(\Delta^{*})=\begin{cases}0&\text{if }\inf_{\vartheta\in\Theta}\|\Delta^{*}-\vartheta\|_{p}\leq\gamma,\\+\infty&\text{otherwise,}\end{cases}
\]

where we let $\Theta\coloneqq\operatorname{span}\{\theta_{1},\ldots,\theta_{M}\}$. Therefore, $g^{*}(-\alpha y\beta)$ is finite if and only if $\inf_{\vartheta\in\Theta}\|-\alpha y\beta-\vartheta\|_{p}\leq\gamma$. Since $y=\pm1$ and $\Theta$ is a subspace, the factor $-y$ can be dropped, leaving the condition $\inf_{\vartheta}\|\alpha\beta-\vartheta\|_{p}\leq\gamma$, which is equivalent to $\alpha\inf_{\vartheta}\|\beta-\vartheta\|_{p}\leq\gamma$ for each $\alpha\in[0,1]$, including $\alpha=0$. Taking the supremum over $\alpha\in[0,1]$, the final condition is $\inf_{\vartheta}\|\beta-\vartheta\|_{p}\leq\gamma$. Therefore, assuming the dual problem is bounded from above, it reduces to

\begin{align*}
&\sup_{0\leq\alpha\leq 1}\sup_{\Delta}\left(-\alpha y\beta^{\sf T}\Delta-\gamma\|\Delta\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}\Delta|+\alpha(1-y\beta^{\sf T}x)\right)\\
={}&\sup_{0\leq\alpha\leq 1}\left(I_{\{\inf_{\vartheta}\|\beta-\vartheta\|_{p}\leq\gamma\}}+\alpha(1-y\beta^{\sf T}x)\right)\\
={}&(1-y\beta^{\sf T}x)^{+}\quad\text{given}\quad\inf_{\vartheta}\|\beta-\vartheta\|_{p}\leq\gamma.
\end{align*}

Finally, the dual form of the distributionally robust optimization problem is

\begin{align*}
&\inf_{\beta}\inf_{\gamma\geq 0}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{N}\sup_{u}\left((1-y_{i}\beta^{\sf T}u)^{+}-\gamma\|u-x_{i}\|_{q}-\gamma\lambda\sum_{m=1}^{M}|\theta_{m}^{\sf T}(x_{i}-u)|\right)\right\}\\
={}&\inf_{\beta}\inf_{\gamma\geq\inf_{\vartheta}\|\beta-\vartheta\|_{p}}\left\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{N}(1-y_{i}\beta^{\sf T}x_{i})^{+}\right\}\\
={}&\inf_{\beta,\vartheta}\left\{\frac{1}{n}\sum_{i=1}^{N}(1-y_{i}\beta^{\sf T}x_{i})^{+}+\delta\|\beta-\vartheta\|_{p}\right\}.
\end{align*}

This completes the proof.∎
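Once a feasible set for $\vartheta$ is fixed, the final expression above is an ordinary convex program in $(\beta,\vartheta)$. The snippet below is a minimal sketch (not the authors' implementation) that solves it with CVXPY, under the illustrative assumptions that $\vartheta$ ranges over the span of the source coefficients $\theta_1,\ldots,\theta_M$, that $p=2$ is used for the penalty norm, and that the data and radius are synthetic placeholders.

```python
# Sketch of the reformulated problem
#   min_{beta, a}  (1/N) * sum_i (1 - y_i * beta^T x_i)_+  +  delta * ||beta - Theta @ a||_2
# Assumptions (not from the paper's code): vartheta = Theta @ a ranges over the span of the
# source coefficients, p = 2, and the data below are synthetic placeholders.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, d, M = 50, 10, 2
X = rng.normal(size=(N, d))
Theta = rng.normal(size=(d, M))          # source knowledge (columns theta_1, ..., theta_M)
y = np.sign(X @ Theta[:, 0] + 0.5 * rng.normal(size=N))
delta = 0.1                              # Wasserstein radius / penalty level

beta = cp.Variable(d)
a = cp.Variable(M)                       # vartheta = Theta @ a
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ beta))) / N
penalty = delta * cp.norm(beta - Theta @ a, 2)
cp.Problem(cp.Minimize(hinge + penalty)).solve()
print(beta.value, a.value)
```

With $\delta=0$ the sketch reduces to ordinary hinge-loss minimization; larger $\delta$ shrinks $\beta$ toward the span of the source coefficients rather than toward zero.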

Appendix D Proof of Results in Mahalanobis Norm Regularization

Proof of Corollary 1.

This is a direct consequence of the convex conjugate of $\|x\|_{\Lambda}^{2}$ given in Proposition A4.∎

Define the cost function $c_{1,\infty}^{\Lambda}\big((x,y),(u,v)\big)\coloneqq\|x-u\|_{\Lambda}+\infty\cdot|y-v|+\infty\cdot\sum_{m=1}^{M}|\theta_{m}^{\mathsf{T}}x-\theta_{m}^{\mathsf{T}}u|$.

Corollary 2 (Theorem 3).

Suppose the loss function $\ell(X,Y;\beta)$ is either the logistic loss $\log\big(1+e^{-Y\beta^{\mathsf{T}}X}\big)$ or the hinge loss $(1-Y\beta^{\mathsf{T}}X)^{+}$. Then for any $\Lambda\in\mathbb{S}_{+}^{d\times d}$ we have

\begin{align*}
& \inf_{\beta\in\mathbb{R}^{d}}\,\sup_{\mathbb{P}\in\mathcal{B}_{\delta}(\mathbb{P}_{N};\,c^{\Lambda}_{1,\infty})}\mathbb{E}_{\mathbb{P}}\big[\ell(X,Y;\beta)\big]\\
={}& \inf_{\beta\in\mathbb{R}^{d},\,\vartheta\in\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(x_{i},y_{i};\beta)+\delta\|\beta-\vartheta\|_{\Lambda^{-1}}.
\end{align*}
Proof.

For the logistic loss, this is a direct consequence of the dual norm of $\|x\|_{\Lambda}$; for the hinge loss, it is a direct consequence of the convex conjugate of $\|x\|_{\Lambda}$. Both are given in Proposition A4.∎
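To make the corollary concrete, the following sketch (again illustrative, not the paper's code) fits the logistic-loss instance of the right-hand side with CVXPY. The Mahalanobis penalty $\|\beta-\vartheta\|_{\Lambda^{-1}}$ is implemented through a Cholesky factor of $\Lambda^{-1}$, and $\vartheta$ is fixed, purely for illustration, at a single external coefficient vector.

```python
# Sketch of the Mahalanobis-regularized ERM from Corollary 2 (logistic loss):
#   min_beta  (1/N) * sum_i log(1 + exp(-y_i * beta^T x_i))  +  delta * ||beta - vartheta||_{Lambda^{-1}}
# Assumptions: vartheta is one fixed source estimate; Lambda is a given SPD cost matrix; data are synthetic.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
N, d = 80, 5
X = rng.normal(size=(N, d))
vartheta = np.array([1.0, -0.5, 0.0, 0.3, 0.0])      # external knowledge (illustrative)
y = np.sign(X @ vartheta + rng.normal(size=N))
A = rng.normal(size=(d, d))
Lambda = A @ A.T + np.eye(d)                          # an SPD transport-cost matrix
L = np.linalg.cholesky(np.linalg.inv(Lambda))         # so ||v||_{Lambda^{-1}} = ||L.T @ v||_2
delta = 0.2

beta = cp.Variable(d)
loss = cp.sum(cp.logistic(-cp.multiply(y, X @ beta))) / N
penalty = delta * cp.norm(L.T @ (beta - vartheta), 2)
cp.Problem(cp.Minimize(loss + penalty)).solve()
print(np.round(beta.value, 3))
```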

Appendix E Useful Results on Convex Conjugation

In this section we review some results on convex conjugates that come up repeatedly in the proofs. For more details on convex conjugation, the interested reader may consult (Rockafellar, 1970, Sections 12 and 16).

Definition A1 (Convex Conjugate).

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a real-valued function on Euclidean space. The convex conjugate of $f$ is the function $f^{*}:\mathbb{R}^{d}\to\mathbb{R}$ that evaluates $x^{*}\in\mathbb{R}^{d}$ by

\[
f^{*}(x^{*})=\sup_{x\in\operatorname{dom}(f)}\big(x^{*\mathsf{T}}x-f(x)\big).
\]

This is also called the Legendre transformation of $f$; for $f$ defined on an arbitrary real topological vector space it is called the Legendre-Fenchel transformation. Here we collect some well-known examples of convex conjugates that appear in this appendix.

Proposition A2.

Let $p,q\geq 1$ be such that $\frac{1}{p}+\frac{1}{q}=1$.

1. The convex conjugate of the absolute value function $f(x)=|x|$ on $\mathbb{R}$ is $|\cdot|^{*}(x^{*})=I_{\{|x^{*}|\leq 1\}}(x^{*})$, the convex indicator function of the set $\{x^{*}\in\mathbb{R}:|x^{*}|\leq 1\}$.

2. The convex conjugate of the $q$-norm $\|x\|_{q}$ on $\mathbb{R}^{d}$ is $\|\cdot\|_{q}^{*}(x^{*})=I_{\{\|x^{*}\|_{p}\leq 1\}}(x^{*})$, the convex indicator function of the set $\{x^{*}\in\mathbb{R}^{d}:\|x^{*}\|_{p}\leq 1\}$.

3. The convex conjugate of $\frac{1}{2}\|x\|_{q}^{2}$ on $\mathbb{R}^{d}$ is $\big(\frac{1}{2}\|\cdot\|_{q}^{2}\big)^{*}(x^{*})=\frac{1}{2}\|x^{*}\|_{p}^{2}$.

4. The convex conjugate of $\log\big(1+e^{-x}\big)$ on $\mathbb{R}$ is
\[
\begin{cases}
x^{*}\log(x^{*})+(1-x^{*})\log(1-x^{*}) & \text{if } x^{*}\in(0,1),\\
0 & \text{if } x^{*}\in\{0,1\},\\
+\infty & \text{otherwise}.
\end{cases}
\]
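Items 2 and 3 can be sanity-checked numerically. The sketch below (illustrative only, not from the paper) approximates the supremum defining the conjugate of $\frac{1}{2}\|x\|_{q}^{2}$ with a generic optimizer and compares it with the closed form $\frac{1}{2}\|x^{*}\|_{p}^{2}$.

```python
# Numerical check of item 3: the conjugate of 0.5*||x||_q^2 is 0.5*||x*||_p^2, with 1/p + 1/q = 1.
import numpy as np
from scipy.optimize import minimize

q, d = 3.0, 4
p = q / (q - 1.0)
rng = np.random.default_rng(2)
x_star = rng.normal(size=d)

# f*(x*) = sup_x ( <x*, x> - 0.5*||x||_q^2 ), computed by minimizing the negative objective.
neg_obj = lambda x: -(x_star @ x - 0.5 * np.linalg.norm(x, q) ** 2)
numeric = -minimize(neg_obj, np.zeros(d)).fun
closed_form = 0.5 * np.linalg.norm(x_star, p) ** 2
print(numeric, closed_form)   # the two values should agree up to optimizer tolerance
```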

Another easy consequence of the definition of convex conjugation is the following pair of scaling laws.

Proposition A3 (Scaling Laws).

Let $f^{*}(x^{*})$ be the convex conjugate of $f(x)$ on $\mathbb{R}^{d}$. Then:

1. the convex conjugate of $f(ax)$ for $a\neq 0$ is $f^{*}(x^{*}/a)$;

2. the convex conjugate of $af(x)$ for $a>0$ is $af^{*}(x^{*}/a)$.
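A quick numerical check of the second scaling law (illustrative only): for $f(x)=\frac{1}{2}x^{2}$ on $\mathbb{R}$ one has $f^{*}(x^{*})=\frac{1}{2}(x^{*})^{2}$, so the conjugate of $af(x)$ should equal $af^{*}(x^{*}/a)=(x^{*})^{2}/(2a)$.

```python
# Check scaling law 2 with f(x) = 0.5*x^2 on R: (a*f)^*(s) = a * f^*(s/a) = s^2 / (2a).
import numpy as np
from scipy.optimize import minimize_scalar

a, s = 3.0, 1.7
conj_numeric = -minimize_scalar(lambda x: -(s * x - a * 0.5 * x ** 2)).fun
print(conj_numeric, s ** 2 / (2 * a))   # both should be approximately 0.482
```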

Let $\Gamma(\mathbb{R}^{d})$ denote the class of proper convex lower semi-continuous functions on $\mathbb{R}^{d}$. The next statement says that conjugation induces a one-to-one symmetric correspondence on the class $\Gamma(\mathbb{R}^{d})$. It is a cornerstone of modern convex analysis and is used in the proofs of Theorem 1 and Lemma A3.

Theorem A4 (Fenchel-Moreau).

Let $f$ be a proper convex, lower semi-continuous function on $\mathbb{R}^{d}$. Then:

1. the convex conjugation $f\mapsto f^{*}$ is a bijection on $\Gamma(\mathbb{R}^{d})$;

2. $f\in\Gamma(\mathbb{R}^{d})\iff f^{**}=f$.

Proof.

For a proof, please consult (Rockafellar, 1970, Section 12).∎
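A discretized illustration of the biconjugation identity $f^{**}=f$ (not part of the paper): conjugating a convex function twice on a grid recovers the function up to discretization error.

```python
# Grid-based check of f** = f for a finite-valued convex (hence lower semi-continuous) function.
import numpy as np

xs = np.linspace(-3, 3, 601)
f = np.abs(xs) + 0.5 * xs ** 2            # a proper convex function on the grid
slopes = np.linspace(-5, 5, 1001)
f_star = np.max(slopes[:, None] * xs[None, :] - f[None, :], axis=1)         # f*(s) on the slope grid
f_bistar = np.max(xs[:, None] * slopes[None, :] - f_star[None, :], axis=1)  # f**(x) on the original grid
print(np.max(np.abs(f_bistar - f)))       # small, limited only by the grid resolution
```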

The next statement concerns how convex conjugation interacts with the summation of functions. It is highly useful and is applied in the proofs of Theorem 1 and Theorem 3.

Theorem A5 (Infimal Convolution Property of Convex Conjugation).

Let $f_{1},\ldots,f_{M}$ be proper convex functions on $\mathbb{R}^{d}$. Then
\[
(f_{1}\,\square\,\cdots\,\square\,f_{M})^{*}=f_{1}^{*}+\cdots+f_{M}^{*},
\]
and
\[
(f_{1}+\cdots+f_{M})^{*}(x^{*})=\inf_{x_{1}^{*},\ldots,x_{M}^{*}}\big\{f_{1}^{*}(x_{1}^{*})+\cdots+f_{M}^{*}(x_{M}^{*})\;\big|\;x_{1}^{*}+\cdots+x_{M}^{*}=x^{*}\big\}.
\]
Proof.

For a proof, please consult (Rockafellar, 1970, Theorem 16.4).∎
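Theorem A5 can be checked on a small example (illustrative only): with $f_{1}(x)=\frac{1}{2}x^{2}$ and $f_{2}(x)=|x|$ on $\mathbb{R}$, both sides of the second identity equal $\frac{1}{2}\max(|x^{*}|-1,0)^{2}$.

```python
# Check of Theorem A5 with f1(x) = 0.5*x^2 and f2(x) = |x| on R:
#   (f1 + f2)^*(s) = inf{ f1^*(s1) + f2^*(s2) : s1 + s2 = s } = 0.5 * max(|s| - 1, 0)^2.
import numpy as np
from scipy.optimize import minimize_scalar

s = 2.3
# Left-hand side: direct conjugate of f1 + f2.
lhs = -minimize_scalar(lambda x: -(s * x - 0.5 * x ** 2 - abs(x))).fun
# Right-hand side: split s = s1 + s2, with f1^*(s1) = 0.5*s1^2 and f2^*(s2) = indicator{|s2| <= 1},
# i.e. minimize 0.5*(s - s2)^2 over |s2| <= 1.
rhs = minimize_scalar(lambda s2: 0.5 * (s - s2) ** 2, bounds=(-1.0, 1.0), method="bounded").fun
print(lhs, rhs, 0.5 * max(abs(s) - 1.0, 0.0) ** 2)   # all three are approximately 0.845
```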

Proposition A4.

Let $\Lambda\in\mathbb{S}_{+}^{d\times d}$. Then the dual norm of $\|x\|_{\Lambda}$ is $\|x\|_{\Lambda^{-1}}$. The Cauchy-Schwarz inequality $x^{\mathsf{T}}u\leq\|x\|_{\Lambda}\|u\|_{\Lambda^{-1}}$ holds, and equality is attainable. The convex conjugate of $\|x\|_{\Lambda}$ is $I_{\{\|x^{*}\|_{\Lambda^{-1}}\leq 1\}}(x^{*})$, and the convex conjugate of $\|x\|_{\Lambda}^{2}$ is $\|x^{*}\|_{\Lambda^{-1}}^{2}/4$.

Proof.

The dual norm of $\|x\|_{\Lambda}$, the Cauchy-Schwarz inequality, and the attainability of equality follow from (Blanchet et al., 2019b, Lemma 1). To compute the convex conjugate of $\|x\|_{\Lambda}^{2}$, we evaluate

\[
\sup_{x\in\mathbb{R}^{d}}\big(x^{*\mathsf{T}}x-\|x\|_{\Lambda}^{2}\big).
\]

By the Cauchy-Schwarz inequality we have $x^{*\mathsf{T}}x\leq\|x\|_{\Lambda}\|x^{*}\|_{\Lambda^{-1}}$, and so

\[
x^{*\mathsf{T}}x-\|x\|_{\Lambda}^{2}\leq\|x\|_{\Lambda}\|x^{*}\|_{\Lambda^{-1}}-\|x\|_{\Lambda}^{2}.
\]

Hence

\[
\sup_{x\in\mathbb{R}^{d}}\big(x^{*\mathsf{T}}x-\|x\|_{\Lambda}^{2}\big)\leq\sup_{t\geq 0}\big(t\|x^{*}\|_{\Lambda^{-1}}-t^{2}\big)=\frac{1}{4}\|x^{*}\|_{\Lambda^{-1}}^{2}.
\]

By the attainability of equality in the Cauchy-Schwarz inequality, the two suprema are equal, so

\[
\sup_{x\in\mathbb{R}^{d}}\big(x^{*\mathsf{T}}x-\|x\|_{\Lambda}^{2}\big)=\frac{1}{4}\|x^{*}\|_{\Lambda^{-1}}^{2}.
\]

This proves the formula for the convex conjugate of $\|x\|_{\Lambda}^{2}$. Now consider the convex conjugate of $\|x\|_{\Lambda}$, for which we need to evaluate

\[
\sup_{x\in\mathbb{R}^{d}}\big(x^{*\mathsf{T}}x-\|x\|_{\Lambda}\big),
\]

again, by Cauchy-Schwarz and the attainability of equality, we have

\begin{align*}
\sup_{x\in\mathbb{R}^{d}}\big(x^{*\mathsf{T}}x-\|x\|_{\Lambda}\big)
&=\sup_{x\in\mathbb{R}^{d}}\big(\|x\|_{\Lambda}\|x^{*}\|_{\Lambda^{-1}}-\|x\|_{\Lambda}\big)\\
&=\sup_{x\in\mathbb{R}^{d}}\Big(\|x\|_{\Lambda}\big(\|x^{*}\|_{\Lambda^{-1}}-1\big)\Big)\\
&=\begin{cases}0 & \text{if } \|x^{*}\|_{\Lambda^{-1}}\leq 1,\\ +\infty & \text{otherwise}.\end{cases}
\end{align*}

This completes the proof.∎
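The conjugate formula just proved can also be verified numerically. The sketch below is illustrative only, with $\|x\|_{\Lambda}=\sqrt{x^{\mathsf{T}}\Lambda x}$ and a randomly generated SPD $\Lambda$; it compares the numerically computed supremum with $\|x^{*}\|_{\Lambda^{-1}}^{2}/4$.

```python
# Numerical check that the conjugate of ||x||_Lambda^2 = x^T Lambda x is ||x*||_{Lambda^{-1}}^2 / 4.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
Lambda = A @ A.T + np.eye(d)                     # a random SPD matrix
x_star = rng.normal(size=d)

neg_obj = lambda x: -(x_star @ x - x @ Lambda @ x)
numeric = -minimize(neg_obj, np.zeros(d)).fun
closed_form = 0.25 * x_star @ np.linalg.inv(Lambda) @ x_star
print(numeric, closed_form)                      # should agree up to solver tolerance
```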

Appendix F Toland’s Duality

Toland's duality theory (Toland, 1978, 1979) concerns the minimization of nonconvex functions; in particular, it applies to the minimization of differences of convex functions (DC problems). The duality holds under minimal conditions, and it allows a DC problem to be transformed into one that is more tractable.

Theorem A6 (Toland’s Duality).

Let $f$ and $g$ be functions on $\mathbb{R}^{d}$. If $g\in\Gamma(\mathbb{R}^{d})$, then
\[
\inf_{x\in\mathbb{R}^{d}}\big\{f(x)-g(x)\big\}=\inf_{x^{*}\in\mathbb{R}^{d}}\big\{g^{*}(x^{*})-f^{*}(x^{*})\big\}.
\]

Toland's duality is used implicitly in the proofs of Theorem 1 and Lemma A3, which also sketch a proof of the duality theorem above.
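As an illustration (not taken from the paper), Toland's duality can be verified on a one-dimensional DC program with $f(x)=\frac{1}{2}x^{2}$ and $g(x)=x$, where $g$ is proper, convex, and lower semi-continuous as the theorem requires; both the primal and dual infima equal $-\frac{1}{2}$.

```python
# Toland's duality on a small DC example:  f(x) = 0.5*x^2,  g(x) = x  (g is convex, lsc, proper).
#   primal:  inf_x { f(x) - g(x) } = inf_x { 0.5*x^2 - x } = -0.5   (attained at x = 1)
#   dual:    g*(s) = indicator{s = 1},  f*(s) = 0.5*s^2,  so inf_s { g*(s) - f*(s) } = -0.5  (at s = 1)
from scipy.optimize import minimize_scalar

primal = minimize_scalar(lambda x: 0.5 * x ** 2 - x).fun
dual = 0.0 - 0.5 * 1.0 ** 2          # the dual objective is finite only at s = 1
print(primal, dual)                  # both equal -0.5
```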
