Zitao Wang (Department of Statistics, Columbia University), Ziyuan Wang (Department of Industrial Engineering and Management Sciences, Northwestern University), Molei Liu (Department of Biostatistics, Columbia Mailman School of Public Health), and Nian Si∗ (Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology). These authors contributed equally to this work.
Abstract
Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that adaptively incorporates multiple sources of external knowledge to overcome the conservativeness of vanilla WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.
Keywords: Wasserstein distributionally robust optimization; Knowledge-guided learning; Difference-of-convex optimization; Shrinkage-based transfer learning.
1 Introduction
Traditional machine learning methods based on empirical risk minimization often suffer from overfitting and a lack of generalization power, particularly in high-dimensional, small-sample settings. In recent years, distributionally robust optimization (DRO) has emerged as a powerful framework for mitigating the effects of model misspecification and enhancing the robustness of machine learning generalization. Among various DRO formulations, Wasserstein DRO (WDRO) has gained particular attention due to its tractability and generalizability. Specifically, in WDRO, one optimizes against the worst-case distribution within an ambiguity set defined by a Wasserstein ball centered at an empirical measure.
However, one persistent challenge with WDRO is its tendency to be overly conservative, which can lead to suboptimal performance in practice, as found in Liu et al. (2024). In many real-world scenarios, prior knowledge can be leveraged to improve model performance and robustness. For example, with electronic health record data, prior knowledge might come from predictive models trained on existing large, population-wide datasets. In such a context, transfer learning has proven to be a versatile approach for improving performance on a target task. Despite its successes, the integration of prior knowledge into WDRO frameworks has remained an open question.
In this work, we introduce Knowledge-Guided Wasserstein Distributionally Robust Optimization (KG-WDRO), a novel framework that adapts the Wasserstein ambiguity set using external knowledge (parameters). We assume access to prior predictors from pre-trained models, which can guide the predictive model in the target dataset. By constraining the transport cost along directions informed by prior knowledge, our approach addresses the conservativeness of vanilla WDRO while preserving robustness. Intuitively, this strategy allows the model to focus its uncertainty on regions where prior knowledge is less reliable, effectively robustifying knowledge-guided generalization.
1.1 Related Works
1.1.1 Wasserstein DRO
Wasserstein DRO has recently garnered significant attention due to its tractability (Blanchet and Murthy, 2019; Mohajerin Esfahani and Kuhn, 2018; Gao and Kleywegt, 2023) and generalizability (Blanchet et al., 2019a; Gao et al., 2022). Notably, Blanchet et al. (2019a) and Gao et al. (2022) demonstrate that Wasserstein DRO with the mean squared loss is equivalent to the square-root lasso (Belloni et al., 2011). Similarly, Shafieezadeh-Abadeh et al. (2015, 2019), Blanchet et al. (2019a), and Gao et al. (2022) establish that Wasserstein DRO with logistic loss and hinge loss corresponds to their regularized counterparts. Moreover, the statistical properties of the WDRO estimator have been investigated in Blanchet et al. (2021, 2022) and Gao (2023). However, leveraging external knowledge in Wasserstein DRO has been an open problem.
1.1.2 Transfer Learning
Improving prediction accuracy for target populations by integrating diverse source datasets has driven methodological advances in transfer learning. Contemporary approaches aim to address challenges including distributional heterogeneity and limited labeled target data. A common assumption is that the target outcome model aligns partially with source models, enabling knowledge transfer. For example, recent frameworks employ selective parameter reduction to identify transferable sources and sparse or ridge shrinkage to leverage their knowledge (Bastani, 2020; Li et al., 2021; Tian and Feng, 2023). Subsequent works tackle covariate distribution mismatches and semi-supervised scenarios, enhancing robustness when labeled target data are scarce (Cai et al., 2024; He et al., 2024; Zhou et al., 2024). Further innovations include geometric or profile-based adaptations, where the target model is represented as a weighted combination of source coefficients (Gu et al., 2024; Lin et al., 2024).
Methods | Ridge-type | Lasso-type | Scale Adjustment | Continuous Outcome | Binary Outcome | Partial Transfer | Multi-Source Ensemble |
KG-WDRO | |||||||
Bastani, (2020) | |||||||
Li etal., (2021) | |||||||
Tian and Feng, (2023) | |||||||
Gu etal., (2024) | |||||||
Lin etal., (2024) |
1.2 Our Contribution
Our contributions are fourfold. Framework: We introduce KG-WDRO, a principled and flexible framework that integrates prior knowledge into WDRO for linear regression and binary classification. This framework mitigates the conservativeness of standard WDRO, enables automated covariate scaling adjustments, and prevents negative transfer. Theory: We establish the equivalence between KG-WDRO and shrinkage-based estimation methods, offering a novel perspective that unifies and interprets a broad range of knowledge transfer learning approaches through the lens of distributional robustness. Table 1 provides an overview of these approaches, highlighting their key capabilities and advantages and comparing them with our framework. Technicalities: Leveraging Toland's duality (Theorem A6), we reformulate the innermost maximization in WDRO's strong duality (Proposition 1) into a univariate optimization problem (B). This reformulation enhances tractability while accommodating more general cost functions. Empirical Validation: Through extensive experiments, we demonstrate the effectiveness of KG-WDRO in improving small-sample transfer learning.
Below is an overview of our main results for the linear regression case.
Example 1.
Suppose is an accessible prior predictor for a linear model parameterized with . We show that the shrinkage-based transfer-learning regression problem, which estimates a target predictor by solving
can be interpreted as a Wasserstein distributionally robust optimization (WDRO) problem of the form (WDRO), where the loss function is least squares and the ambiguity set is a ball around the empirical measure. The cost function augments the standard transport cost with a constraint on perturbations along the direction of the prior predictor, so that
This establishes a distributionally robust optimization (DRO) perspective on a broad class of transfer-learning methods as will be discussed in Section 3.
1.3 Notations & Organizations
We summarize the mathematical notations used in this work. The positive integers , , and denote, respectively, the target sample size, the number of sources, and the dimension of the support of the covariate . The integers and are reserved for pairs of Hölder conjugates, whose reciprocals sum to one for conjugates strictly between one and infinity, together with the pair one and infinity. For a distribution supported on the Euclidean space , we use to denote the empirical measure of with sample size . In modeling the target-covariate relationship, the distribution is often factorized as . For a vector , denotes the -norm, where , and denotes the transpose of . For any two vectors , the notation denotes the cosine of the angle between and , calculated by . All vectors are assumed to be column vectors. Other specialized notations are defined in context as needed.
The remainder of the paper is organized as follows. Section 2 provides a review of the WDRO framework, including the strong duality result. In Section 3, we introduce our KG-WDRO framework and demonstrate its equivalence to shrinkage-based estimations in both linear regression and binary classification. Section 4 presents comprehensive results from our numerical simulations. All proofs and detailed descriptions of the numerical simulation setups are provided in the appendix.
2 Preliminaries
We begin with a short overview of the distributionally robust framework for statistical learning.
2.1 Optimal Transport Cost
Let and denote two probability distributions supported on , and we use to label the set of all probability measures on the product space . We say that an element has first marginal and second marginal if
for all Borel measurable sets . The class of all such measures is collected as , called the set of transport plans, which is always non-empty. Choose a non-negative, lower semi-continuous function such that whenever ; then Kantorovich's formulation of optimal transport is defined as
It is well known (Villani, 2009, Theorem 4.1) that there exists an optimal coupling that solves Kantorovich's problem. Intuitively, we may think of the value as the cost of transferring one unit of mass from to ; then gives the average cost of transferring under the plan . The optimal transport cost gives a measure of discrepancy between probability distributions on .
If defines a metric on , then for any the optimal transport cost,
defines a metric between probability distributions and metrizes weak convergence under moment assumptions. It is called the -Wasserstein distance. We direct interested readers to (Villani, 2009, Chapter 6) for more details. It is worth mentioning that none of our judiciously chosen cost functions qualify as metrics on the support of the data.
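To make the Kantorovich problem concrete, the following minimal sketch (ours, not part of the original exposition) computes the optimal transport cost between two empirical measures with equally weighted atoms; in that case the problem reduces to an assignment problem, and the squared Euclidean ground cost is chosen purely for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Two empirical measures, each with n equally weighted atoms in R^d.
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # atoms of the first measure
U = rng.normal(size=(n, d))   # atoms of the second measure

# Illustrative ground cost c(x, u) = ||x - u||_2^2.
C = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)

# With uniform weights and equal numbers of atoms, an optimal coupling can be
# taken to be a permutation, so the Kantorovich problem is an assignment problem.
row, col = linear_sum_assignment(C)
ot_cost = C[row, col].mean()   # average cost under the optimal plan
print(f"optimal transport cost: {ot_cost:.4f}")
```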
2.2 Distributionally Robust Optimization
In the standard statistical learning framework, one generally assumes that the target-covariate pair follows a data-generating distribution on the support . One then seeks a 'best' parameter that relates to through a parameterized model by solving the stochastic optimization,
The loss function provides a quantification of the goodness-of-fit in the parameter given the realized observation . Since only samples are observed, we can typically only solve the empirical objective,
The distribution that underlies the data-generating mechanism is therefore uncertain to the decision-maker. This motivates the distributionally robust optimization (DRO) framework, which entails solving the following minimax stochastic program:
where the ambiguity set represents a class of probability measures supported on that are candidates for the true data-generating distribution. In Wasserstein DRO, the ambiguity set is constructed by forming a ball, in the decision-maker-defined transport cost , around the canonical empirical measure; i.e., we choose the ambiguity set as:
(WDRO)
This ambiguity set captures probability measures that are close to the observed empirical measure in the transport cost , and it may be taken as a class of candidate measures perturbed from the empirical one. The solution that minimizes the worst-case expected loss should perform well over the entire set of perturbations in the ambiguity set. This is in contrast to the empirical solution of (2.2), which is only guaranteed to perform well on the training samples. This adds a layer of robustness to the WDRO problem (WDRO). For a comprehensive overview of different constructions of ambiguity sets, we direct the interested reader to (Kuhn et al., 2024, Section 2).
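For reference, the resulting min-max problem can be written in the following standard form, where the symbols (radius $\delta$, loss $\ell$, transport cost $c$, parameter $\beta$, and empirical measure $\mathbb{P}_n$) are placeholders we introduce to match the surrounding description:
\[
\min_{\beta}\;\sup_{\mathbb{P}\,:\,\mathcal{D}_c(\mathbb{P},\,\mathbb{P}_n)\le\delta}\;\mathbb{E}_{\mathbb{P}}\bigl[\ell(X,Y;\beta)\bigr],
\qquad
\mathcal{U}_{\delta}(\mathbb{P}_n)=\bigl\{\mathbb{P}\,:\,\mathcal{D}_c(\mathbb{P},\mathbb{P}_n)\le\delta\bigr\}.
\]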
2.3 Strong Duality of Wasserstein DRO
The Wasserstein DRO problem involves an inner maximization over an infinite-dimensional set, which appears computationally intractable. However, when the center of the ambiguity set is a discrete (e.g., empirical) distribution, strong duality reformulates the Wasserstein DRO problem as a simple univariate optimization.
Proposition 1 (Strong Duality, (Blanchet et al., 2019a, Proposition 1)).
Let be a lower semi-continuous cost function satisfying whenever . Then the distributionally robust regression problem
is equivalent to,
where is given by,
For more general results, see (Blanchet and Murthy, 2019, Theorem 1) and (Gao et al., 2022, Section 2). The exchangeability of and in Wasserstein DRO is also established by (Blanchet et al., 2019a, Lemma 1).
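In the same placeholder notation (with $\gamma\ge 0$ the dual variable; the symbols are ours, not the paper's), the duality in Proposition 1 takes the standard form
\[
\sup_{\mathbb{P}:\,\mathcal{D}_c(\mathbb{P},\mathbb{P}_n)\le\delta}\mathbb{E}_{\mathbb{P}}\bigl[\ell(X,Y;\beta)\bigr]
=\inf_{\gamma\ge 0}\Bigl\{\gamma\delta+\frac{1}{n}\sum_{i=1}^{n}\sup_{(x,y)}\bigl[\ell(x,y;\beta)-\gamma\,c\bigl((x,y),(x_i,y_i)\bigr)\bigr]\Bigr\},
\]
so that the inner problem reduces to pointwise maximizations over the data points and a univariate search over $\gamma$.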
3 Knowledge-Guided Wasserstein DRO
In this section, we propose new cost functions for the Wasserstein DRO framework that leverage prior knowledge for transfer learning. For linear regression and binary classification, these cost functions act as regularizers, encouraging collinearity with prior knowledge.
3.1 Knowledge-Guided Transport Cost
It is shown in (Blanchet et al., 2019a, Theorem 1) that using the squared -norm on the covariates as the cost function
(1)
equates Wasserstein distributionally robust linear regression with -norm regularization of the root mean squared error (RMSE). The cost function perturbs only the observed covariates , while keeping the observed targets fixed. Keeping the observed targets fixed often leads to a more mathematically tractable reformulation; another intuition is that we trust the mechanism by which the target is generated once the covariates are known.
In the presence of prior knowledge that may aid in inferring , we aim to control the extent of perturbation along the direction of .
Specifically, we constrain the size of the prediction discrepancy , where . To achieve this goal, we henceforth augment the cost function with an additional penalty term that accounts for the size of the perturbation in the direction of :
(2)
where and is a non-negative, monotone increasing function of such that . Recall that in the cost function , the targets remain fixed. Intuitively, the new cost function (2) encourages the Wasserstein ambiguity set to include distributions whose marginals in generate predictions that align with the data based on the prior predictor . The parameter controls the level of confidence in the prior knowledge. We call such cost functions knowledge-guided. Since upper bounds the cost function , we have whenever .
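One representative instance of such a knowledge-guided cost in the regression setting, written in placeholder notation of our own (single prior vector $w$, confidence parameter $\kappa$, and a nonnegative increasing function $\phi$ with $\phi(0)=0$), is
\[
c_{\kappa}\bigl((x,y),(u,v)\bigr)
=\|x-u\|_{q}^{2}
+\kappa\,\phi\!\bigl(\lvert w^{\top}(x-u)\rvert\bigr)
+\infty\cdot\mathbf{1}\{y\neq v\},
\]
with weak transferring corresponding to a finite $\kappa$ and strong transferring to the limit $\kappa\to\infty$.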
The corresponding optimal transport problem, given by:
can also be expressed as:
This formulation regularizes the original optimal transport problem by penalizing large values of the expectation .
For any user-defined function that measures the discrepancy in generalization with respect to the prior knowledge , we refer to it as weak-transferring of knowledge if , and strong-transferring of knowledge if . In the case of strong-transferring, to ensure the finiteness of the optimal transport problem, the minimizing transport plan must satisfy the orthogonality condition , -almost surely. Consequently, the value of remains unchanged after perturbing within . As a result, this should promote as .
Remark 1.
The above framework extends to incorporate multi-site prior knowledge, meaning that instead of a single prior knowledge coefficient , we consider a set of coefficients . Let represent the linear span of these prior knowledge coefficients. In the case of strong-transferring, we must ensure that ; otherwise, the set of orthogonality conditions would imply that the perturbation is identically zero (). This would render the ambiguity set redundant and reduce the WDRO problem (WDRO) to the ERM problem (2.2). This result is confirmed by the statements of Theorems 1 and 3.
3.2 Linear Regression
We begin by examining the WDRO problem (WDRO) for linear regression within the strong-transferring domain. Following this, we present a specific case within the weak-transferring domain. Let represent the linear span of the prior knowledge.
3.2.1 Strong-Transferring
Define the cost function , and for a set of observed samples , we use . Without making any additional distributional assumptions on , we obtain the following finite-dimensional representation.
Theorem 1 (Linear Regression with Strong-Transferring).
Consider the least-squares loss ; then for any we have
where is such that .
From the above result, we observe that the knowledge-guided WDRO problem for linear regression is equivalent to regularizing the RMSE with a -norm distance to the linear span . The regularization parameter is entirely determined by the size (or radius) of the Wasserstein ambiguity set. Importantly, the penalty term focuses on the collinearity with the prior knowledge rather than their algebraic difference or angular proximity.
Consider the case where there is only a single prior knowledge vector : the penalty term does not constrain the solution to be close to itself, but rather to a scalar multiple of it, with the scalar to be optimized. Consequently, this knowledge transfer automatically robustifies the solution against scaling of the covariates. Furthermore, it can prevent negative transfer by adapting the sign of the scalar to be positively correlated with , the solution to the population objective (2.2). When , the penalty term becomes dominant, forcing to lie in for any . This reduces the WDRO problem to a simple constrained regression problem,
reflecting complete reliance on the prior knowledge while preventing excessive shrinkage toward the null estimator.
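As a numerical illustration of the equivalent penalized objective in Theorem 1, the sketch below (ours; it assumes a single prior vector w, the Euclidean norm for the distance to the span, and a generic derivative-free optimizer, with lam standing in for the radius-determined regularization level) jointly optimizes the coefficient vector and the scaling factor.

```python
import numpy as np
from scipy.optimize import minimize

def kg_penalized_regression(X, y, w, lam):
    """Sketch of the penalized form in Theorem 1 (single prior, 2-norm):
    minimize over (beta, a):  RMSE(beta) + lam * ||beta - a * w||_2,
    i.e. root-mean-squared error plus the distance of beta to span{w}."""
    n, d = X.shape

    def objective(theta):
        beta, a = theta[:d], theta[d]
        rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
        dist = np.linalg.norm(beta - a * w)   # distance to the line spanned by w
        return rmse + lam * dist

    res = minimize(objective, np.zeros(d + 1), method="Nelder-Mead",
                   options={"maxiter": 50000, "fatol": 1e-10, "xatol": 1e-10})
    return res.x[:d], res.x[d]

# Toy usage: the true coefficient is nearly collinear with the prior.
rng = np.random.default_rng(1)
n, d = 30, 5
w = rng.normal(size=d)
beta_true = 1.5 * w + 0.1 * rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta_true + 0.5 * rng.normal(size=n)
beta_hat, a_hat = kg_penalized_regression(X, y, w, lam=0.5)
print("estimated scaling factor a:", round(a_hat, 3))
```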
Remark 2.
We now discuss two special cases of the penalty term, (ridge-type regularization) and (lasso-type regularization). For simplicity, we consider the case of a single prior knowledge vector .
Ridge-type. The penalty term can be explicitly calculated as
where is the component of orthogonal to . This penalty term shrinks distance to the line in the direction of . Furthermore, note that
which represents a trade-off between the magnitude of and its angular proximity to the prior knowledge . This trade-off is illustrated in the leftmost panel of Fig. 1, which draws the feasibility set of the regularization as a constraint. This regularization is closely related to, but different from, the one proposed in Gu et al. (2024), who penalize large values of a computational relaxation of .
Lasso-type. When the prior knowledge is sparse, the penalty term promotes sparse representation learning. Consider a simple example where the dimension is and . In this case, we have:
where . This formulation enforces sparsity only on the last two components of , reflecting the sparsity pattern of .
3.2.2 Weak Transferring
For the special case of , we define the weak-transferring cost function with . Here, we select as the user-defined function controlling the size of the perturbation in . For simplicity, we consider a single prior knowledge vector in this setup. The result can be straightforwardly extended to a multi-source setup with different values of ’s.
Theorem 2 (Linear Regression with Weak Transferring).
Consider the least-squares loss ; then for we have
with and .
Write , we note that as , we have recovering the projection matrix onto the prior knowledge . Consequently, .
We observe that the action
is exactly the ridge regression of onto with a regularization parameter . Thus, the finiteness of , which reflects caution about the prior knowledge , induces a shrinkage effect on the component of explainable by in the dot-product geometry. Since , we have for any finite ; this implies the inclusion of feasibility sets
as plotted in Fig. 1 for an illustration on . The contour forms an ellipse centered at the origin . The ellipse has a major axis of half-length aligned with the direction of , and a minor axis of half-length aligned with the direction of . As , representing no confidence in , the half-length of the major axis converges to , resulting in a perfect circle as in ridge regression.
The two-dimensional hyperparameter enables the use of data-driven methods, such as grid-search cross-validation, for hyperparameter tuning. Unlike the strong-transferring domain, the inclusion of allows the data to determine the informativeness of the source samples on their own.
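The shrinkage interpretation above can be checked numerically. The sketch below (ours) assumes that the shrinkage acts as a ridge regression of the coefficient onto the prior direction with penalty proportional to 1/kappa (our parameterization), and verifies that the shrunk projection converges to the ordinary orthogonal projection as kappa grows, mirroring the limit described in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
w = rng.normal(size=d)      # prior direction
beta = rng.normal(size=d)   # an arbitrary coefficient vector

def ridge_projection(beta, w, kappa):
    """Ridge regression of beta onto the single direction w with penalty 1/kappa:
    a = argmin_a ||beta - a * w||^2 + (1 / kappa) * a^2 (assumed parameterization)."""
    a = (w @ beta) / (w @ w + 1.0 / kappa)
    return a * w

# Ordinary orthogonal projection of beta onto span{w}.
proj = (w @ beta) / (w @ w) * w

for kappa in [0.1, 1.0, 10.0, 1e4]:
    gap = np.linalg.norm(ridge_projection(beta, w, kappa) - proj)
    print(f"kappa = {kappa:>8}: distance to orthogonal projection = {gap:.6f}")
# The gap shrinks to zero as kappa grows, i.e. full confidence in the prior
# recovers the exact projection onto the prior direction.
```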
3.3 Binary Classifications
In this section, we focus on the context of binary classification, where the goal is to predict the discrete label based on the covariates . Unlike the previous section, we use the -norm, rather than its square, to account for distributional ambiguity in the covariate distribution. Define the strong-transferring cost function accordingly. We consider two loss functions here. The logistic loss function is given by
which is the negative log-likelihood of the model that postulates
The hinge loss is given by
which is typically used for training classifiers that seek maximum-margin class boundaries, most notably support vector machines.
Suppose is binary; then, without any distributional assumptions on , we have the following result, which recovers regularized logistic regression and support vector machines.
Theorem 3 (Binary Classification with Strong Transferring).
Suppose the loss function is either the logistic loss or the hinge loss ; then for any we have
where is such that .
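A parallel sketch for the classification case (ours; it again assumes a single prior vector w, the Euclidean norm, labels in {-1, +1}, and lam in place of the radius-determined regularization level) penalizes the logistic loss by the distance of the coefficient to the span of the prior.

```python
import numpy as np
from scipy.optimize import minimize

def kg_penalized_logistic(X, y, w, lam):
    """Sketch of the penalized form in Theorem 3 (logistic loss, single prior):
    minimize over (beta, a):  mean log(1 + exp(-y * X beta)) + lam * ||beta - a * w||_2."""
    n, d = X.shape

    def objective(theta):
        beta, a = theta[:d], theta[d]
        margins = y * (X @ beta)
        logistic = np.mean(np.logaddexp(0.0, -margins))   # stable log(1 + e^{-m})
        return logistic + lam * np.linalg.norm(beta - a * w)

    res = minimize(objective, np.zeros(d + 1), method="Nelder-Mead",
                   options={"maxiter": 50000})
    return res.x[:d], res.x[d]

# Toy usage with labels in {-1, +1}.
rng = np.random.default_rng(3)
n, d = 60, 4
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w + 0.3 * rng.normal(size=n))
beta_hat, a_hat = kg_penalized_logistic(X, y, w, lam=0.1)
print("estimated scaling factor a:", round(a_hat, 3))
```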
3.4 Sub-Coefficient-Vector Transferring
In this subsection, we generalize the statements of Theorems 1 and 3, for , to arbitrary norms induced by positive-definite quadratic forms. Let be a positive-definite symmetric matrix. The norm induces a metric on , defined as , known as the Mahalanobis distance. Since is positive definite, it admits a decomposition with invertible, and the norm measures length in the geometry distorted by . By (Blanchet et al., 2019b, Lemma 1), the dual norm of is . Using Proposition A4, the statements of Theorems 1 and 3 can be easily generalized. Define the space of positive-definite symmetric matrices as and the cost function: .
Corollary 1 (Theorem 1).
For the least-squares loss and any :
This formulation enables the use of metric learning methods to determine directly from the data, as detailed in Blanchet et al. (2019b). For example, if the two-dimensional prior is known to primarily influence the first component of the truth , we can select with . This imposes a weaker penalty on perturbations in the first direction, resulting in a weighted penalty term: , which prioritizes aligning with , while is determined more flexibly based on the data. We call this sub-coefficient-vector transferring, or the ability to partially transfer prior knowledge. A similar corollary applies to Theorem 3, as stated in Corollary 2.
Finally, we again draw the reader’s attention to Table 1, which compares several transfer learning methods discussed in Section 1.1.2. Notably, our proposed KG-WDRO framework brings together a broad range of desirable capabilities within a single, unified approach to transfer learning.
4 Numerical Results

In this section, we present numerical simulations to validate the effectiveness of the proposed KG-WDRO method. We compare learners across different settings, including high-dimensional sparse models, correlated covariates, and multi-source prior knowledge, for either linear regression or binary classification tasks. Performance is evaluated using out-of-sample classification error for binary classifiers and out-of-sample for linear regressors.
For the single-source experiments, target-source coefficient pairs are generated from a multivariate normal distribution:
where is the correlation between and , and the expected length of is approximately . We scale as with to study the stabilizing effects of strong prior knowledge in small-sample settings. The dimension-to-sample ratio is varied by fixing the dimension and increasing the sample size. Performance is averaged over 100 simulations. Each dataset consists of three parts: data = (train, val, test), where the training and validation sets have the same size and hyperparameters are selected based on validation performance. The source data contain 800 samples, from which the source truth is estimated. Out-of-sample performance is measured on a test set of 5000 data points.
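A minimal sketch of one way to draw such a correlated target-source coefficient pair (ours; it assumes each coordinate pair is bivariate normal with common variance and correlation rho, consistent with the scheme described in the appendix) is given below.

```python
import numpy as np

def generate_coef_pair(dim, rho, sigma2=1.0, seed=0):
    """Draw a (target, source) coefficient pair whose coordinates are i.i.d.
    bivariate normal with common variance sigma2 and correlation rho."""
    rng = np.random.default_rng(seed)
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    draws = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=dim)
    beta, w = draws[:, 0], draws[:, 1]
    return beta, w

beta, w = generate_coef_pair(dim=50, rho=0.8)
print("empirical correlation:", round(float(np.corrcoef(beta, w)[0, 1]), 3))
```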
4.1 Simulation 1: Logistic with -Strong Transferring
In the first experiment, we compare two learners for binary classification tasks with high-dimensional sparse coefficients against our proposed KG-WDRO learner, , derived using Theorem 3 with . The competing learners are the target-only vanilla WDRO learner (Blanchet et al., 2019a, Theorem 2) and , obtained via the -Trans-GLM algorithm (Tian and Feng, 2023, Algorithm 1). The target-source pair is generated using (4) with a dimension of 50 and augmented with 100 zeros for sparsity, resulting in a total dimension of 150. We test six settings, varying the sample size , signal strength , and truth-prior correlation .
The comparison between and is highly competitive, with consistently outperforming by up to in accuracy when the sample size is small () across all values of , as shown in the upper-left plot of Figure 2. In larger sample size scenarios, both learners perform similarly (see Table A1 for detailed results). Both transfer learning methods, and , significantly outperform the target-only learner, .
4.2 Simulation 2: Linear Regression with -Weak Transferring
In this simulation, we compare two learners on high-dimensional linear regression with correlated covariates against our proposed learners, (Theorem 2) and (Theorem 1), both using . There is no sparsity in the regression coefficients. The competing learners are the target-only vanilla WDRO learner (Blanchet et al., 2019a, Theorem 1) and the Trans-Ridge algorithm adapted from (Li et al., 2021, Algorithm 1), denoted as . The covariates are fixed at dimension 100, with a pairwise correlation of 0.3. The experiment is conducted across six settings, varying the sample size , signal strength , and truth-prior correlation .
As shown in the upper-right plot of Figure 2, the performance of lags significantly behind both and until the correlation becomes sufficiently high. Across all settings, and consistently outperform when is moderate or low, as documented in Table A2. Furthermore, all three transfer learning methods demonstrate superior performance compared to the target-only learner, .
4.3 Simulation 3: Transfer Learning with Multiple Sites
In the final set of experiments, we validate our methods in a multi-source transfer learning setting with high-dimensional sparse linear regression. The significant components of the three source coefficients are generated using (4) with correlation and dimension 50, denoted as . We construct a linear combination, , and generate , where , ensuring . The target coefficient is then scaled to match the magnitude of , and all vectors are augmented with 100 zeros, yielding a total dimension of 150. Our proposed method, (Theorem 1, ), is compared against the oracle Trans-Lasso algorithm (Li et al., 2021, Algorithm 1) () and the vanilla WDRO learner . The experiment spans six settings: and , with and , respectively. Sample sizes vary in . The truth-prior correlation ranges in .
When , the contributions of the ’s to the generation of are unequal. In this case, it is not surprising that outperforms , as shown in the bottom-left plot of Figure 2. When is an equal-weighted average of the ’s (), the performance of and becomes similar. However, still demonstrates superior performance in larger sample sizes and higher correlations, as documented in Table A3.
5 Conclusion
We propose the knowledge-guided Wasserstein distributionally robust optimization (KG-WDRO) framework, which utilizes prior knowledge of predictors to mitigate the over-conservativeness of conventional DRO methods. We establish tractable reformulations and demonstrate their superior performance compared to other methods. For future work, we aim to provide statistical guarantees for our proposed estimators. Furthermore, based on these statistical properties, we plan to develop a principled approach for selecting hyperparameters such as and .
References
- Bastani, H. (2020). Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984.
- Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.
- Blanchet, J., Kang, Y., and Murthy, K. (2019a). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
- Blanchet, J., Kang, Y., Murthy, K., and Zhang, F. (2019b). Data-driven optimal transport cost selection for distributionally robust optimization. In 2019 Winter Simulation Conference (WSC), pages 3740–3751.
- Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
- Blanchet, J., Murthy, K., and Nguyen, V. A. (2021). Statistical analysis of Wasserstein distributionally robust estimators. In Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pages 227–254. INFORMS.
- Blanchet, J., Murthy, K., and Si, N. (2022). Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2):295–315.
- Cai, T., Li, M., and Liu, M. (2024). Semi-supervised triply robust inductive transfer learning. Journal of the American Statistical Association, pages 1–14.
- Gao, R. (2023). Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Operations Research, 71(6):2291–2306.
- Gao, R., Chen, X., and Kleywegt, A. J. (2022). Wasserstein distributionally robust optimization and variation regularization. Operations Research, 72(3):1177–1191.
- Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655.
- Gu, T., Han, Y., and Duan, R. (2024). Robust angle-based transfer learning in high dimensions. Journal of the Royal Statistical Society Series B: Statistical Methodology.
- He, Z., Sun, Y., and Li, R. (2024). Transfusion: Covariate-shift robust transfer learning for high-dimensional regression. In International Conference on Artificial Intelligence and Statistics, pages 703–711. PMLR.
- Kuhn, D., Shafiee, S., and Wiesemann, W. (2024). Distributionally robust optimization. arXiv preprint arXiv:2411.02549.
- Li, S., Cai, T. T., and Li, H. (2021). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):149–173.
- Lin, Z., Zhao, J., Wang, F., and Wang, H. (2024). Profiled transfer learning for high dimensional linear model. arXiv preprint arXiv:2406.00701.
- Liu, J., Wang, T., Cui, P., and Namkoong, H. (2024). Rethinking distribution shifts: Empirical analysis and inductive modeling for tabular data. arXiv preprint arXiv:2307.05284.
- Luenberger, D. G. and Ye, Y. (2008). Linear and Nonlinear Programming. International Series in Operations Research & Management Science. Springer, New York, NY.
- Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166.
- Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ.
- Shafieezadeh-Abadeh, S., Esfahani, P. M., and Kuhn, D. (2015). Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, volume 28.
- Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. (2019). Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68.
- Tian, Y. and Feng, Y. (2023). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, 118(544):2684–2697.
- Toland, J. F. (1978). Duality in nonconvex optimization. Journal of Mathematical Analysis and Applications, 66(2):399–415.
- Toland, J. F. (1979). A duality principle for non-convex optimisation and the calculus of variations. Archive for Rational Mechanics and Analysis, 71:41–61.
- Villani, C. (2009). Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 1st edition.
- Zhou, D., Li, M., Cai, T., and Liu, M. (2024). Model-assisted and knowledge-guided transfer regression for the underrepresented population. arXiv preprint arXiv:2410.06484.
Appendix A Additional Details in Numerical Results
This section provides details to supplement Section 4. We outline the data-generating distributions for all three sets of experiments, the hyperparameter grids, and the learners used to identify prior knowledge. We present the exact numerical results for all three sets of experiments. Recall that the notation represents the signal strength of the true parameter , which works by rescaling the magnitude of such that . The notation is the dimension of the covariates, and is the sample size. Finally, the symbol represents the correlation between the true and the prior .
A.1 Simulation Results
A.1.1 Simulation 1: Logistic Regression
Setting | WDRO | |||||||
KG-WDRO | 0.587 | 0.647 | 0.714 | 0.748 | 0.794 | 0.817 | 0.565 | |
Trans-GLM | 0.585 | 0.641 | 0.702 | 0.735 | 0.778 | 0.800 | - | |
KG-WDRO | 0.586 | 0.647 | 0.713 | 0.751 | 0.797 | 0.823 | 0.619 | |
Trans-GLM | 0.586 | 0.645 | 0.710 | 0.752 | 0.792 | 0.823 | - | |
KG-WDRO | 0.583 | 0.646 | 0.713 | 0.751 | 0.798 | 0.823 | 0.654 | |
Trans-GLM | 0.584 | 0.646 | 0.714 | 0.755 | 0.800 | 0.826 | - | |
KG-WDRO | 0.581 | 0.634 | 0.690 | 0.721 | 0.762 | 0.787 | 0.549 | |
Trans-GLM | 0.579 | 0.626 | 0.674 | 0.708 | 0.748 | 0.760 | - | |
KG-WDRO | 0.580 | 0.635 | 0.689 | 0.728 | 0.768 | 0.794 | 0.588 | |
Trans-GLM | 0.579 | 0.633 | 0.693 | 0.723 | 0.769 | 0.789 | - | |
KG-WDRO | 0.581 | 0.637 | 0.700 | 0.732 | 0.775 | 0.790 | 0.617 | |
Trans-GLM | 0.581 | 0.638 | 0.702 | 0.737 | 0.779 | 0.799 | - |
A.1.2 Simulation 2: Linear Regression
Setting | WDRO | |||||||
KG-WDRO (Strong) | 0.585 | 0.645 | 0.740 | 0.801 | 0.870 | 0.912 | 0.108 | |
KG-WDRO (Weak) | 0.583 | 0.646 | 0.741 | 0.800 | 0.871 | 0.910 | - | |
Trans-Ridge | 0.391 | 0.548 | 0.706 | 0.786 | 0.870 | 0.915 | - | |
KG-WDRO (Strong) | 0.707 | 0.745 | 0.803 | 0.843 | 0.894 | 0.924 | 0.513 | |
KG-WDRO (Weak) | 0.704 | 0.743 | 0.803 | 0.842 | 0.892 | 0.923 | - | |
Trans-Ridge | 0.599 | 0.692 | 0.788 | 0.838 | 0.893 | 0.925 | - | |
KG-WDRO (Strong) | 0.806 | 0.827 | 0.859 | 0.881 | 0.911 | 0.932 | 0.758 | |
KG-WDRO (Weak) | 0.804 | 0.825 | 0.857 | 0.880 | 0.910 | 0.930 | - | |
Trans-Ridge | 0.762 | 0.802 | 0.849 | 0.877 | 0.910 | 0.932 | - | |
KG-WDRO (Strong) | 0.563 | 0.621 | 0.716 | 0.777 | 0.850 | 0.894 | 0.030 | |
KG-WDRO (Weak) | 0.561 | 0.622 | 0.716 | 0.777 | 0.849 | 0.892 | - | |
Trans-Ridge | 0.213 | 0.405 | 0.600 | 0.700 | 0.803 | 0.858 | - | |
KG-WDRO (Strong) | 0.673 | 0.713 | 0.774 | 0.818 | 0.872 | 0.905 | 0.361 | |
KG-WDRO (Weak) | 0.670 | 0.710 | 0.774 | 0.816 | 0.869 | 0.903 | - | |
Trans-Ridge | 0.470 | 0.585 | 0.704 | 0.768 | 0.837 | 0.875 | - | |
KG-WDRO (Strong) | 0.768 | 0.791 | 0.826 | 0.851 | 0.886 | 0.911 | 0.703 | |
KG-WDRO (Weak) | 0.765 | 0.788 | 0.825 | 0.851 | 0.885 | 0.909 | - | |
Trans-Ridge | 0.671 | 0.724 | 0.785 | 0.821 | 0.863 | 0.890 | - |
A.1.3 Simulation 3: Multi-Sites
Here, recall that the notation denotes the correlation used to generate the three prior knowledge vectors under the scheme (4).
Setting | WDRO | |||||||
KG-WDRO | 0.560 | 0.640 | 0.713 | 0.783 | 0.850 | 0.916 | -0.584 | |
Trans-Lasso | 0.578 | 0.625 | 0.673 | 0.723 | 0.767 | 0.815 | - | |
KG-WDRO | 0.674 | 0.728 | 0.776 | 0.825 | 0.875 | 0.926 | 0.027 | |
Trans-Lasso | 0.666 | 0.697 | 0.732 | 0.770 | 0.808 | 0.850 | - | |
KG-WDRO | 0.793 | 0.820 | 0.848 | 0.878 | 0.907 | 0.939 | 0.375 | |
Trans-Lasso | 0.756 | 0.779 | 0.805 | 0.832 | 0.857 | 0.882 | - | |
KG-WDRO | 0.565 | 0.642 | 0.715 | 0.785 | 0.852 | 0.916 | -2.837 | |
Trans-Lasso | 0.628 | 0.680 | 0.735 | 0.790 | 0.838 | 0.889 | - | |
KG-WDRO | 0.673 | 0.729 | 0.778 | 0.829 | 0.877 | 0.928 | -0.015 | |
Trans-Lasso | 0.708 | 0.744 | 0.786 | 0.826 | 0.863 | 0.902 | - | |
KG-WDRO | 0.797 | 0.825 | 0.852 | 0.880 | 0.911 | 0.942 | 0.354 | |
Trans-Lasso | 0.794 | 0.820 | 0.844 | 0.868 | 0.894 | 0.919 | - |
A.2 Simulation Setup
Let denote a Bernoulli distribution with probability parameter , denote a uniform distribution supported on , and denote a univariate normal distribution with mean and variance .
A.2.1 Simulation 1: Logistic Regression
In this simulation, the coefficients are generated in a high-dimensional sparse setting. The dimension of the nonzero components is set to 50, which is then augmented with 100 zero components to introduce sparsity. The nonzero components of the true coefficient-prior pair are generated using the multivariate normal scheme in (4), with component variance and . The target labels are generated as , and the source labels are generated as , where . The sample size for is varied across , while the sample size for the source data is fixed at 800. Each dataset is paired with a validation set of the same size for hyperparameter selection.
Let denote a hyperparameter grid ranging from to with log-spaced values, and let denote a hyperparameter grid ranging from to with log-spaced values. The estimator is learned by selecting the best-performing hyperparameter on using the validation data. For the -Trans-GLM learner (Tian and Feng, 2023, Algorithm 1), the transferring step is optimized using , and the debiasing step is optimized using . For the KG-WDRO learner proposed in Theorem 3 with , the prior is first learned from the source data using the vanilla WDRO method on , followed by learning on with the learned prior as input.
The simulations are conducted on the parameter grid
with each configuration repeated times. The average results are reported.
A.2.2 Simulation 2: Linear Regression
In this simulation, the coefficients are generated in a high-dimensional correlated setting. The dimension of the coefficients is set to 100 and the components of the true coefficient-prior pair are generated using the multivariate normal scheme in (4), with component variance and . The target labels are generated as , and the source labels are generated as , where with
The sample size for is varied across , while the sample size for the source data is fixed at 800. Each dataset is paired with a validation set of the same size for hyperparameter selection.
Let denote a hyperparameter grid ranging from to with log-spaced values, and let denote a hyperparameter grid ranging from to with log-spaced values. The estimator is learned by selecting the best-performing hyperparameter on using the validation data. For the Trans-Ridge learner adapted from (Li et al., 2021, Algorithm 1), the transferring step is optimized using , and the debiasing step is optimized using . For the KG-WDRO learner proposed in Theorem 1 with , and the learner proposed in Theorem 2, the prior is first learned from the source data using the vanilla WDRO method on , followed by learning on with the learned prior as input. The grid for is 0.0001 to 8 with 20 log-spaced values.
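For concreteness, a log-spaced grid such as the one just described can be built as follows (the variable name is ours):

```python
import numpy as np

# 20 log-spaced values from 1e-4 to 8, matching the grid described above.
kappa_grid = np.logspace(np.log10(1e-4), np.log10(8.0), num=20)
print(kappa_grid[0], kappa_grid[-1], len(kappa_grid))
```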
The simulations are conducted on the parameter grid
with each configuration repeated times. The average results are reported.
A.2.3 Simulation 3: Multiple Sites
In this simulation, the coefficients are generated in a high-dimensional sparse setting. The dimension of the nonzero components is set to 50, which is then augmented with 100 zero components to introduce sparsity. The number of external sources is 3, and we generate their coefficients using the scheme (4). We construct a linear combination, , and generate the target coefficient , where , ensuring . The target coefficient is then scaled to match the magnitude of .
The target labels are generated as , and the source labels are generated as for , where with
The sample size for the target data ranges in .
Let denote a hyperparameter grid ranging from to with log-spaced values, and let denote a hyperparameter grid ranging from to with log-spaced values. The estimator is learned by selecting the best-performing hyperparameter on using the validation data. For the oracle Trans-Lasso learner (Li et al., 2021, Algorithm 1), the transferring step is optimized using , and the debiasing step is optimized using , with all three source datasets. For the KG-WDRO learner proposed in Theorem 1 with , the priors are first learned from the three source datasets using the vanilla WDRO method on , followed by learning on with the learned priors as input.
The simulations are conducted on the parameter grid
with each configuration repeated times. The average results are reported.
Appendix B Proof of Results in Regression.
Lemma A1.
Let be defined as depending on some and let be a non-negative real-valued function in . Then the convex conjugate is given by
Therefore the biconjugate of has representation:
Proof.
The convex conjugate is defined as
where and are taken as fixed values. Orthogonalize in the direction of with , and such that . Then , we have , and the convex conjugate becomes
Fixing , the objective is a negative quadratic function in ; hence the objective in is bounded from above by a finite value. Now, if is not orthogonal to , the term is unbounded, and the convex conjugate . If is orthogonal to , then the convex conjugate attains a finite value. Note that . Hence, conditional on , we have
where the optimal solution is , and the coefficient is given by the projection scalar .
The biconjugate
is therefore bounded from below if and only if . Let for some , then substituting we get the representation,
It can be readily verified that as promised by the Fenchel-Moreau Theorem (Theorem A4).∎
Lemma A2.
Let be defined as for some . Then the convex conjugate is given by
Therefore the convex conjugate of the function for some is given by
Proof.
The convex conjugate is defined as
again, orthogonalize , where and . Now by the change of variable , the convex conjugate is now
s.t.
Thus the convex conjugate if . If for some , then
where the last equality holds by noting that is the convex conjugate of the absolute value function (Proposition A2). This proves the formula for the convex conjugate of . Now , the convex conjugate of is
where the second line follows from the infimal convolution property of sum of convex conjugates (Theorem A5). We know that is finite if and only if for all , that is for some for all . Hence if and only if and for all . Finally we can calculate the convex conjugate by the scaling law of convex conjugates (Proposition A3) given . This concludes the proof.∎
We now give the proof to Theorem 1.
Proof of Theorem 1.
Let . Then first consider the cost function
where we replaced the transferring strength from to a finite-valued distance function that is a monotone function in , with . We will then let except at . Then the supremum function
is finite if and only if . Then, we have
Denote by , we get
Consider the objective in
where we let and . This is a convex + concave optimization, we express the convex part of as a supremum of infinitely many affine functions. Then by Lemma A1, we have , then we may write
(Toland’s Duality)
where is the convex conjugate of . Let , with and . Then we can compute the convex conjugate of using the infimal convolution property (Theorem A5). Then
We know that , where (Proposition A2). Now suppose for some , by Lemma A2, we have,
Then the convex conjugate is
which is equivalently,
Letting , we recover the cost function , and when , each is now free in . Then we have , with ; the validity of this tactic follows from (Luenberger and Ye, 2008, Theorem 1, Section 13.1). Then we have . Suppose ; then dividing by , we get
If , then , so the representation is valid for all . Therefore, following the proof of (Blanchet et al., 2019a, Theorem 1),
Then the minimization objective can be simplified as
where the last equality follows because is a convex function in that tends to approaching the boundaries and , so the optimization over can be solved using first-order condition. Then by Proposition 1, strong duality holds and,
This reduces the infinite-dimensional optimization to a finite-dimensional problem, where we interchanged and the quadratic function, since the quadratic function is monotone increasing on the positive reals.∎
The next proof is for Theorem 2 with the weak-transferring cost function with some . The statement generalizes to multiple sites by first orthogonalizing the prior knowledge .
Proof of Theorem 2.
Following the proof of Theorem 1, we solve the optimization problem
where we recall that is the dual variable in the statement of Proposition 1, is the transferring strength, is the prior knowledge, and is the residual in .
Then let be an orthogonal matrix whose first column is , and apply the corresponding change of basis. The objective function now becomes
where the last term follows because , and denotes the first component of . Now define
and consider the change of variable , then the last two terms become
Therefore, the objective becomes
which has finite optimal value whenever , with denoting the positive-definite symmetric matrix,
that is independent of the choice of . The first equality follows because we applied the Cauchy–Schwarz inequality and, since is free, there is some that achieves equality. The rest of the proof follows exactly along the lines of the proof of Theorem 1, by completing the optimization over the dual problem using Proposition 1.∎
Appendix C Proof of Results in Classification.
Lemma A3.
Consider the convex function by , for some and . Then for every , the constraint optimization problem defined as,
has optimal objective value,
where with .
Proof.
This lemma is a simple extension of (Shafieezadeh-Abadeh et al., 2015, Lemma 1). Following their proof, it is shown that
where
is the convex conjugate of the function (Proposition A2). Then it is shown that the objective must have the representation
Fixing and , then the inner maximization in
has solution subject to for some , derived using the first-order condition of Lagrangian duality or otherwise. Then, conditional on , the objective has the representation
Consider the constraint over . Suppose ; then dividing by , we get the equivalent constraint over . Define the change of variable ; then, since the Lagrange multiplier is free, is also free, and the constraint becomes over . If , then , so the equivalent constraint is valid for all . Then, conditional on , the objective becomes
Recognizing that
we get
∎
The above Lemma A3 is easily generalized to incorporate multiple orthogonality constraints using the exact same Lagrangian formulation. Again, recall . Thus the optimal objective value under multiple constraints becomes
We now give the proof to Theorem 3.
Proof of Theorem 3 for Logistic Loss.
Using Proposition 1, we apply the strong duality, and consider the inner optimization problem
For each , we apply Lemma A3 to the maximization problem,
which has solution
Therefore, the maximization problem is bounded from above if and only if . Under this condition, the inner optimization problem reduces to
This concludes the proof.∎
We now give the proof to the maximum margin classifier using the hinge loss.
Proof of Theorem 3 for Hinge Loss.
As in the proof of Theorem 1, we first consider the relaxed cost function
where we relaxed the transferring strength from to some finite value . We will then let . Again, by strong duality, we can solve the worst-case hinge loss by solving the dual problem
Let , then we have
where in the second equality we used . Fixing , consider the inner minimization in ,
where is the convex conjugate of . Set and , then by the infimal convolution property of convex conjugates (Theorem A5), we know that
From Lemma A2, we know that if is finite, then subject to and for all . Now it is well known that (Proposition A2),
where denotes the convex indicator on the set . Therefore, letting , the constraint is redundant, and we have
where we let . Therefore, is finite if and only if . Now , so we can remove , and this leaves us with the condition that , which is equivalent to for all , including . Taking the supremum over , the final condition is . Therefore, assuming the dual problem is bounded from above, it reduces to
Finally, the dual form of the distributionally robust optimization problem is
This completes the proof.∎
Appendix D Proof of Results in Mahalanobis Norm Regularization
Proof of Corollary 1.
This is a direct consequence of the convex conjugate of given in Proposition A4.∎
Define the cost function .
Corollary 2 (Theorem 3).
Suppose the loss function is either the logistic loss or the hinge loss ; then for any we have
Proof.
For the logistic loss case, this is a direct consequence of the dual norm of , for the hinge loss case this is a direct consequence of the convex conjugate of . Both given by Proposition A4.∎
Appendix E Useful Results on Convex Conjugation
In this section, we review some results on convex conjugates that come up repeatedly in the proofs. For more details on convex conjugation, interested readers may consult (Rockafellar, 1970, Sections 12 and 16).
Definition A1 (Convex Conjugate).
Let be a real-valued function on the Euclidean space, then the convex conjugate of is the function that evaluates by
This is also called the Legendre transformation of , and the Legendre-Fenchel transformation for defined on arbitrary real topological vector spaces. Here we collect some examples of convex conjugates that appear in the appendix. These are well known.
Proposition A2.
Let be such that .
- 1.
The convex conjugate of the absolute value function on is given by , the convex indicator function on the set .
- 2.
The convex conjugate of the -norm on is given by , the convex indicator function on the set .
- 3.
The convex conjugate of on is given by .
- 4.
The convex conjugate of on is given by
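For the reader's convenience, the first two conjugates above can be written out in standard form (our notation: $\chi_{A}$ denotes the convex indicator of a set $A$, and $p$, $q$ are Hölder conjugates):
\[
\bigl(\lvert\cdot\rvert\bigr)^{*}(s)=\chi_{[-1,1]}(s)=
\begin{cases}
0, & \lvert s\rvert\le 1,\\
+\infty, & \lvert s\rvert>1,
\end{cases}
\qquad
\bigl(\|\cdot\|_{q}\bigr)^{*}(s)=\chi_{\{\|s\|_{p}\le 1\}}(s),
\quad \tfrac{1}{p}+\tfrac{1}{q}=1.
\]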
Another easy consequence of the definition of convex conjugation is the following scaling laws.
Proposition A3 (Scaling Laws).
Let be the convex conjugate of on . Then we have,
- 1.
the convex conjugate of whenever is given by .
- 2.
the convex conjugate of whenever is given by .
Let denote the class of proper convex, lower semi-continuous functions on ; the next statement says that conjugation induces a one-to-one symmetric correspondence on this class. It is a cornerstone of modern convex analysis and is used in the proofs of Theorem 1 and Lemma A3.
Theorem A4 (Fenchel-Moreau).
Let be a proper convex, lower semi-continuous function on , then
- 1.
the convex conjugation is a bijection on ;
- 2.
.
Proof.
For a proof please consult (Rockafellar,, 1970, Section 12).∎
The next statement concerns the commutativity of convex conjugation and function summation. Its usefulness is profound, and it is applied in the proofs of Theorems 1 and 3.
Theorem A5 (Infimal Convolution Property of Convex Conjugation).
Let be proper convex functions on , then
and
Proof.
For a proof please consult (Rockafellar,, 1970, Theorem 16.4).∎
Proposition A4.
Let ; then the dual norm of is . The Cauchy–Schwarz inequality holds, and equality is attainable. The convex conjugate of is given by , and the convex conjugate of is given by .
Proof.
The dual norm of , the Cauchy–Schwarz inequality, and the attainability of equality follow from (Blanchet et al., 2019b, Lemma 1). Now, to compute the convex conjugate of , we want to evaluate
By the Cauchy-Schwarz inequality we have , and so we have
Hence
By attainability of equality in the Cauchy–Schwarz inequality, the two suprema are equal, and we have
This proves the formula for the convex conjugate of . Now consider the convex conjugate of ; we need to evaluate
again, by Cauchy-Schwarz and the attainability of equality, we have
This completes the proof.
∎
Appendix F Toland’s Duality
Toland's duality theory (Toland, 1978, 1979) concerns the minimization of nonconvex functions; in particular, it applies to the minimization of differences of convex functions (DC problems). The duality holds under minimal conditions, and one uses it to transform a DC problem into something more manageable.
Theorem A6 (Toland’s Duality).
Let and be functions on , if , then we have
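A standard way to state the identity (in our notation, with $g^{*}$ and $h^{*}$ the convex conjugates, under the assumption that $h$ is proper, convex, and lower semi-continuous) is
\[
\inf_{x\in\mathbb{R}^{d}}\bigl\{\,g(x)-h(x)\,\bigr\}
=\inf_{y\in\mathbb{R}^{d}}\bigl\{\,h^{*}(y)-g^{*}(y)\,\bigr\}.
\]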
Toland's duality is used implicitly in the proofs of Theorem 1 and Lemma A3, which also sketch a proof of the above duality theorem.