Publications

You can also find my articles on my Google Scholar profile.

Journal Articles


  1. Xuran Meng and Yi Li,
    Inference for Deep Neural Network Estimators in Generalized Nonparametric Models.  — Journal of the American Statistical Association, 2026
    ▸ Abstract

    While deep neural networks (DNNs) are used for prediction, inference on DNN-estimated subject-specific means for categorical or exponential family outcomes remains underexplored. We address this by proposing a DNN estimator under generalized nonparametric regression models (GNRMs) and developing a rigorous inference framework. Unlike existing approaches that assume independence between estimation errors and inputs to establish the error bound, a condition often violated in GNRMs, we allow for dependence and our theoretical analysis demonstrates the feasibility of drawing inference under GNRMs. To implement inference, we consider an Ensemble Subsampling Method (ESM) that leverages U-statistics and the Hoeffding decomposition to construct reliable confidence intervals for DNN estimates. We show that, under GNRM settings, ESM enables model-free variance estimation and accounts for heterogeneity among individuals in the population. Through simulations under nonparametric logistic, Poisson, and binomial regression models, we demonstrate the effectiveness and efficiency of our method. We further apply the method to the electronic Intensive Care Unit (eICU) dataset, a large-scale collection of anonymized health records from ICU patients, to predict ICU readmission risk and offer patient-centric insights for clinical decision making.
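As an illustration of the ensemble-subsampling idea, the sketch below builds an interval for a subject-specific mean from many subsampled fits. The local-average base learner, the subsample sizes, and the naive plug-in standard error are all illustrative stand-ins, not the paper's method, which uses DNN estimators and a Hoeffding-decomposition variance:

```python
import random
import statistics

def esm_interval(x, y, x0, n_sub, B, bandwidth=0.5, z=1.96):
    """Illustrative ensemble-subsampling interval for m(x0) = E[Y | X = x0].

    Each of the B ensemble members is fit on a random subsample of size
    n_sub; the "learner" here is a crude local average around x0, a
    stand-in for the paper's DNN estimator.
    """
    n = len(x)
    estimates = []
    for _ in range(B):
        idx = random.sample(range(n), n_sub)
        near = [y[i] for i in idx if abs(x[i] - x0) < bandwidth]
        if near:
            estimates.append(statistics.fmean(near))
    center = statistics.fmean(estimates)
    # Naive plug-in standard error across ensemble members; the paper's
    # Hoeffding-decomposition variance instead accounts for the overlap
    # between subsamples.
    se = statistics.stdev(estimates) / len(estimates) ** 0.5
    return center - z * se, center + z * se

random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(500)]
ys = [xi ** 2 + random.gauss(0, 0.3) for xi in xs]
lo, hi = esm_interval(xs, ys, x0=1.0, n_sub=100, B=200)
print(f"interval for m(1.0): ({lo:.2f}, {hi:.2f})")
```

The spread of the subsample estimates drives the interval width; with a correlation-aware variance in place of the naive one, this is the structure the ESM exploits.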

  2. Stephan G Frangakis, Xuran Meng, Mark C Bicket, Vidhya Gunaseelan, Sawsan As-Sanie, Andrew Urquhart, Yi Li and Chad M Brummett,
    Two-part Statistical Model for Identifying Baseline Predictors of Chronic Postsurgical Pain.  — Anesthesiology, 2026
    ▸ Abstract

    A substantial proportion of patients report no pain after surgery, resulting in an excess of zero values that pose challenges for analysis using traditional statistical models. The present study was designed to test the hypothesis that a two-part model, commonly used in healthcare expenditures research, would demonstrate superior performance in predicting postsurgical pain when compared to traditional models, and would secondarily better identify predictors of this clinically important outcome.

  3. Xuran Meng, Jingfei Zhang and Yi Li,
    Statistical Inference on High Dimensional Gaussian Graphical Regression Models.  — Biometrics, 2025
    ▸ Abstract

    Gaussian graphical regressions have emerged as a powerful approach for regressing the precision matrix of a Gaussian graphical model on covariates, which, unlike traditional Gaussian graphical models, can help determine how graphs are modulated by high dimensional subject-level covariates, and recover both the population-level and subject-level graphs. To fit the model, a multi-task learning approach achieves lower error rates compared to node-wise regressions. However, due to the high complexity and dimensionality of the Gaussian graphical regression problem, the important task of statistical inference remains unexplored. We propose a class of debiased estimators based on multi-task learners for statistical inference in Gaussian graphical regressions. We show that debiasing can be performed quickly and separately for the multi-task learners. In a key debiasing step that estimates the inverse covariance matrix, we propose a novel projection technique that dramatically reduces computational costs in optimization to scale only with the sample size $n$. We show that our debiased estimators enjoy a fast convergence rate and asymptotically follow a normal distribution, enabling valid statistical inference such as constructing confidence intervals and performing hypothesis testing. Simulation studies confirm the practical utility of the proposed approach, and we further apply it to analyze gene co-expression graph data from a brain cancer study, revealing meaningful biological relationships.

  4. Xuran Meng and Yi Li,
    Xuran Meng and Yi Li's contribution to the Discussion of “On optimal linear prediction” by I. Helland.  — Scandinavian Journal of Statistics, 2025

  5. Xuran Meng, Yuan Cao and Weichen Wang,
    Estimation of Out-of-Sample Sharpe Ratio for High Dimensional Portfolio Optimization.  — Journal of the American Statistical Association, 2025
    ▸ Abstract

    Portfolio optimization aims at constructing a realistic portfolio with significant out-of-sample performance, which is typically measured by the out-of-sample Sharpe ratio. However, due to in-sample optimism, it is inappropriate to use the in-sample estimated covariance to evaluate the out-of-sample Sharpe ratio, especially in high dimensional settings. In this paper, we propose a novel method to estimate the out-of-sample Sharpe ratio using only in-sample data, based on random matrix theory. Furthermore, portfolio managers can use the estimated out-of-sample Sharpe ratio as a criterion to decide the best tuning for constructing their portfolios. Specifically, we consider the classical framework of Markowitz mean-variance portfolio optimization with known mean vector and the high dimensional regime of $p/n \to c$, where $p$ is the portfolio dimension and $n$ is the number of samples or time points. We propose to correct the sample covariance by a regularization matrix and provide a consistent estimator of its Sharpe ratio. The new estimator works well under any of three conditions: (1) bounded covariance spectrum, (2) arbitrary number of diverging spikes when $c < 1$, and (3) fixed number of diverging spikes when $c \geq 1$. We also extend the results to construct the global minimum variance portfolio and to correct the out-of-sample efficient frontier.
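A toy numerical illustration of the in-sample optimism discussed above, with two assets and plug-in tangency weights; the paper's setting is high-dimensional with a regularized covariance and random-matrix corrections, so everything below (the parameter values, the sample size, the number of replications) is a simplified stand-in:

```python
import random

def sharpe(w, mu, cov):
    """Sharpe ratio of weights w under mean vector mu and 2x2 covariance cov."""
    m = w[0] * mu[0] + w[1] * mu[1]
    v = (w[0] ** 2 * cov[0][0] + 2 * w[0] * w[1] * cov[0][1]
         + w[1] ** 2 * cov[1][1])
    return m / v ** 0.5

def tangency(mu, cov):
    """Markowitz tangency weights w proportional to cov^{-1} mu, 2x2 case."""
    det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
    return ((cov[1][1] * mu[0] - cov[0][1] * mu[1]) / det,
            (cov[0][0] * mu[1] - cov[0][1] * mu[0]) / det)

TRUE_MU = (0.05, 0.03)
TRUE_COV = ((0.04, 0.01), (0.01, 0.09))
CHOL = ((0.2, 0.0), (0.05, 0.0875 ** 0.5))  # Cholesky factor of TRUE_COV

def draw(n, rng):
    """n bivariate normal returns with the true mean and covariance."""
    out = []
    for _ in range(n):
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((TRUE_MU[0] + CHOL[0][0] * z0,
                    TRUE_MU[1] + CHOL[1][0] * z0 + CHOL[1][1] * z1))
    return out

def estimate(sample):
    """Sample mean and (unbiased) sample covariance."""
    n = len(sample)
    mu = tuple(sum(r[i] for r in sample) / n for i in range(2))
    c = [[0.0, 0.0], [0.0, 0.0]]
    for r in sample:
        d = (r[0] - mu[0], r[1] - mu[1])
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / (n - 1)
    return mu, c

rng = random.Random(7)
in_s, out_s, reps, n = 0.0, 0.0, 200, 30
for _ in range(reps):
    mu_hat, cov_hat = estimate(draw(n, rng))
    w = tangency(mu_hat, cov_hat)
    in_s += sharpe(w, mu_hat, cov_hat) / reps     # what the manager sees
    out_s += sharpe(w, TRUE_MU, TRUE_COV) / reps  # what the portfolio delivers
print(f"in-sample Sharpe {in_s:.3f} vs out-of-sample {out_s:.3f}")
```

Even with only two assets and thirty observations, the Sharpe ratio computed from the in-sample estimates systematically exceeds the Sharpe ratio the same weights achieve under the true parameters; the gap is what grows severe as $p/n \to c$.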

  6. Xuran Meng, Yuan Cao and Difan Zou,
    Per-Example Gradient Regularization Improves Learning Signals from Noisy Data.  — Machine Learning, 2025
    ▸ Abstract

    Gradient regularization, as described in Barrett and Dherin (2021), is a highly effective technique for promoting flat minima during gradient descent. Empirical evidence suggests that this regularization technique can significantly enhance the robustness of deep learning models against noisy perturbations, while also reducing test error. In this paper, we explore per-example gradient regularization (PEGR) and present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations. Specifically, we adopt a signal-noise data model from Cao et al. (2022) and show that PEGR can learn signals effectively while suppressing noise. In contrast, standard gradient descent struggles to distinguish the signal from the noise, leading to suboptimal generalization performance. Our analysis reveals that PEGR penalizes the variance of pattern learning, thus effectively suppressing the memorization of noise from the training data. These findings underscore the importance of variance control in deep learning training and offer useful insights for developing more effective training approaches.

  7. Xuran Meng, Jianfeng Yao and Yuan Cao,
    Multiple Descent in the Multiple Random Feature Model.  — Journal of Machine Learning Research, 2024
    ▸ Abstract

    Recent works have demonstrated a double descent phenomenon in over-parameterized learning; despite this attention, the phenomenon has not been fully understood in theory. In this paper, we investigate the multiple descent phenomenon in a class of multi-component prediction models. We first consider a "double random feature model" (DRFM) concatenating two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk under the high dimensional framework where the training sample size, the dimension of data, and the dimension of random features tend to infinity proportionally. Based on the calculation, we further theoretically demonstrate that the risk curves of DRFMs can exhibit triple descent. We then provide a thorough experimental study to verify our theory. Finally, we extend our study to the "multiple random feature model" (MRFM), and show that MRFMs ensembling $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis points out that risk curves with a specific number of descents generally exist in learning multi-component prediction models.

  8. Xuran Meng and Jianfeng Yao,
    Impact of classification difficulty on the weight matrices spectra in Deep Learning and application to early-stopping.  — Journal of Machine Learning Research, 2023
    ▸ Abstract

    Much recent research effort has been devoted to explaining the success of deep learning. Random Matrix Theory (RMT) provides an emerging approach to this end by analyzing the spectra of large random matrices involved in a trained deep neural network (DNN), such as weight matrices or Hessian matrices in the stochastic gradient descent algorithm. To better understand the spectra of weight matrices, we conduct extensive experiments on weight matrices under different settings of layers, networks and data sets. Following the previous work of Martin and Mahoney (2021), spectra of weight matrices at the terminal stage of training are classified into three main types: Light Tail (LT), Bulk Transition period (BT) and Heavy Tail (HT). These different types, especially HT, implicitly indicate some regularization in the DNNs. In this paper, inspired by Martin and Mahoney (2021), we identify the difficulty of the classification problem as an important factor in the appearance of HT in weight matrix spectra: the higher the classification difficulty, the higher the chance for HT to appear. Moreover, the classification difficulty can be affected either by the signal-to-noise ratio of the dataset or by the complexity of the classification problem (complex features, large number of classes). Leveraging this finding, we further propose a spectral criterion to detect the appearance of HT and use it to stop the training process early, without testing data. Such early-stopped DNNs avoid overfitting and unnecessary extra training while preserving a comparable generalization ability. These findings are validated on several networks (LeNet, MiniAlexNet and VGG), using Gaussian synthetic data and real data sets (MNIST and CIFAR10).

  9. Jing Zhang, Shuguang Zhang and Xuran Meng,
    l1–2 minimisation for compressed sensing with partially known signal support.  — Electronics Letters, 2020
    ▸ Abstract

    In this study, we discuss robust signal recovery by l1–2 minimisation incorporating prior support information, which is not considered in previous works. A robust recovery condition is established and a recovery error estimate is obtained; in particular, the results generalise the state-of-the-art ones. In addition, by proposing a modified algorithm, the numerical experiments show that incorporating prior support information into l1–2 minimisation exhibits better recovery performance than standard l1–2 minimisation.

  10. Xuran Meng, Xiuchun Bi and Shuguang Zhang,
    High frequency algorithm and its back-testing results based on GAN.  — JUSTC, 2020
    ▸ Abstract

    In financial classification tasks, the heavy noise and low information ratio of financial data mean that a traditional supervised-learning regime, which depends strongly on data labels, can amplify the influence of noise. A GAN (generative adversarial network) can learn the characteristics of the data and reduce the influence of noise, and it performs well when applied to financial data. We apply a GAN to high frequency trading: data are labeled or left unlabeled based on their volatility; adversarial training between the generative network G and the discriminative network D then learns the intrinsic characteristics of the data; and finally the well-trained D yields an up/down classification model and a quantitative strategy. The sample is based on futures data, and the final results show that the LSTM model trained by the GAN outperforms deep learning models such as an LSTM with supervised training and a logistic regression model.

  11. Yu Pan, Xuran Meng and Wuqing Ning,
    A Local Existence Theorem for a Parabolic Blow-Up Inverse Problem.  — Pure Mathematics, 2017
    ▸ Abstract

    In this article, we study an inverse problem for a parabolic equation with blow-up initial and boundary values of the following form: $u_t - u_{xx} = f(x)u - b(x,t)u^p$ ($p>1$, $0<x<1$, $0<t<T$). The inverse problem is to determine the unknown function $f(x)$ from the blow-up rates and the additional observation data. In order to partly remove the blow-up data, we introduce the definition of a δ-line, which allows us to add the observable data and simplifies the inverse problem into a classical one. Then, by establishing a related functional, we prove a local existence theorem for the inverse problem in some given closed domain.

Conference Papers


  1. Chenyang Zhang, Xuran Meng and Yuan Cao,
    Transformer learns optimal variable selection in group-sparse classification.  — International Conference on Learning Representations, 2025
    ▸ Abstract

    Transformers have demonstrated remarkable success across various applications. However, this success has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.

  2. Xuran Meng, Difan Zou and Yuan Cao,
    Benign Overfitting in Two-Layer ReLU Convolutional Neural Networks for XOR Data.  — International Conference on Machine Learning, 2024
    ▸ Abstract

    Modern deep learning models are usually highly over-parameterized so that they can overfit the training data. Surprisingly, such overfitting neural networks can usually still achieve high prediction accuracy. To study this "benign overfitting" phenomenon, a line of recent works has theoretically studied the learning of linear models and two-layer neural networks. However, most of these analyses are still limited to the very simple learning problems where the Bayes-optimal classifier is linear. In this work, we investigate a class of XOR-type classification tasks with label-flipping noises. We show that, under a certain condition on the sample complexity and signal-to-noise ratio, an over-parameterized ReLU CNN trained by gradient descent can achieve near Bayes-optimal accuracy. Moreover, we also establish a matching lower bound result showing that when the previous condition is not satisfied, the prediction accuracy of the obtained CNN is an absolute constant away from the Bayes-optimal rate. Our result demonstrates that CNNs have a remarkable capacity to efficiently learn XOR problems, even in the presence of highly correlated features.

Preprints


  1. Sattwik Ghosal, Xuran Meng and Yi Li,
    Beyond Consistency: Inference for the Relative Risk Functional in Deep Nonparametric Cox Models.  — arXiv, 2026
    ▸ Abstract

    There remain theoretical gaps in deep neural network estimators for the nonparametric Cox proportional hazards model. In particular, it is unclear how gradient-based optimization error propagates to population risk under partial likelihood, how pointwise bias can be controlled to permit valid inference, and how ensemble-based uncertainty quantification behaves under realistic variance decay regimes. We develop an asymptotic distribution theory for deep Cox estimators that addresses these issues. First, we establish nonasymptotic oracle inequalities for general trained networks that link in-sample optimization error to population risk without requiring the exact empirical risk optimizer. We then construct a structured neural parameterization that achieves infinity-norm approximation rates compatible with the oracle bound, yielding control of the pointwise bias. Under these conditions and using the Hajek–Hoeffding projection, we prove pointwise and multivariate asymptotic normality for subsampled ensemble estimators. We derive a range of subsample sizes that balances bias correction with the requirement that the Hajek–Hoeffding projection remain dominant. This range accommodates decay conditions on the single-overlap covariance, which measures how strongly a single shared observation influences the estimator, and is weaker than those imposed in the subsampling literature. An infinitesimal jackknife representation provides analytic covariance estimation and valid Wald-type inference for relative risk contrasts such as log-hazard ratios. Finally, we illustrate the finite-sample implications of the theory through simulations and a real data application.

  2. Hua Yuan, Xuran Meng et al.,
    Towards Understanding Feature Learning in Parameter Transfer.  — arXiv, 2025
    ▸ Abstract

    Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. Numerical experiments and real-world data experiments are conducted to empirically validate our theoretical findings.

  3. Yidong Wang et al.,
    Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future.  — arXiv, 2025
    ▸ Abstract

    Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose **Temporal Self-Rewarding Language Models** that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) *Anchored Rejection*, fixing rejected responses using the past initial model's outputs, and (2) *Future-Guided Chosen*, dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama 3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computational resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.

  4. Shuning Shang, Xuran Meng, Yuan Cao and Difan Zou,
    Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers.  — arXiv, 2024
    ▸ Abstract

    Benign overfitting refers to how over-parameterized neural networks can fit training data perfectly and generalize well to unseen data. While this has been widely investigated theoretically, existing works are limited to two-layer networks with fixed output layers, where only the hidden weights are trained. We extend the analysis to two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers, which is closer to practice. Our results show that the initialization scaling of the output layer is crucial to the training dynamics. Large scales make training behave similarly to the fixed-output case: the hidden layer grows rapidly while the output layer remains largely unchanged. In contrast, small scales result in more complex layer interactions: the hidden layer initially grows to a specific ratio relative to the output layer, after which both layers grow jointly and maintain that ratio throughout training. Furthermore, in both settings, we provide nearly matching upper and lower bounds on the test errors, identifying the sharp conditions on the initialization scaling and signal-to-noise ratio (SNR) under which benign overfitting can or cannot be achieved. Numerical experiments back up the theoretical results.