Amirhesam Abedsoltan

PhD Candidate at UC San Diego

About Me


Hi, I'm Amirhesam Abedsoltan, a Ph.D. candidate at UC San Diego. I am fortunate to be advised by Prof. Mikhail Belkin. My work broadly focuses on machine learning, spanning both theory and practice, with an interest in scaling methods and understanding how modern AI systems learn and generalize.

Education

Ph.D. in Computer Science, University of California San Diego, 2021 - Present
M.S. in Computer Science, University of Southern California, 2021
B.S. in Electrical Engineering, Sharif University of Technology, 2018

Recent News

Our paper, “Fast Training of Large Kernel Models with Delayed Projections,” was selected as a spotlight at NeurIPS.

September 2025

In this paper, we significantly reduced the runtime of kernel solvers from days to minutes by using delayed projections in Preconditioned Stochastic Gradient Descent.

I will be an AI Research Intern at Figma.

June 2025

I will be working on multi-agent workflows for prompt-to-design applications, extending AWS's agent-squad framework.

Our paper, “Task Generalization With AutoRegressive Compositional Structure: Can Learning From D Tasks Generalize to Dᵀ Tasks?” was accepted at ICML.

August 2025

We study task generalization through an autoregressive compositional framework, showing that training on only a small subset of tasks enables exponential generalization to a much larger class. As an example, we demonstrate that Transformers trained on sparse parity tasks generalize in-context through chain-of-thought reasoning, and we further extend these results to arithmetic and translation tasks.

Selected Publications


Task Generalization With AutoRegressive Compositional Structure: Can Learning From D Tasks Generalize to Dᵀ Tasks?

A Abedsoltan, H Zhang, K Wan, H Lin, J Zhang, M Belkin

In-context learning · Task generalization · Chain-of-Thought
2025

42nd International Conference on Machine Learning (ICML 2025)

Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a composition of T operations, and each operation is among a finite family of d subtasks. This yields a total class of size d^T. We first show that generalization to all d^T tasks is theoretically achievable by training on only Õ(d) tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via In-context Learning (ICL) and chain-of-thought (CoT) reasoning. We further show generalization in arithmetic and translation, beyond parity functions.
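To make the counting concrete, here is a small toy sketch (my own illustration under assumed sizes, not code from the paper): each task chains T sparse-parity subtasks drawn from a family of d, so the task class has size d^T even though only a handful of subtasks exist.

```python
import itertools
import random

# Toy illustration of the autoregressive compositional setup (assumed sizes,
# not the paper's experiments): a task is a chain of T operations, each drawn
# from a family of d sparse-parity subtasks, giving d**T composed tasks.
d, T, n_bits = 5, 3, 8

random.seed(0)
# Each subtask computes the parity (XOR) of two fixed input positions.
subtasks = [random.sample(range(n_bits), 2) for _ in range(d)]

def run_task(bits, ops):
    """Apply the chosen subtasks in sequence; each intermediate result is
    appended to the state, mimicking a chain-of-thought style trace."""
    bits, trace = list(bits), []
    for idx in ops:
        i, j = subtasks[idx]
        out = bits[i] ^ bits[j]
        trace.append(out)
        bits.append(out)
    return trace

all_tasks = list(itertools.product(range(d), repeat=T))   # d**T = 125 composed tasks
x = [random.randint(0, 1) for _ in range(n_bits)]
print(len(all_tasks), run_task(x, all_tasks[0]))
```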

Fast training of large kernel models with delayed projections

A Abedsoltan, S Ma, P Pandit, M Belkin

Kernel methods · Optimization · Preconditioning
2025

39th Conference on Neural Information Processing Systems (NeurIPS 2025, Spotlight)

Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes, the very ability that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD), allowing the training of much larger models than was previously feasible and pushing the practical limits of kernel-based learning. We validate our algorithm, EigenPro4, across multiple datasets, demonstrating drastic training speed-ups over existing methods while maintaining comparable or better classification accuracy.
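As a rough illustration of the delayed-projection idea (a minimal sketch under simplifying assumptions: plain SGD on squared loss with no preconditioning, and not the EigenPro4 code), one can keep a fixed center set, let stochastic updates accumulate on temporary centers, and only periodically project the model back onto the span of the fixed centers.

```python
import numpy as np

# Minimal sketch of delayed projection for a kernel model f(x) = sum_j a_j k(z_j, x)
# with a fixed center set Z (simplified: plain SGD, squared loss, no preconditioning).
# Each stochastic step would normally require projecting the update back onto
# span{k(z_j, .)}; here temporary centers are folded back only every `delay` steps.

def k(A, B, bw=5.0):
    """Laplacian kernel matrix between rows of A and rows of B."""
    return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(-1) / bw)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.sin(X.sum(1))                      # toy regression target

Z = X[:256]                               # fixed model centers
alpha = np.zeros(len(Z))                  # weights on the fixed centers
tmp_X, tmp_w = [], []                     # temporary centers and their weights
lr, batch, delay = 0.5, 64, 10

for step in range(200):
    idx = rng.integers(0, len(X), batch)
    xb, yb = X[idx], y[idx]
    pred = k(xb, Z) @ alpha               # contribution of the fixed part
    if tmp_X:                             # plus the not-yet-projected part
        pred += k(xb, np.concatenate(tmp_X)) @ np.concatenate(tmp_w)
    resid = pred - yb
    tmp_X.append(xb)                      # SGD step adds the batch as temporary centers
    tmp_w.append(-lr * resid / batch)
    if (step + 1) % delay == 0:
        # Delayed projection: RKHS projection of the temporary part onto span of Z,
        # i.e. solve K_ZZ beta = K_Z,tmp w and absorb beta into alpha.
        extra = k(Z, np.concatenate(tmp_X)) @ np.concatenate(tmp_w)
        alpha += np.linalg.lstsq(k(Z, Z), extra, rcond=None)[0]
        tmp_X, tmp_w = [], []

print("train MSE:", float(np.mean((k(X, Z) @ alpha - y) ** 2)))
```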

Context-Scaling versus Task-Scaling in In-Context Learning

A Abedsoltan, A Radhakrishnan, J Wu, M Belkin

In-context learning · Transformers
2024

arXiv preprint arXiv:2410.12783

Transformers exhibit In-Context Learning (ICL), solving new tasks from examples in the prompt without additional training. In our work, we identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases, and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers achieve context-scaling, we first propose a significantly simplified transformer architecture without key, query, and value weights. We show that it performs ICL comparably to the original GPT-2 model on various statistical learning tasks, including linear regression and teacher-student settings. Furthermore, a single block of our simplified transformer can be viewed as a data-dependent feature map followed by an MLP. This feature map on its own is a powerful predictor that is capable of context-scaling but not of task-scaling. We show empirically that concatenating the output of this feature map with vectorized data as input to MLPs enables both context-scaling and task-scaling. This finding provides a simple setting in which to study context- and task-scaling for ICL.
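A hedged sketch of the kind of construction this points to (my own simplified instantiation with a kernel-smoother feature map; the paper's architecture differs in its details): a data-dependent feature built from the in-context examples with no trainable key, query, or value weights, whose output can be concatenated with the vectorized prompt before an MLP.

```python
import numpy as np

# Illustrative data-dependent feature map over the in-context examples,
# with no learned key/query/value weights (a stand-in, not the paper's model).

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def context_feature_map(X_ctx, y_ctx, x_query):
    """Kernel smoother over the context: similarity-weighted average of the
    [x_i; y_i] pairs. The output already 'uses' the context (context-scaling)."""
    w = softmax(X_ctx @ x_query)                    # similarity-based weights
    pairs = np.concatenate([X_ctx, y_ctx[:, None]], axis=1)
    return w @ pairs                                # shape (d + 1,)

rng = np.random.default_rng(0)
d, n_ctx = 8, 32
beta = rng.normal(size=d)                           # one linear-regression "task"
X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = X_ctx @ beta
x_q = rng.normal(size=d)

phi = context_feature_map(X_ctx, y_ctx, x_q)
# Vectorized prompt (what an MLP alone would see) concatenated with the feature map:
mlp_input = np.concatenate([X_ctx.ravel(), y_ctx, x_q, phi])
print(phi.shape, mlp_input.shape)
```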

Uncertainty estimation with recursive feature machines

D Gedon, A Abedsoltan, TB Schön, M Belkin

Uncertainty estimation · Feature Learning · Kernel methods
2024

The 40th Conference on Uncertainty in Artificial Intelligence

In conventional regression analysis, predictions are typically represented as point estimates derived from covariates. Gaussian Processes (GPs) offer a kernel-based framework that predicts and additionally quantifies the associated uncertainty. However, kernel-based methods often underperform ensemble-based decision-tree approaches in regression tasks involving tabular and categorical data. Recently, Recursive Feature Machines (RFMs) were proposed as a novel feature-learning kernel that strengthens the capabilities of kernel machines. In this study, we harness the power of RFMs in a probabilistic, GP-based approach to enhance uncertainty estimation through feature extraction within kernel methods. We employ this learned kernel for in-depth uncertainty analysis. On tabular datasets, our RFM-based method surpasses other leading uncertainty estimation techniques, including NGBoost and CatBoost ensembles. Additionally, when assessing out-of-distribution performance, we find that boosting-based methods are surpassed by our RFM-based approach.
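A sketch of the general recipe (my own toy version under assumptions; in particular the feature matrix M is taken as given here, whereas an RFM would learn it through recursive Average Gradient Outer Product updates): plug a learned-feature kernel into standard GP regression so that predictions come with variances.

```python
import numpy as np

# GP regression with a Mahalanobis-type kernel built from a learned matrix M
# (toy construction, not the paper's implementation; M is a stand-in here).

def k_M(A, B, M, bw=1.0):
    """Feature-weighted Laplacian kernel k(x, x') = exp(-||x - x'||_M / bw)."""
    diff = A[:, None, :] - B[None, :, :]
    dist = np.sqrt(np.maximum(np.einsum('nmi,ij,nmj->nm', diff, M, diff), 0))
    return np.exp(-dist / bw)

def gp_predict(X_tr, y_tr, X_te, M, noise=1e-2):
    """Standard GP predictive mean and variance with the learned kernel."""
    K = k_M(X_tr, X_tr, M) + noise * np.eye(len(X_tr))
    Ks = k_M(X_te, X_tr, M)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = k_M(X_te, X_te, M).diagonal() - np.einsum('ij,ij->j', v, v)
    return mean, var

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=200)   # only feature 0 matters
M = np.diag([1.0, .01, .01, .01, .01])                  # stand-in "learned" matrix
mean, var = gp_predict(X[:150], y[:150], X[150:], M)
print(float(np.mean((mean - y[150:]) ** 2)), float(var.mean()))
```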

On the Nyström approximation for preconditioning in kernel machines

A Abedsoltan, P Pandit, L Rademacher, M Belkin

Kernel methods · Nyström approximation · Preconditioning
2024

International Conference on Artificial Intelligence and Statistics (AISTATS 2024)

Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool for speeding up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can itself be expensive, leading to large computational and storage overheads that preclude the application of kernel methods to problems with large datasets. A Nyström approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximate preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nyström-based preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
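A toy numerical sketch of the object being analyzed (my own construction, not the paper's code): the Nyström estimate of the top of the kernel spectrum computed from a small subsample, compared against the exact spectrum; the first eigenvalue left undamped by a spectral preconditioner is what sets the usable gradient-descent step size.

```python
import numpy as np

# Compare exact top kernel eigenvalues with Nystrom estimates from a small
# subsample (toy construction). A spectral preconditioner damps the top-q
# eigendirections, so the admissible step size is governed by eigenvalue q+1.

def kernel(A, B, bw=2.0):
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0) / (2 * bw ** 2))

rng = np.random.default_rng(0)
n, s, q = 2000, 64, 8                 # s: small (think logarithmic) subsample
X = rng.normal(size=(n, 3))

# Exact spectrum of K/n (feasible only because this toy problem is small).
exact = np.linalg.eigvalsh(kernel(X, X) / n)[::-1][: q + 1]

# Nystrom estimate: eigen-decompose only the s x s kernel block of the subsample.
S = X[rng.choice(n, s, replace=False)]
approx = np.linalg.eigvalsh(kernel(S, S) / s)[::-1][: q + 1]

print("exact top eigenvalues :", np.round(exact, 4))
print("Nystrom estimates     :", np.round(approx, 4))
print("step-size gain ~ lam_1 / lam_{q+1} =", round(float(exact[0] / exact[q]), 1))
```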

On Feature Learning of Recursive Feature Machines and Automatic Relevance Determination

D Gedon, A Abedsoltan, TB Schön, M Belkin

Kernel methods · Feature Learning · Automatic relevance determination
2023

UniReps: the First Workshop on Unifying Representations in Neural Models

Feature learning is a crucial element of the performance of machine learning models. Recently, the exploration of feature learning in the context of kernel methods has led to the introduction of Recursive Feature Machines (RFMs). In this work, we connect diagonal RFMs to Automatic Relevance Determination (ARD) from the Gaussian process literature. We demonstrate that diagonal RFMs, similar to ARD, serve as a weighted covariate selection technique. However, they are trained using different paradigms: RFMs use recursive iterations of the so-called Average Gradient Outer Product (AGOP), while ARD employs maximum likelihood estimation. Our experiments show that while the learned features in both models correlate highly across various tabular datasets, this correlation is lower for other datasets. Furthermore, we demonstrate that the full RFM effectively captures correlations between covariates, and we present instances where it outperforms both ARD and the diagonal RFM.
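For a feel of what the diagonal AGOP weights look like, here is a one-step illustrative sketch (assumptions throughout; this is not the RFM package): fit a kernel regressor, average the squared gradients of the fit per coordinate, and read the result as covariate relevances, the quantity the paper relates to ARD lengthscales.

```python
import numpy as np

# One AGOP-style relevance estimate from a fitted kernel ridge regressor
# (toy sketch): diag of the Average Gradient Outer Product of the predictor.

def gauss_k(A, B, bw=2.0):
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0) / (2 * bw ** 2))

rng = np.random.default_rng(0)
n, d, bw = 500, 6, 2.0
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1]                 # only the first two covariates matter

# Kernel ridge fit: f(x) = sum_j alpha_j k(x_j, x)
K = gauss_k(X, X, bw)
alpha = np.linalg.solve(K + 1e-3 * np.eye(n), y)

# Gradients of f at the training points (closed form for the Gaussian kernel).
grads = np.zeros((n, d))
for i in range(n):
    w = alpha * gauss_k(X[i:i + 1], X, bw)[0]      # per-center contributions
    grads[i] = w @ (X - X[i]) / bw ** 2

diag_agop = (grads ** 2).mean(0)                   # diagonal AGOP
print(np.round(diag_agop / diag_agop.max(), 3))    # relevance weights per covariate
```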

Toward Large Kernel Models

A Abedsoltan, M Belkin, P Pandit

Kernel methods · Nyström approximation · Preconditioning
2023

40th International Conference on Machine Learning (ICML 2023)

Recent studies indicate that kernel machines can often perform similarly to or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines the model size is tied to the data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, a generalization of kernel machines that decouples the model from the data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD, and show scaling to model and data sizes which have not been possible with existing kernel methods. We provide a PyTorch-based implementation which can take advantage of multiple GPUs.

On emergence of clean-priority learning in early stopped neural networks

C Liu, A Abedsoltan, M Belkin

Neural networks · Early stopping · Optimization
2023

arXiv preprint arXiv:2306.02533

When random label noise is added to a training dataset, the prediction error of a neural network on a label-noise-free test dataset initially improves during early training but eventually deteriorates, following a U-shaped dependence on training time. This behaviour is believed to be a result of neural networks learning the pattern of the clean data first and fitting the noise later in training, a phenomenon that we refer to as clean-priority learning. In this study, we aim to explore the learning dynamics underlying this phenomenon. We theoretically demonstrate that, in the early stage of training, the update direction of gradient descent is determined by the clean subset of the training data, while the noisy subset has minimal to no impact, resulting in a prioritization of clean learning. Moreover, we show, both theoretically and experimentally, that as clean-priority learning proceeds, the dominance of the gradients of clean samples over those of noisy samples diminishes, eventually terminating clean-priority learning and leading to the fitting of the noisy samples.
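As a toy probe of this dynamic (my own minimal setup, not the paper's experiments), one can train a small network on data with partially flipped labels and track how the full-batch update direction aligns with the gradient of the clean subset versus the noisy subset as training proceeds.

```python
import numpy as np

# Track alignment of the full-batch gradient with clean vs. noisy subsets while
# training a tiny two-layer network on data with 20% flipped labels (toy setup).

rng = np.random.default_rng(0)
n, d, h, flip, lr = 1000, 10, 64, 0.2, 0.02
X = rng.normal(size=(n, d))
y_clean = np.sign(X[:, 0] + 0.5 * X[:, 1])
noisy = rng.random(n) < flip
y = np.where(noisy, -y_clean, y_clean)            # flipped labels on the noisy subset

W1 = rng.normal(size=(d, h)) / np.sqrt(d)
w2 = rng.normal(size=h) / np.sqrt(h)

def grad(idx):
    """Squared-loss gradient of the network, averaged over the given samples."""
    A = np.tanh(X[idx] @ W1)
    r = A @ w2 - y[idx]
    gW1 = X[idx].T @ ((r[:, None] * w2) * (1 - A ** 2)) / len(idx)
    gw2 = A.T @ r / len(idx)
    return np.concatenate([gW1.ravel(), gw2])

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
for step in range(5001):
    g = grad(np.arange(n))
    if step % 1000 == 0:
        gc, gn = grad(np.where(~noisy)[0]), grad(np.where(noisy)[0])
        print(step, "cos(full, clean) =", round(cos(g, gc), 2),
              " cos(full, noisy) =", round(cos(g, gn), 2))
    W1 -= lr * g[: d * h].reshape(d, h)           # full-batch gradient step
    w2 -= lr * g[d * h:]
```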

Benign, tempered, or catastrophic: A taxonomy of overfitting

N Mallinar, JB Simon, A Abedsoltan, P Pandit, M Belkin, P Nakkiran

Overfitting · Generalization
2022

36th Conference on Neural Information Processing Systems (NeurIPS 2022)

The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods: learning methods that are able to fit their training data perfectly. Empirically, certain interpolating methods can fit noisy training data without catastrophically bad test performance, which defies standard intuitions from statistical learning theory. Aiming to explain this, a large body of recent work has studied benign overfitting, a behavior seen in certain asymptotic settings under which interpolating methods approach Bayes-optimality, even in the presence of noise. In this work, we argue that, while benign overfitting has been instructive to study, real interpolating methods like deep networks do not fit benignly. That is, noise in the training set leads to suboptimal generalization, suggesting that these methods fall in an intermediate regime between benign and catastrophic overfitting, in which the asymptotic risk is neither Bayes-optimal nor unbounded, and the confounding effect of the noise is tempered but non-negligible. We call this behavior tempered overfitting. We first provide broad empirical evidence for our three-part taxonomy, demonstrating that deep neural networks and kernel machines fit to noisy data can be reasonably well classified as benign, tempered, or catastrophic. We then specialize to kernel (ridge) regression (KR), obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors, and we demonstrate the consequences for KR with common kernels and for trained neural networks of infinite width using experiments on natural and synthetic datasets.
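A minimal probe in the spirit of the kernel-regression part (my own toy setup, not the paper's experiments): fit kernel ridge regression on clean versus label-noised training data and compare clean-test error across ridge values; how much the noise inflates the error (not at all, by a bounded factor, or without bound as the ridge shrinks) is what the benign / tempered / catastrophic distinction describes.

```python
import numpy as np

# Kernel ridge regression on clean vs. noisy labels, evaluated on a clean test
# set across several ridge values (toy setup illustrating the object of study).

def lap_k(A, B, bw=2.0):
    dist = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)   # L1 distances
    return np.exp(-dist / bw)

rng = np.random.default_rng(0)
n, dim, sigma = 800, 3, 0.5
f_star = lambda Z: np.sin(Z.sum(1))                      # noiseless target
X_tr, X_te = rng.normal(size=(n, dim)), rng.normal(size=(2000, dim))
y_clean = f_star(X_tr)
y_noisy = y_clean + sigma * rng.normal(size=n)           # add label noise
K, K_te = lap_k(X_tr, X_tr), lap_k(X_te, X_tr)

def test_mse(ridge, y_train):
    alpha = np.linalg.solve(K + ridge * np.eye(n), y_train)
    return float(np.mean((K_te @ alpha - f_star(X_te)) ** 2))

for ridge in [1.0, 1e-2, 1e-4, 1e-8]:
    print(f"ridge={ridge:.0e}  clean-trained MSE={test_mse(ridge, y_clean):.4f}"
          f"  noisy-trained MSE={test_mse(ridge, y_noisy):.4f}")
```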

Personal Interests

🎾

Tennis

I got into tennis because of Federer's elegance, but somewhere along the way I ended up being blown away by Djokovic's grit and brilliance!

🥾

Hiking

Exploring mountain trails and nature reserves, combining physical activity with mindfulness.

📚

Reading

I'm hooked on thought-provoking reads—check my books page for a peek at some of my latest and all-time favorites.