Generalization
Classical generalization theories rely heavily on the assumption that the data are independent and identically distributed (i.i.d.). This assumption breaks down in real-world settings with sequential data that exhibit temporal dependencies. It is therefore crucial to establish sample complexity results in terms of the number of tokens rather than the number of sequences for tasks such as natural language modeling.
That is, instead of upper bounds of the form \(\text{generalization error} \leq f(N)\), we want to derive bounds \(\text{generalization error} \leq f(N, T)\), where \(N\) is the number of sequences, \(T\) is their length, and \(f\) decays in both arguments.
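As an illustrative instance (not a result from the papers below), a sequence-level i.i.d. analysis typically yields rates that improve only with the number of sequences, whereas a token-level analysis should also improve with the sequence length whenever the temporal dependence within a sequence is weak enough (e.g., fast mixing):
\[
\text{generalization error} \;\lesssim\; \sqrt{\frac{C}{N}} \quad \text{(sequence-level)}
\qquad \text{vs.} \qquad
\text{generalization error} \;\lesssim\; \sqrt{\frac{C}{NT}} \quad \text{(token-level)},
\]
where \(C\) is a placeholder for the complexity of the model class.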
Long-Context Linear System Identification in ICLR 2025
Establishes the finite-sample complexity of learning the simplest autoregressive systems: linear dynamical systems.
On the Sample Complexity of Next-Token Prediction in AISTATS 2025
Extension to discrete Markov chains that mix.
Generalization Bounds for Autoregressive Processes and In-Context Learning in submission
Extension to unbounded autoregressive processes and in-context learning.
Optimization
Generalization is not agnostic to optimization: the optimization algorithm introduces a crucial implicit bias that needs to be taken into account. This bias acts as a constraint on the training trajectory and hence effectively shrinks the complexity of the reachable hypothesis class.
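A minimal, self-contained sketch of implicit bias (a standard textbook example, not the analysis carried out in the papers below): gradient descent on an overparametrized least-squares problem, initialized at zero, converges to the minimum-norm interpolating solution rather than an arbitrary interpolant.

```python
import numpy as np

# Standard illustration of implicit bias (not the setting of the papers below):
# gradient descent on an underdetermined least-squares problem, started at zero,
# converges to the minimum-norm solution among all interpolating solutions.
rng = np.random.default_rng(0)
n, d = 20, 100                                   # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                                  # zero initialization induces the bias
lr = 1e-2
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n              # full-batch gradient descent on the squared loss

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # closed-form minimum-norm interpolant
print(np.linalg.norm(w - w_min_norm))            # ~0: GD implicitly picked the min-norm solution
```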
Incremental Learning of Sparse Attention Patterns in Transformers in submission
Gradient flow analysis of multi-head single-layer attention.
First-order ANIL provably learns representations despite overparametrization in ICLR 2024
Gradient descent analysis of meta-learning in a two-layer linear network.
Data
Understanding the properties of the data is important for two main reasons. First, in non-i.i.d. settings, characterizing the dependencies in the data is essential for deriving concentration bounds. Second, the data distribution directly shapes the training trajectory of the model, and hence the solution that optimization converges to.
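As a toy illustration of the kind of dependence structure at play (a generic two-state Markov chain, not the construction used in the papers below), the chain's marginal distribution converges geometrically to its stationary law; the speed of this convergence, i.e., mixing, is exactly what token-level concentration arguments have to control.

```python
import numpy as np

# Toy two-state Markov chain (not the construction used in the papers below):
# its marginal distribution converges geometrically to the stationary law, and
# the speed of this convergence (mixing) governs how dependent nearby tokens
# are, which is what token-level concentration bounds must control.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                  # transition matrix
pi = np.array([2/3, 1/3])                   # stationary distribution (solves pi P = pi)

mu = np.array([1.0, 0.0])                   # start deterministically in state 0
for t in range(1, 11):
    mu = mu @ P
    tv = 0.5 * np.abs(mu - pi).sum()        # total-variation distance to stationarity
    print(f"t={t:2d}  TV={tv:.5f}")         # decays like 0.7**t (second eigenvalue of P)
```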
Incremental Learning of Sparse Attention Patterns in Transformers in submission
Introduced a Markov chain whose temporal structure guides the learning order of the transformer.
Generalization Bounds for Autoregressive Processes and In-Context Learning in submission
On the Sample Complexity of Next-Token Prediction in AISTATS 2025
Introduced rephrasability, a looser version of mixing, to derive generalization bounds.
Feel free to reach out via email or social media if you have any questions!